include-dcc / include-linkml

LinkML Schema for INCLUDE DCC
https://include-dcc.github.io/include-linkml/
MIT License

Error message when using validator #197

Open lopierra opened 2 weeks ago

lopierra commented 2 weeks ago

Hi @madanucd - I'm attempting to run the validator on a test dataset:

validate-data -o ./errorlogs ./ABC-DS.csv participant

but I get the following error message:

Traceback (most recent call last):
  File "C:\Users\lopi\AppData\Local\pypoetry\Cache\virtualenvs\src-8-5-hlTp-py3.12\Scripts\validate-data", line 6, in <module>
    sys.exit(main())
             ^^^^^^
  File "C:\Users\lopi\OneDrive - The University of Colorado Denver\Documents\R_linkml\src\data_validation\cli.py", line 36, in main
    validation_function(args.input_file, args.output)
  File "C:\Users\lopi\OneDrive - The University of Colorado Denver\Documents\R_linkml\src\data_validation\validation.py", line 20, in validate_participant
    return validate_data(file_path, string_columns, validate_participant_entry, output_path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lopi\OneDrive - The University of Colorado Denver\Documents\R_linkml\src\data_validation\validation_utils.py", line 54, in validate_data
    clean_dataframe_strings(df, string_columns)
  File "C:\Users\lopi\OneDrive - The University of Colorado Denver\Documents\R_linkml\src\data_validation\validation_utils.py", line 16, in clean_dataframe_strings
    df[string_columns] = df[string_columns].map(clean_string)
                         ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lopi\AppData\Local\pypoetry\Cache\virtualenvs\src-8-5-hlTp-py3.12\Lib\site-packages\pandas\core\generic.py", line 5989, in __getattr__
    return object.__getattribute__(self, name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'DataFrame' object has no attribute 'map'. Did you mean: 'max'?

Am I doing something wrong, or is it an issue with the validator?

Not urgent - we can discuss next Tuesday at Data Modeling meeting. Thanks!

madanucd commented 1 week ago

The map call seems to run on my local machine, but map typically cannot be applied directly to a DataFrame. For consistency and correctness, we should switch to applymap or apply, which are the appropriate methods for applying functions to DataFrame elements. I will prepare a PR with this change.

madanucd commented 5 days ago

Hi Pierrette,

I wanted to let you know that DataFrame.applymap has been deprecated since pandas 2.1.0 in favor of DataFrame.map, which was added in that same release. The pandas documentation has more details. The map call worked for me because my pandas version is 2.2.0.

We could switch to applymap, which earlier pandas versions support. However, because it is deprecated, it might stop working in a future pandas release.

Could you please try updating your pandas version? This should resolve the issue.

Thank you!
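
For reference, here is a minimal sketch of a call that would tolerate both older and newer pandas versions, as an alternative to requiring an upgrade. It is not the validator's actual code: `map_elementwise` is a hypothetical helper name, and the body of `clean_string` is a stand-in, since only its name appears in the traceback.

```python
def clean_string(value):
    """Hypothetical stand-in for the validator's clean_string: trim strings only."""
    return value.strip() if isinstance(value, str) else value


def map_elementwise(frame, func):
    """Apply func to every cell of a DataFrame, whichever pandas is installed.

    DataFrame.map exists from pandas 2.1.0 onward; older releases only
    provide DataFrame.applymap. Checking the class (not the instance)
    avoids confusion with a column that happens to be named "map".
    """
    if hasattr(type(frame), "map"):
        return frame.map(func)
    return frame.applymap(func)


# Intended use in clean_dataframe_strings (assumes string_columns is a list):
#   df[string_columns] = map_elementwise(df[string_columns], clean_string)
```

Because the helper calls applymap only when map is missing, it should not trigger the deprecation warning on pandas 2.1.0 or later.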

lopierra commented 3 days ago

@madanucd I updated pandas and got a bit further with the validator. I ran it on the same file that I sent you before (ABC-DS.csv) and got the expected validation errors, but also got a TypeError. Is this expected? (Maybe due to ABC-DS having IDs that are integers instead of strings?)

validate-data -o ./errorlogs ./ABC-DS.csv participant

Validating participant data from file: ./ABC-DS.csv
Traceback (most recent call last):
  File "C:\Users\lopi\OneDrive - The University of Colorado Denver\Documents\R_linkml\src\data_validation\validate_participant.py", line 7, in validate_participant_entry
    instance = Participant(
               ^^^^^^^^^^^^
  File "C:\Users\lopi\AppData\Local\pypoetry\Cache\virtualenvs\src-8-5-hlTp-py3.12\Lib\site-packages\pydantic\main.py", line 192, in __init__
    self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 7 validation errors for Participant
participantExternalId
  Input should be a valid string [type=string_type, input_value=10001, input_type=int]
    For further information visit https://errors.pydantic.dev/2.8/v/string_type
familyId
  Input should be a valid string [type=string_type, input_value=nan, input_type=float]
    For further information visit https://errors.pydantic.dev/2.8/v/string_type
fatherId
  Input should be a valid string [type=string_type, input_value=nan, input_type=float]
    For further information visit https://errors.pydantic.dev/2.8/v/string_type
motherId
  Input should be a valid string [type=string_type, input_value=nan, input_type=float]
    For further information visit https://errors.pydantic.dev/2.8/v/string_type
siblingId
  Input should be a valid string [type=string_type, input_value=nan, input_type=float]
    For further information visit https://errors.pydantic.dev/2.8/v/string_type
otherFamilyMemberId
  Input should be a valid string [type=string_type, input_value=nan, input_type=float]
    For further information visit https://errors.pydantic.dev/2.8/v/string_type
ageAtLastVitalStatus
  Input should be a finite number [type=finite_number, input_value=nan, input_type=float]
    For further information visit https://errors.pydantic.dev/2.8/v/finite_number

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\lopi\AppData\Local\pypoetry\Cache\virtualenvs\src-8-5-hlTp-py3.12\Scripts\validate-data", line 6, in <module>
    sys.exit(main())
             ^^^^^^
  File "C:\Users\lopi\OneDrive - The University of Colorado Denver\Documents\R_linkml\src\data_validation\cli.py", line 36, in main
    validation_function(args.input_file, args.output)
  File "C:\Users\lopi\OneDrive - The University of Colorado Denver\Documents\R_linkml\src\data_validation\validation.py", line 20, in validate_participant
    return validate_data(file_path, string_columns, validate_participant_entry, output_path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lopi\OneDrive - The University of Colorado Denver\Documents\R_linkml\src\data_validation\validation_utils.py", line 55, in validate_data
    valid_count, invalid_count = validate_dataframe(df, validation_function, input_file_name=file_name,
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lopi\OneDrive - The University of Colorado Denver\Documents\R_linkml\src\data_validation\validation_utils.py", line 20, in validate_dataframe
    validation_results = df.apply(entry_validator, axis=1)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lopi\AppData\Local\pypoetry\Cache\virtualenvs\src-8-5-hlTp-py3.12\Lib\site-packages\pandas\core\frame.py", line 10374, in apply
    return op.apply().__finalize__(self, method="apply")
           ^^^^^^^^^^
  File "C:\Users\lopi\AppData\Local\pypoetry\Cache\virtualenvs\src-8-5-hlTp-py3.12\Lib\site-packages\pandas\core\apply.py", line 916, in apply
    return self.apply_standard()
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lopi\AppData\Local\pypoetry\Cache\virtualenvs\src-8-5-hlTp-py3.12\Lib\site-packages\pandas\core\apply.py", line 1063, in apply_standard
    results, res_index = self.apply_series_generator()
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lopi\AppData\Local\pypoetry\Cache\virtualenvs\src-8-5-hlTp-py3.12\Lib\site-packages\pandas\core\apply.py", line 1081, in apply_series_generator
    results[i] = self.func(v, *self.args, **self.kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\lopi\OneDrive - The University of Colorado Denver\Documents\R_linkml\src\data_validation\validate_participant.py", line 31, in validate_participant_entry
    error_details = (row['Study Code'] + "-" + row['Participant External ID'], e)
                     ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TypeError: can only concatenate str (not "int") to str
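
The final TypeError comes from concatenating row['Study Code'] with an integer row['Participant External ID'], which matches the guess about integer IDs. The sketch below shows one way the error label could be built without assuming both values are strings, plus a helper that turns the NaN values reported by pydantic into None. Both function names are hypothetical, and whether the Participant model accepts None for those fields is an assumption that depends on the schema.

```python
import math


def error_label(row):
    """Build the 'study-participant' label; the f-string converts an int ID to str."""
    return f"{row['Study Code']}-{row['Participant External ID']}"


def nan_to_none(value):
    """Replace a float NaN (how pandas represents an empty CSV cell) with None."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


# Hypothetical row; the real Study Code values in ABC-DS.csv are not shown above.
example_row = {"Study Code": "ABC-DS", "Participant External ID": 10001}
print(error_label(example_row))
```

Another option would be to read the ID columns as strings when loading the CSV, for example with the dtype argument of pandas.read_csv, so that the model receives strings in the first place.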