sibis-platform / ncanda-data-integration

This is the Data Integration, MRI, and Bioinformatics Component of the National Consortium on Alcohol and NeuroDevelopment in Adolescence (NCANDA), funded by the NIAAA.
https://www.nitrc.org/projects/ncanda-datacore
BSD 3-Clause "New" or "Revised" License

Added catch for key error and error posting #520

Status: Closed (jodahoney closed 9 months ago)

jodahoney commented 9 months ago

Link to error that it generates: https://github.com/sibis-platform/ncanda-operations/issues/14550

Post update:

ncanda@joe-pipeline-back[joe-pipeline_back_1]:/sibis-software/ncanda-data-integration/scripts/redcap$ ./import_mr_sessions -v --pipeline-root-dir /fs/ncanda-share/cases --study-id E-00140-M-1 -p
Namespace(event=None, force_update=False, force_update_stroop=False, max_days_after_visit=120, missing_only=False, no_stroop=False, no_upload=False, pipeline_root_dir='/fs/ncanda-share/cases', post_to_github=True, run_pipeline_script=None, site=None, study_id='E-00140-M-1', time_log_dir=None, verbose=True)
================================
== Setting up posting to GitHub 
Setting up GitHub...
Parsing config: None
Using Personal Access Token to authenticate.
Connected to GitHub
... ready!
Found label: [Label(name="import_mr_sessions")]
== Posting to GitHub is ready 
================================
Posting E-00140-M-1-recovery_baseline_arm_2 Subject exists in REDCap that is not apart of Arm 1.
Checking for issue: E-00140-M-1-recovery_baseline_arm_2, Subject exists in REDCap that is not apart of Arm 1.
Issue does not exist.
Created issue... See: https://api.github.com/repos/sibis-platform/ncanda-operations/issues/14549
Warning: Nothing to import !

Pre update:

ncanda@pipeline-back[pipeline_back_1]:/sibis-software/ncanda-data-integration/scripts/redcap (master)$ ./import_mr_sessions --pipeline-root-dir /fs/ncanda-share/cases -v --study-id E-00140-M-1
Namespace(event=None, force_update=False, force_update_stroop=False, max_days_after_visit=120, missing_only=False, no_stroop=False, no_upload=False, pipeline_root_dir='/fs/ncanda-share/cases', post_to_github=False, run_pipeline_script=None, site=None, study_id='E-00140-M-1', time_log_dir=None, verbose=True)
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2898, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'E-00140-M-1'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./import_mr_sessions", line 797, in <module>
    excluded_subjects = mr_sessions_redcap[ mr_sessions_redcap.index.map( lambda key: subject_data['exclude'][key[0]] == 1 ) ]['mri_xnat_sid'].dropna().tolist()
  File "/usr/local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 4797, in map
    new_values = super()._map_values(mapper, na_action=na_action)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/base.py", line 1160, in _map_values
    new_values = map_f(values, mapper)
  File "pandas/_libs/lib.pyx", line 2403, in pandas._libs.lib.map_infer
  File "./import_mr_sessions", line 797, in <lambda>
    excluded_subjects = mr_sessions_redcap[ mr_sessions_redcap.index.map( lambda key: subject_data['exclude'][key[0]] == 1 ) ]['mri_xnat_sid'].dropna().tolist()
  File "/usr/local/lib/python3.8/site-packages/pandas/core/series.py", line 882, in __getitem__
    return self._get_value(key)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/series.py", line 990, in _get_value
    loc = self.index.get_loc(label)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2900, in get_loc
    raise KeyError(key) from err
KeyError: 'E-00140-M-1'

kipohl commented 9 months ago

I do not understand why you would call

        index_values = err_row.index.get_level_values('study_id').tolist()
        index_values += err_row.index.get_level_values('redcap_event_name').tolist()

instead of getting this information directly from `subject_data`? I ask because if the subject is not in any arm, your call will fail. I am also not sure the arm should be part of the error code, given that the error is independent of the arm.

jodahoney commented 9 months ago

@kipohl

  1. I can remove the arm from the error code. I just didn't know whether we wanted a way to tell the errors apart if the same subject triggers it more than once (that is, the subject is present in more than one arm other than Arm 1), but that is an unlikely scenario.
  2. It seems to me that the cause of the error is that the study ID doesn't exist in `subject_data` but does exist in `mr_sessions_redcap`. So when the script maps every index value of `mr_sessions_redcap` onto `subject_data`, it throws the `KeyError` for the ID that is missing:
    
    >>> subject_data.loc[subject_data.index.get_level_values('study_id') == 'E-00140-M-1']
    Empty DataFrame
    Columns: [dob, siblings_enrolled___true, siblings_id1, enroll_exception___drinking, exclude]
    Index: []
    >>> mr_sessions_redcap.loc[mr_sessions_redcap.index.get_level_values('study_id') == 'E-00140-M-1']
                                     visit_date  ...  redcap_data_access_group
    study_id    redcap_event_name                    ...                          
    E-00140-M-1 recovery_baseline_arm_2  2024-01-24  ...                       NaN

    [1 rows x 7 columns]
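
The same check can be done for all subjects at once rather than one ID at a time. The following is only a sketch on toy data: the two frames are small stand-ins for the real REDCap exports, every ID and value other than `E-00140-M-1` / `recovery_baseline_arm_2` is made up, and it assumes `subject_data` is indexed by `study_id`.

```python
import pandas as pd

# Toy stand-ins for the real exports (assumption: subject_data is indexed by study_id).
subject_data = pd.DataFrame(
    {"exclude": [0, 1]},
    index=pd.Index(["A-00001-F-1", "B-00002-M-2"], name="study_id"),
)
mr_sessions_redcap = pd.DataFrame(
    {"mri_xnat_sid": ["SID_1", "SID_2", "SID_3"]},
    index=pd.MultiIndex.from_tuples(
        [
            ("A-00001-F-1", "baseline_visit_arm_1"),
            ("B-00002-M-2", "baseline_visit_arm_1"),
            ("E-00140-M-1", "recovery_baseline_arm_2"),
        ],
        names=["study_id", "redcap_event_name"],
    ),
)

# Study IDs that have at least one visit but no row in subject_data.
session_ids = mr_sessions_redcap.index.get_level_values("study_id").unique()
missing_ids = session_ids.difference(subject_data.index).tolist()
print(missing_ids)  # ['E-00140-M-1']
```

Any ID this returns would raise the `KeyError` above as soon as the lambda reaches it.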



Just to clarify my thinking, because I may be misunderstanding this: `mr_sessions_redcap` is just the filtered export of the REDCap visit log, so any subject with a visit of any kind gets pulled in.
The point of `excluded_subjects` is to go through all of those visits and check the related `subject_data` to see whether each subject is excluded. The error occurs because a subject has a visit but no `exclude` information in `subject_data`, since they have nothing in Arm 1.
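
To make the failure mode concrete, here is a rough sketch of a guarded version of the lookup. It is not the code in this PR: `find_excluded_sids` is a hypothetical helper, the frames are toy data with made-up IDs (apart from `E-00140-M-1`), and it assumes `subject_data` has a unique `study_id` index.

```python
import pandas as pd

# Toy stand-ins for the real exports; all values are illustrative.
subject_data = pd.DataFrame(
    {"exclude": [0, 1]},
    index=pd.Index(["A-00001-F-1", "B-00002-M-2"], name="study_id"),
)
mr_sessions_redcap = pd.DataFrame(
    {"mri_xnat_sid": ["SID_1", "SID_2", "SID_3"]},
    index=pd.MultiIndex.from_tuples(
        [
            ("A-00001-F-1", "baseline_visit_arm_1"),
            ("B-00002-M-2", "baseline_visit_arm_1"),
            ("E-00140-M-1", "recovery_baseline_arm_2"),
        ],
        names=["study_id", "redcap_event_name"],
    ),
)


def find_excluded_sids(mr_sessions, subjects):
    """Hypothetical helper: return (excluded XNAT SIDs, visits with no subject row)."""
    study_ids = mr_sessions.index.get_level_values("study_id")
    known = study_ids.isin(subjects.index)  # boolean mask, one entry per visit

    # Look up the exclude flag only for visits whose subject is known,
    # so a subject missing from subject_data can no longer raise KeyError.
    flags = subjects["exclude"].reindex(study_ids[known]).to_numpy()
    known_sessions = mr_sessions[known]
    excluded = known_sessions[flags == 1]["mri_xnat_sid"].dropna().tolist()

    # Visits whose subject has no subject_data row; these are what get reported.
    orphans = mr_sessions[~known].index.tolist()
    return excluded, orphans


excluded, orphans = find_excluded_sids(mr_sessions_redcap, subject_data)
print(excluded)  # ['SID_2']
print(orphans)   # [('E-00140-M-1', 'recovery_baseline_arm_2')]
```

Returning the orphaned visits separately keeps the `(study_id, redcap_event_name)` pair available for the posted issue without going back to `subject_data`, where the subject does not exist.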

kipohl commented 9 months ago

Thanks for clarifying. You are right, so let's leave it as is.