nyuad-cai / MedFuse

Missing ICD 10 codes, How are they being handled? #17

Closed FrozenWolf-Cyber closed 1 month ago

FrozenWolf-Cyber commented 4 months ago

There seem to be some ICD-10 codes in the MIMIC-IV dataset that aren't available in your yaml script, and for such cases print(f'{code} code not found') is triggered. So, does that mean those diagnoses aren't considered?

Example ICD 10 code from MIMIC-IV:
J9811 - Atelectasis

ShazaElsharief commented 1 month ago

I am assuming you are referring to the ‘create_phenotyping.py’ script. To clarify, this task is created specifically for a set of 25 chronic, mixed, and acute care conditions. Atelectasis, for example, is not one of these diagnoses. Please refer to Table 4 in the paper for the full list of phenotypes. This is how the creation of this task is handled:

    # read the diagnosis file for the patient
    diagnoses_df = pd.read_csv(os.path.join(patient_folder, "diagnoses.csv"),
                               dtype={"icd_code": str})
    # only include diagnoses for the current ICU stay
    diagnoses_df = diagnoses_df[diagnoses_df.stay_id == icustay]
    # iterate through the diagnoses in the file
    for index, row in diagnoses_df.iterrows():
        # check the 'use_in_benchmark' flag
        if row['USE_IN_BENCHMARK']:
            # read the icd code for the current diagnosis
            code = row['icd_code']
            # add the diagnosis label if the code is in the yaml definition file
            if code in code_to_group:
                group = code_to_group[code]
                group_id = group_to_id[group]
                cur_labels[group_id] = 1
            # if the code does not exist in code_to_group, it is not used in the benchmark and is not considered
            else:
                print(f'{code} code not found')

In the above code snippet, each diagnosis is read from the patient's diagnoses.csv file. The ‘use_in_benchmark’ flag for the current diagnosis, which is set to ‘1.0’ only for the 25 phenotypes, is then checked. If it is set to ‘0.0’, the diagnosis is not one of the 25 phenotypes and is not considered. So if 'Atelectasis' is present with the ‘use_in_benchmark’ flag set to ‘0.0’, it is skipped and the yaml definitions file is not checked at all.
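The snippet uses two lookups, code_to_group and group_to_id, without showing where they come from. Below is a minimal sketch of how such lookups could be built. It assumes the definitions file parses into a mapping from phenotype name to an entry with a 'codes' list. That layout, the helper name build_lookups, the group names, and the codes are illustrative placeholders, not values taken from the repository.

```python
# Hypothetical stand-in for the parsed yaml definitions file.
# The real file's layout and contents may differ.
definitions = {
    "phenotype_a": {"use_in_benchmark": True, "codes": ["A001", "A002"]},
    "phenotype_b": {"use_in_benchmark": True, "codes": ["B001"]},
}


def build_lookups(definitions):
    """Return (code_to_group, group_to_id) for a definitions mapping."""
    code_to_group = {}
    for group, entry in definitions.items():
        for code in entry["codes"]:
            # Keep the first group seen for a code; reject conflicting reuse.
            existing = code_to_group.setdefault(code, group)
            if existing != group:
                raise ValueError(f"{code} listed under {existing} and {group}")
    # Sort group names so each label index is stable across runs.
    group_to_id = {group: i for i, group in enumerate(sorted(definitions))}
    return code_to_group, group_to_id


code_to_group, group_to_id = build_lookups(definitions)
```

With lookups like these, a code that belongs to no listed group (such as J9811) is simply absent from code_to_group.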

However, the ‘use_in_benchmark’ column is empty in some diagnosis files. When the flag is read as ‘nan’, the flag check still passes, because NaN is truthy in Python, so a further check is needed. Here, the diagnosis is checked against the codes in the yaml definitions file, as all 25 phenotypes and their codes are there. If the code exists in the code_to_group dictionary, the label is added. Otherwise, the code is not used in the benchmark as part of this task and 'code not found' is printed. So to clarify, all ICD-10 codes for the 25 phenotypes considered are present in the yaml definitions file, but others may not be. Please check the ‘use_in_benchmark’ flag in ‘icd_9_10_definitions_2.yaml’ to identify them.
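The three cases (flag 1.0, flag 0.0, flag missing) can be traced with plain Python floats, since the check relies on truthiness: 1.0 and NaN are truthy, while 0.0 is falsy. This is only a sketch. The function label_codes, the toy lookups, and the placeholder codes A001 and B001 are assumptions made for illustration and are not part of the repository.

```python
import math

# Hypothetical lookups; the real ones are built from the yaml definitions file.
code_to_group = {"A001": "phenotype_a", "B001": "phenotype_b"}
group_to_id = {"phenotype_a": 0, "phenotype_b": 1}


def label_codes(rows):
    """Apply the flag check and code lookup to (icd_code, flag) pairs."""
    labels = [0] * len(group_to_id)
    not_found = []
    for code, flag in rows:
        # 0.0 is falsy, so flagged-out diagnoses are skipped outright.
        # 1.0 and NaN are both truthy, so they reach the lookup.
        if flag:
            if code in code_to_group:
                labels[group_to_id[code_to_group[code]]] = 1
            else:
                # The original script prints '<code> code not found' here.
                not_found.append(code)
    return labels, not_found


rows = [
    ("A001", 1.0),        # flag set: code is looked up and labelled
    ("J9811", 0.0),       # flag is 0.0: skipped, lookup never runs
    ("J9811", math.nan),  # flag missing: lookup runs, code is absent
    ("B001", math.nan),   # flag missing: lookup runs, code is present
]
labels, not_found = label_codes(rows)
```

In this toy run, labels is [1, 1] and not_found is ['J9811']. The J9811 row with flag 0.0 is dropped silently, and only the J9811 row with a missing flag reaches the 'code not found' branch.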