lrsoenksen / HAIM

This repository contains the code to replicate the data processing, modeling and reporting of our Holistic AI in Medicine (HAIM) Publication in Nature Machine Intelligence (Soenksen LR, Ma Y, Zeng C et al. 2022).
Apache License 2.0
104 stars 27 forks source link

Some files in the repo are missing in MIMIC datasets? #11

Closed AliRasekh closed 1 year ago

AliRasekh commented 1 year ago

Hello,

First, Thanks for your great project and code.

While running the 1_Generate_HAIM-MIMIC-MM file. I got this error:

FileNotFoundError: [Errno 2] No such file or directory: './data/HAIM/physionet/files/mimiciv/1.0/mimic-cxr-jpg/2.0.0/mimic-cxr-2.0.0-jpeg-txt.csv'

The path and other things are correct and I have downloaded and extracted the following datasets as mentioned: https://physionet.org/content/mimiciv/1.0/ https://physionet.org/content/mimic-cxr-jpg/2.0.0/

But in the second link, there is no file named "mimic-cxr-2.0.0-jpeg-txt.csv". How can I access that? And is the MIMIC-CXR version the same that you ran your code on it?

Thanks in advance

lrsoenksen commented 1 year ago

UPDATE (Jun. 12, 2023) For the publication, our team generated the file 'mimic-cxr-2.0.0-jpeg-txt.csv' by compiling an early-release version of participant notes and text from the images in CXR corresponding to MIMIC-IV. We wanted to add these to this repository, but the data policy from PhysioNet.org states we cannot directly share this compiled data via Git Hub. Physionet is the only one with permission to do so or subsets of the data. This means users need to generate their own mimic-cxr-2.0.0-jpeg-txt.csv based on the released notes and CXR files from Physionet.org once all notes are released. The dataset structure can be inferred from the code. As of June 12, 2023, Physionet has not fully released these notes, but it is likely they are planning to do so as part of their full release of MIMIC-IV. We are very sorry for any inconvenience this may cause.

AliRasekh commented 1 year ago

Thanks for your response.

Then one question. If I comment the following lines, does it cause any problem in generating the multimodal data?

#     # Add paths and info to images in cxr
    df_mimic_cxr_jpg =pd.read_csv(core_mimiciv_path + 'mimic-cxr-jpg/2.0.0/mimic-cxr-2.0.0-jpeg-txt.csv')
    df_cxr = pd.merge(df_mimic_cxr_jpg, df_cxr, on='dicom_id')
    # Save
    df_cxr.to_csv(core_mimiciv_path + 'mimic-cxr-jpg/2.0.0/mimic-cxr-2.0.0-metadata.csv', index=False)
    #Read back the dataframe
lrsoenksen commented 1 year ago

It should not cause problems, but the input data would not be the same as in the publication. If you don't use the notes of the images, then you may need to comment downstream lines during embedding generation so that the training algorithm is not expecting input text from there. Additionally, you can just make a mimic-cxr-2.0.0-jpeg-txt.csv all with the same text (empty notes).