lrsoenksen / HAIM

This repository contains the code to replicate the data processing, modeling, and reporting of our Holistic AI in Medicine (HAIM) publication in Nature Machine Intelligence (Soenksen LR, Ma Y, Zeng C, et al. 2022).
Apache License 2.0

Number of training samples #5

Closed ChantalMP closed 1 year ago

ChantalMP commented 1 year ago

Hi,

I'm currently trying to generate your dataset; however, the number of embeddings I get does not match yours. I managed to create all 34537 pickle files. Then, as I understand it, in "Generate Embeddings from Pickle Files" you iterate over all CXR images available within a patient stay and generate a row in the embedding CSV for each image. For me this leads to over 125,000 rows, but the embedding file you provided only has 45050 rows (which also matches the number of samples for mortality and discharge prediction mentioned in your paper).
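My understanding of that step, roughly (a minimal sketch; the key name below is a placeholder of mine, not a name from your repo):

```python
import glob
import pickle

n_rows = 0
for path in glob.glob('pickle_files/*.pkl'):  # one pickle per HAIM ID (34537 files)
    with open(path, 'rb') as f:
        patient = pickle.load(f)
    # one embedding row per CXR image available within the stay
    n_rows += len(patient['cxr_images'])      # 'cxr_images' is a placeholder key

print(n_rows)  # for me this is over 125,000, not 45,050
```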

Do you have any idea what the issue could be? For example, do you use all images of a patient as a single sample, including each view from the same study?

Thanks a lot in advance!

lrsoenksen commented 1 year ago

We do use all images from a patient up to the timestamp at which we want to make a prediction (so multi-image). We iterate over every HAIM ID, which is a combination of Patient ID, Stay ID, and Hospitalization ID; that is why there are 45K rows.
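Schematically (illustrative names only, not the exact ones in the repo), the unit of iteration is the HAIM ID, not the image:

```python
# Illustrative sketch: one embedding row per HAIM ID, fusing all
# CXR images acquired up to the prediction cut-off time.
def embed_stay(patient, cutoff):
    images = [im for im in patient['cxr_images'] if im['time'] <= cutoff]
    return fuse(images)  # 'fuse' stands in for the multimodal fusion step
```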

ChantalMP commented 1 year ago

Thanks for your fast reply!

Would you happen to have the code where you do this?

The provided code iterates over all images per patient, and the multi-image embedding seems to be computed from the current image and all previous ones. This still leads to one embedding per image, which is much more than 45k.

I also noticed that the provided embedding CSV with 45k rows only has around 8000 unique haim_ids. When I run the code on the 34K pickle files, there are around 19k HAIM patients with CXR images, leading to an embedding file with 125K rows and 19K unique haim_ids. I assume this causes the discrepancy. Are any of the 34K HAIM patients excluded for reasons other than not having an x-ray image during the stay?
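For reference, this is how I counted (assuming the id column in the provided file is named haim_id):

```python
import pandas as pd

emb = pd.read_csv('cxr_ic_fusion_1103.csv')  # the provided embedding file
print(len(emb))                  # ~45k rows
print(emb['haim_id'].nunique())  # ~8000 unique haim_ids
```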

lrsoenksen commented 1 year ago

Hi Chantal,

So we put all the code that we developed over the entirety of the project in the repo, so anything we did should already be there. There may have been some bugs, but to our knowledge all sampling and data processing was done quite carefully.

I believe that for many analyses we only process embeddings for HAIM IDs where there is at least one image, so that the multi-modality evaluation makes sense. Also, in many instances (but not all), a new image basically indicates a new time cut-off at which to produce an embedding. The embedding file that we provided has 45050 rows because, downstream, I believe we sub-select for the clinical tasks we were interested in for the paper. For example, for 48-hr mortality we excluded all embeddings and rows that corresponded to patients staying less than 48 hrs in that visit, which by itself may cut a good number of rows. Patients with inconclusive diagnoses are also potentially excluded for the vision-heavy tasks. I'm sure you will find that the number of embeddings we share makes sense once you go through the paper, code, and code notes carefully.

We are currently working on a much-updated version of HAIM that we are rebuilding from the ground up, so we will not be supporting this specific code much moving forward. I also encourage you to take our code, make it your own (in a way), and improve it with the logic you think is most appropriate.
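As a rough illustration of the kind of downstream filter I mean (the column name below is made up for the example, not taken from our code):

```python
import pandas as pd

embeddings = pd.read_csv('cxr_ic_fusion_1103.csv')
# Keep only rows from stays of at least 48 hours, so that the
# 48-hr mortality label is well defined; 'los_hours' is illustrative.
eligible = embeddings[embeddings['los_hours'] >= 48]
```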

Warmest,

malika996 commented 1 year ago

Hi! I have the same problem, with a slight difference: the number of generated pickle files is 34539, not 34537, which differs from the paper but matches what you have in the corresponding notebook. So the raw number of haim_ids without additional sub-selection is 34539. As I understand it, the uniqueness of a haim_id is determined by stay_id.

Using the scripts to create the embedding file, I get a file with 125628 rows: 13816 unique patients and 19483 unique haim_ids.

Maybe there are indeed other criteria that you also used to sub-select patients? Maybe the is_included flag in Generate_Embeddings.py is not always True?

Additionally, 'cxr_ic_fusion_1103.csv' contains 45052 rows (not counting the header), but some of them are corrupted. read_csv raises ParserError: Error tokenizing data. C error: Expected 6405 fields in line 45052, saw 7173, which forced me to use on_bad_lines='skip' to get 45050 rows. Could this have anything to do with why we're getting different embedding file sizes?
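Concretely, this is what I had to run (the on_bad_lines argument requires pandas >= 1.3):

```python
import pandas as pd

# Without on_bad_lines='skip', this raises the ParserError on line 45052.
df = pd.read_csv('cxr_ic_fusion_1103.csv', on_bad_lines='skip')
print(len(df))  # 45050
```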

Thank you!

ChantalMP commented 1 year ago

Hi!

I get exactly the same results as you (34539 pickles; 125628 rows with 19483 unique HAIM ids), and the ParserError as well.

Did you also get duplicated img_ids in the resulting files, i.e., patients with different haim_ids but the same dicom id? Do you have any idea what caused this?

malika996 commented 1 year ago

The same thing for me.

There are some rows in 'cxr_ic_fusion_1103.csv' with duplicated img_id but different haim_id. In addition, some rows are fully identical. After removing duplicates based on img_id only, the provided embedding data contains 35605 samples; after removing duplicates based on all available columns, it has 45034 samples.
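For reference, the counts above come from the following (assuming the image id column is named img_id):

```python
import pandas as pd

df = pd.read_csv('cxr_ic_fusion_1103.csv', on_bad_lines='skip')
print(len(df.drop_duplicates(subset=['img_id'])))  # 35605 samples
print(len(df.drop_duplicates()))                   # 45034 samples
```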

lrsoenksen commented 1 year ago

Hi All,

We will look into it. When we did the processing on a supercluster at MIT, we had some error-handling functions in case the server hung during the extraction of embeddings or the training of models as it passed through samples. That may be the source of the discrepancy (if any). While we investigate the issue, we recommend making the code your own: download the data from PhysioNet and use the pieces of our code that you find useful to generate embeddings and train models. Thank you so much for your patience.