ibrahimethemhamamci / CT-CLIP

Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography

AccessionNo. missing in the reports csv file #34

Open jackhu-bme opened 1 month ago

jackhu-bme commented 1 month ago

Currently I am trying to reproduce your train-from-scratch CLIP results.

I have pre-processed the volume data downloaded from the Hugging Face repo (using the provided data processing scripts) and downloaded all the CSV files. However, when I run scripts/run_train.py, I get an error about a missing 'AccessionNo' value.

  File "/home/***/baselines/CT-CLIP/scripts/CTCLIPTrainer.py", line 188, in __init__
    self.ds = CTReportDataset(data_folder=data_train, csv_file=reports_file_train)
  File "/home/***/baselines/CT-CLIP/scripts/data.py", line 43, in __init__
    self.accession_to_text = self.load_accession_text(csv_file)
  File "/home/***/baselines/CT-CLIP/scripts/data.py", line 66, in load_accession_text
    accession_to_text[row['AccessionNo']] = row["Findings_EN"],row['Impressions_EN']

(*** stands in for my username.)

I have double checked the files provided in the huggingface repo, including

  1. the report CSV files (both train and val), where "Findings_EN" and "Impressions_EN" exist as column names, but "AccessionNo" does not.
  2. the metadata file and the label file, where no relevant column name exists.
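A quick way to confirm which columns a downloaded CSV actually has is to read only its header row. This is a minimal sketch using the standard library; the helper name `missing_columns` is hypothetical and not part of the CT-CLIP repo, and the three column names are the ones that data.py reads in the traceback above.

```python
import csv


def missing_columns(csv_path, required):
    """Return the required column names that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []  # None for an empty file
    return [name for name in required if name not in header]


# Columns accessed by load_accession_text in scripts/data.py (per the traceback).
REQUIRED = ["AccessionNo", "Findings_EN", "Impressions_EN"]
```

Calling `missing_columns("train_reports.csv", REQUIRED)` on the downloaded file (path shown is a placeholder) lists whichever of the three names the header lacks.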

So I guess that train_reports.csv, downloaded from the Hugging Face repo "https://huggingface.co/datasets/ibrahimhamamci/CT-RATE/tree/main/dataset/radiology_text_reports", is missing the "AccessionNo" column.

Is this guess correct? Or did I miss something?

Thanks a lot for your time, and thank you for open-sourcing the code and datasets.

jackhu-bme commented 1 month ago

I see that the AccessionNo seems to be the index number of patients. I am currently trying to fix the bug by converting the volume name to AccessionNo.

I referred to your code in /home/***/baselines/CT-CLIP/scripts/data.py, line 77: accession_number = nii_file.split("/")[-1]

It seems reasonable to recover the missing AccessionNo from the nii file name.
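The workaround could look roughly like the sketch below. It is not the maintainers' fix. It assumes the reports CSV has a volume-name column (called "VolumeName" here, which is a guess) whose values equal the nii file's base name, and it falls back to that column only when "AccessionNo" is absent. The function names mirror those in the traceback but are standalone rewrites.

```python
import csv
import os


def load_accession_text(csv_file, key_columns=("AccessionNo", "VolumeName")):
    """Map a per-volume key to (findings, impressions).

    Uses the first column in key_columns that exists in the CSV header.
    "VolumeName" is an assumed column name, not confirmed by the repo.
    """
    accession_to_text = {}
    with open(csv_file, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        key = next((c for c in key_columns if c in header), None)
        if key is None:
            raise KeyError(f"none of {key_columns} found in {csv_file}")
        for row in reader:
            accession_to_text[row[key]] = (row["Findings_EN"], row["Impressions_EN"])
    return accession_to_text


def accession_from_path(nii_file):
    """Same idea as nii_file.split("/")[-1] in data.py: keep only the file name."""
    return os.path.basename(nii_file)
```

If the pre-processing scripts change the file extension, the key would also need to be normalized on both sides before the lookup.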

If this does not work or is wrong, please tell me. By the way, this is a bug that should be fixed; otherwise it will cause trouble for others using this dataset and repo.