ibrahimethemhamamci / CT-CLIP

Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography

Labels for supervised training #20

Open LangDaniel opened 5 months ago

LangDaniel commented 5 months ago

Thanks for providing this awesome dataset.

Looking at the train_predicted_labels.csv and valid_predicted_labels.csv files from Hugging Face, I get class counts that differ from those presented in Supplementary Table 1 of the paper.

For example, for Medical material:

[image: ctrate]

which differs from the 2387 and 130 in the paper.

I was also wondering if you could provide the mapping from RAD-ChestCT labels to CT-RATE labels?

sezginerr commented 5 months ago

Hi @LangDaniel,

These CSV files contain one row per reconstruction. It is not correct to count the label of each reconstruction as a new pathology, because different reconstructions of the same scan are not statistically independent samples (e.g. the train_1_a_1.nii.gz row and the train_1_a_2.nii.gz row should be counted as 1 pathology, not 2). I will nevertheless check the values in the paper, but they should be per scan, not per reconstruction. So if you read the CSV files and count the rows in a pathology column, the total number of pathologies is expected to be higher than in the paper. I hope this clarifies it.
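The per-scan counting described above can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes the file name sits in a column called `VolumeName` (a hypothetical name here) and that the scan ID is the file name with the trailing reconstruction index removed, as the train_1_a_1 / train_1_a_2 example suggests. The rows are toy data.

```python
from collections import defaultdict


def scan_id(volume_name: str) -> str:
    """Strip the extension and the trailing reconstruction index.

    Assumed naming: train_1_a_1.nii.gz -> train_1_a
    """
    stem = volume_name.split(".", 1)[0]   # "train_1_a_1"
    return stem.rsplit("_", 1)[0]         # "train_1_a"


def count_positive_scans(rows, label, name_key="VolumeName"):
    """Count scans (not reconstructions) with at least one positive row."""
    positive = defaultdict(bool)
    for row in rows:
        sid = scan_id(row[name_key])
        positive[sid] = positive[sid] or float(row[label]) > 0
    return sum(positive.values())


# Toy rows standing in for csv.DictReader output; the values are made up.
rows = [
    {"VolumeName": "train_1_a_1.nii.gz", "Medical material": "1"},
    {"VolumeName": "train_1_a_2.nii.gz", "Medical material": "1"},
    {"VolumeName": "train_2_a_1.nii.gz", "Medical material": "0"},
]

per_row = sum(float(r["Medical material"]) > 0 for r in rows)   # counts reconstructions
per_scan = count_positive_scans(rows, "Medical material")       # counts scans
print(per_row, per_scan)
```

On the toy rows, the row-level count is larger than the scan-level count because both reconstructions of train_1_a carry the same label.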

LangDaniel commented 5 months ago

Hi @sezginerr,

thanks for clarifying! Unfortunately, I still could not reproduce the numbers; I checked at the patient level as well as the lesion level. Is there anything else I may be missing?

It would be great if you could share the RAD-ChestCT to CT-RATE label mapping, as it is not completely obvious.

sezginerr commented 5 months ago

Hi @LangDaniel,

At one point, we improved the report classifier model, and we might have forgotten to update the values in the supplementary table. The train-validation split is nevertheless still the same as in the preprint, so you can use the labels in the Hugging Face repository. We plan to update the preprint in the next couple of weeks, and I will update the values as well.

I forgot to reply to this: "I was also wondering if you could provide the mapping from RAD-ChestCT labels to CT-RATE labels?" I believe you are referring to the external validation part of the preprint. RAD-ChestCT does not have separate aortic and coronary calcification labels; it has a single calcification label. So we merged our labels for coronary and aortic calcifications into one calcification label: if either is 1, the label is 1; if both are 0, the label is 0. Regarding the model output: for zero-shot and CT-VocabFine, we use the "calcification" input. For supervised and CT-LiPro, we merge the outputs of the model in the same way: if either output is 1, the calcification output is 1; if both are 0, it is 0. RAD-ChestCT also has no Mosaic attenuation pattern label, so we do not calculate scores for it in the external validation.
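The merge rule above is a logical OR over the two calcification labels. A minimal sketch, assuming labels are stored as 0/1 values in a dict keyed by the CT-RATE label names; the merged key name `Calcification` and the helper `to_external_validation` are illustrative and not taken from the repository:

```python
MERGED = ("Arterial wall calcification", "Coronary artery wall calcification")
DROPPED = ("Mosaic attenuation pattern",)  # no RAD-ChestCT counterpart


def to_external_validation(labels: dict) -> dict:
    """Merge the two calcification labels and drop the unscored label."""
    out = {k: v for k, v in labels.items() if k not in MERGED + DROPPED}
    # 1 if either calcification label is 1, otherwise 0
    out["Calcification"] = int(any(labels[k] == 1 for k in MERGED))
    return out


# Toy label vector; the values are made up.
example = {
    "Arterial wall calcification": 0,
    "Coronary artery wall calcification": 1,
    "Mosaic attenuation pattern": 1,
    "Cardiomegaly": 0,
}
merged = to_external_validation(example)
print(merged)
```

The same function could be applied to the binarized outputs of the supervised and CT-LiPro models; how those outputs are thresholded is not stated in this thread.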

LangDaniel commented 5 months ago

Hi @sezginerr,

thanks for clarifying.

Concerning the CT-RATE to RAD-ChestCT label conversion: the calcification and Mosaic attenuation pattern labels are clear to me, as this is stated in the paper. However, there are CT-RATE labels that don't have a simple equivalent in the RAD-ChestCT dataset, e.g. Medical material. So I was wondering if you could provide the conversion scheme you used. (Sorry if this is provided and I missed it.) For me, it currently looks like the following:

CTRATE_to_RADChestCT = {
    'Medical material': '',                                 #? 
    'Arterial wall calcification': 'calcification',
    'Cardiomegaly': 'cardiomegaly',
    'Pericardial effusion': 'pericardial_effusion',
    'Coronary artery wall calcification': 'calcification',
    'Hiatal hernia': 'hernia',
    'Lymphadenopathy': 'lymphadenopathy',
    'Emphysema': 'emphysema',
    'Atelectasis': 'atelectasis',
    'Lung nodule': 'nodule',
    'Lung opacity': 'opacity',
    'Pulmonary fibrotic sequela': 'pulmonary_edema',         #?
    'Pleural effusion': 'pleural_effusion',
    'Mosaic attenuation pattern': None,
    'Peribronchial thickening': '',                          #?
    'Consolidation': 'consolidation',
    'Bronchiectasis': 'bronchiectasis',
    'Interlobular septal thickening': 'septal_thickening'    #?    
}

sezginerr commented 5 months ago

Hi @LangDaniel, sorry for the delay; Ibrahim and I were out of the office. I understand the question now. It has been a while since we arranged the RAD-ChestCT data, so I had forgotten that we did this. Here is the mapping dict:

mapping_dict = {
    "Medical material": ["pacemaker_or_defib", "catheter_or_port", "hardware", "stent", "suture", "staple", "chest_tube", "tracheal_tube", "gi_tube", "breast_implant", "heart_valve_replacement", "clip"],
    "Arterial wall calcification": ["calcification", "scattered_calc"],
    "Cardiomegaly": ["cardiomegaly"],
    "Pericardial effusion": ["pericardial_effusion"],
    "Coronary artery wall calcification": ["calcification", "scattered_calc"],
    "Hiatal hernia": ["hernia"],
    "Lymphadenopathy": ["lymphadenopathy"],
    "Emphysema": ["emphysema"],
    "Atelectasis": ["atelectasis"],
    "Lung nodule": ["nodule", "nodulegr1cm", "scattered_nod"],
    "Lung opacity": ["opacity"],
    "Pulmonary fibrotic sequela": ["fibrosis"],
    "Pleural effusion": ["pleural_effusion"],
    "Mosaic attenuation pattern": ["all_zeros"],
    "Peribronchial thickening": ["bronchial_wall_thickening"],
    "Consolidation": ["consolidation"],
    "Bronchiectasis": ["bronchiectasis"],
    "Interlobular septal thickening": ["septal_thickening"]
}

Please let me know if you have any further questions.
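
One way to apply a mapping of this shape is sketched below. It assumes that a CT-RATE label is positive when any of the listed RAD-ChestCT abnormalities is positive, and that `all_zeros` is a placeholder that always yields 0; neither rule is spelled out in the thread, and `to_ctrate_labels` is a hypothetical helper. Only an excerpt of the dict is repeated to keep the example short, and the input row is toy data.

```python
def to_ctrate_labels(rad_row: dict, mapping: dict) -> dict:
    """Derive CT-RATE style labels from RAD-ChestCT abnormality flags.

    Assumption: a label is 1 if ANY mapped abnormality is 1.
    Names missing from rad_row (such as "all_zeros") count as 0.
    """
    return {
        ctrate_label: int(any(rad_row.get(name, 0) == 1 for name in rad_names))
        for ctrate_label, rad_names in mapping.items()
    }


# Excerpt of the mapping dict above.
mapping_excerpt = {
    "Medical material": ["stent", "clip"],
    "Lung nodule": ["nodule", "nodulegr1cm", "scattered_nod"],
    "Mosaic attenuation pattern": ["all_zeros"],
}

# Toy RAD-ChestCT row; the values are made up.
rad_row = {"stent": 0, "clip": 1, "nodule": 0, "nodulegr1cm": 0, "scattered_nod": 0}

converted = to_ctrate_labels(rad_row, mapping_excerpt)
print(converted)
```

Treating missing names as 0 keeps the `all_zeros` entry harmless, but it would also hide a misspelled column name, so a stricter version might raise on unknown names instead.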