TCGA-LUAD Subtype Labels

WinterPan2017 commented 4 months ago

Hello, could you please share the id-subtype pairs used in the Zero-Shot Histological Subtype Classification?

ls1rius commented 4 months ago

OK, the data has been uploaded to WSI_FiVE/gpt_preprocess/luad_tcga_pub_clinical_data.tsv.

WinterPan2017 commented 4 months ago

Thanks for the reply. I noticed that 28 cases are not included in DSMIL's pre-computed features (listed below), which might result in a mismatch in the test set. The table below counts the samples and slides in each subtype (each sample may have multiple slides). Is there any issue with this?

Sample ID
TCGA-05-4384-01
TCGA-05-4389-01
TCGA-05-4410-01
TCGA-05-4425-01
TCGA-05-5420-01
TCGA-05-5423-01
TCGA-05-5715-01
TCGA-49-4486-01
TCGA-50-5049-01
TCGA-50-5051-01
TCGA-50-5072-01
TCGA-50-5932-01
TCGA-50-5933-01
TCGA-50-5935-01
TCGA-50-5936-01
TCGA-50-5941-01
TCGA-50-5944-01
TCGA-50-6595-01
TCGA-78-7143-01
TCGA-78-7162-01
TCGA-91-6828-01
TCGA-91-6829-01
TCGA-91-6835-01
TCGA-91-6840-01
TCGA-91-6847-01
TCGA-91-6849-01
TCGA-91-7771-01

Subtype name	Samples (paper)	Samples (in DSMIL features)	Slides (in DSMIL features)
Adenocarcinoma Mixed Subtype	54	47	51
Bronchioloalveolar Carcinoma Nonmucinous	10	8	8
Acinar Adenocarcinoma	5	4	4
Mucinous Adenocarcinoma	1	1	1
Mucinous (Colloid) Adenocarcinoma	4	3	7
Bronchioloalveolar Carcinoma Mucinous	3	3	3
Micropapillary Adenocarcinoma	3	2	2
Papillary Adenocarcinoma	8	8	17
Not Otherwise Specified (NOS)	136	120	158
NA	7	7	7
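
A count like the one above can be reproduced with a short script. The following is only a minimal sketch: it assumes the labels are available as a mapping from sample ID to subtype, and that the first 15 characters of a slide name give the sample ID (e.g. `TCGA-05-4384-01Z-00-DX1` maps to `TCGA-05-4384-01`). The function names and the toy IDs are hypothetical and are not taken from WSI_FiVE or DSMIL.

```python
from collections import Counter, defaultdict


def sample_id_from_slide(slide_name):
    # Assumption: the first 15 characters of a TCGA slide barcode
    # identify the sample, e.g. "TCGA-05-4384-01Z-00-DX1" -> "TCGA-05-4384-01".
    return slide_name[:15]


def coverage_by_subtype(labels, slide_names):
    """Compare labelled samples against the slides that have features.

    labels      -- dict mapping sample ID -> subtype name
    slide_names -- iterable of slide names that have pre-computed features

    Returns (missing, table): the sorted sample IDs without any slide, and
    a dict subtype -> (labelled samples, samples with slides, slides).
    """
    slides_per_sample = Counter(sample_id_from_slide(s) for s in slide_names)
    table = defaultdict(lambda: [0, 0, 0])
    missing = []
    for sample_id, subtype in labels.items():
        row = table[subtype]
        row[0] += 1  # every labelled sample
        n_slides = slides_per_sample.get(sample_id, 0)
        if n_slides:
            row[1] += 1  # sample has at least one slide with features
            row[2] += n_slides
        else:
            missing.append(sample_id)
    return sorted(missing), {k: tuple(v) for k, v in table.items()}


# Toy data, made up for illustration only (not real TCGA cases).
labels = {
    "TCGA-AA-0001-01": "Papillary Adenocarcinoma",
    "TCGA-AA-0002-01": "Papillary Adenocarcinoma",
    "TCGA-AA-0003-01": "Acinar Adenocarcinoma",
}
slides = [
    "TCGA-AA-0001-01Z-00-DX1",
    "TCGA-AA-0001-01Z-00-DX2",
    "TCGA-AA-0003-01Z-00-DX1",
]

missing, table = coverage_by_subtype(labels, slides)
print("missing samples:", missing)
for subtype, counts in table.items():
    print(subtype, counts)
```

With real data, `labels` would be built from the uploaded TSV and `slides` from the names of the feature files; both steps depend on file layouts that are not described in this thread.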

ls1rius commented 4 months ago

The data comes from a public database; the link is provided in the paper. As of now, all data with subtype labels is contained in this file. Samples that are not included in DSMIL's pre-computed features cannot be evaluated with those features, so you need to remove them. Alternatively, you can follow the DSMIL method and train a model yourself to solve this problem.
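
Removing the uncovered samples could look roughly like the sketch below. It assumes a tab-separated label file with a `Sample ID` column; the `Subtype` column name, the function name, and the toy rows are hypothetical and should be adapted to the actual header of the TSV.

```python
import csv
import io


def keep_samples_with_features(tsv_text, available_samples, id_column="Sample ID"):
    """Return the label rows whose sample ID has pre-computed features.

    tsv_text          -- contents of a tab-separated label file with a header row
    available_samples -- set of sample IDs that have features
    id_column         -- header of the sample ID column (assumed name)
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row[id_column] in available_samples]


# Toy label file, made up for illustration only.
toy_tsv = (
    "Sample ID\tSubtype\n"
    "TCGA-AA-0001-01\tPapillary Adenocarcinoma\n"
    "TCGA-AA-0002-01\tPapillary Adenocarcinoma\n"
    "TCGA-AA-0003-01\tAcinar Adenocarcinoma\n"
)
available = {"TCGA-AA-0001-01", "TCGA-AA-0003-01"}

kept = keep_samples_with_features(toy_tsv, available)
for row in kept:
    print(row["Sample ID"], row["Subtype"])
```

For a file on disk, the text can be read with `open(path, newline="")` and passed in unchanged.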

WinterPan2017 commented 4 months ago

I appreciate your patience. So the reported results are based on your own pre-computed features, which contain all samples in the provided file, rather than on DSMIL's pre-computed features?

ls1rius commented 4 months ago

The other experiments used DSMIL's features. For this experiment, DSMIL's features lack some of the data, so we trained the features ourselves.

WinterPan2017 commented 4 months ago

I noticed that the missing samples are not included in the pretraining-related CSV files, e.g. LUAD_LUSC_data_desc_reid_delval.csv and TCGA_Reports.csv. Is it correct to say that only the Zero-Shot Histological Subtype Classification experiment is based on your own features, while the pretraining and the other downstream tasks use DSMIL's features?

ls1rius commented 4 months ago

yes

WinterPan2017 commented 4 months ago

Thanks a lot.
