mahmoodlab / HIPT

Hierarchical Image Pyramid Transformer - CVPR 2022 (Oral)

Labels for cancer subtyping #16

Closed: juanigp closed this issue 2 years ago

juanigp commented 2 years ago

Hi, first of all thank you for the amazing work and codebase. I see that the data folds for the subtype classification task are provided in HIPT/2-Weakly-Supervised-Subtyping/splits/10foldcv_subtype.

However, the .csv files do not include the labels. I retrieved the diagnoses for all the .svs files in the BRCA cohort using the GDC API, and these are the distinct values:

'Adenoid cystic carcinoma', 'Apocrine adenocarcinoma', 'Basal cell carcinoma, NOS', 'Carcinoma, NOS', 'Cribriform carcinoma, NOS', 'Infiltrating duct and lobular carcinoma', 'Infiltrating duct carcinoma, NOS', 'Infiltrating duct mixed with other types of carcinoma', 'Infiltrating lobular mixed with other types of carcinoma', 'Intraductal micropapillary carcinoma', 'Intraductal papillary adenocarcinoma with invasion', 'Large cell neuroendocrine carcinoma', 'Lobular carcinoma, NOS', 'Medullary carcinoma, NOS', 'Metaplastic carcinoma, NOS', 'Mucinous adenocarcinoma', 'Paget disease and infiltrating duct carcinoma of breast', 'Papillary carcinoma, NOS', 'Phyllodes tumor, malignant', 'Pleomorphic carcinoma', 'Secretory carcinoma of breast', 'Tubular adenocarcinoma'

I would like to ask how you derived the ILC and IDC labels that you use, or whether you could provide the labels for the .svs files.

I am asking about the BRCA cohort in particular because it is the one I started inspecting, but a similar question would probably arise for the other cohorts, and their labels would be useful as well (:

juanigp commented 2 years ago

Are these the files with the labels: HIPT/tree/master/2-Weakly-Supervised-Subtyping/dataset_csv ?

In that case, only the kidney file seems to be working. It would be awesome if you could provide the .csv files for the BRCA and lung datasets.

Thanks!

Richarizardd commented 2 years ago

Hi @juanigp - which columns are you using for getting the diagnoses? Are you using the oncotree_code column? See also the mapping of how I am discretizing these labels.
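For readers following along, here is a minimal sketch of what discretizing on the oncotree_code column could look like. It is not the repository's code: the helper name discretize, the integer values in LABEL_DICT, and the example rows are assumptions made for illustration. Only the column name and the IDC/ILC codes come from this thread.

```python
# Hypothetical label mapping: the codes appear in this thread,
# the integer values are an assumption.
LABEL_DICT = {"IDC": 0, "ILC": 1}


def discretize(rows, label_col="oncotree_code", label_dict=LABEL_DICT):
    """Keep rows whose code is in label_dict and attach an integer label."""
    kept = []
    for row in rows:
        code = row.get(label_col)
        if code in label_dict:
            # Copy the row so the input list is left untouched.
            kept.append({**row, "label": label_dict[code]})
    return kept


# Made-up rows standing in for a dataset .csv.
rows = [
    {"slide_id": "slide_a", "oncotree_code": "IDC"},
    {"slide_id": "slide_b", "oncotree_code": "ILC"},
    {"slide_id": "slide_c", "oncotree_code": "MDLC"},
]
labelled = discretize(rows)
```

Rows with any other code (MDLC in this example) are dropped rather than forced into one of the two classes.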

juanigp commented 2 years ago

Hi @Richarizardd, thank you for the assistance. I retrieved the diagnoses using the GDC API. Those are the unique values of the field diagnoses.primary_diagnosis across all cases in the TCGA-BRCA project.

I cannot seem to find the oncotree_code column in the GDC API.

I appreciate your help with this.
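A rough sketch of the kind of GDC query described above, for anyone who wants to reproduce the list of diagnoses. The helper names build_cases_query and unique_diagnoses are hypothetical, the endpoint and response layout are assumed from the public GDC API, and the sample hits are made up so that the snippet runs without network access.

```python
import json

# Assumed public GDC endpoint for case metadata.
GDC_CASES_ENDPOINT = "https://api.gdc.cancer.gov/cases"


def build_cases_query(project_id="TCGA-BRCA", size=2000):
    """Build query parameters asking for each case's primary diagnosis."""
    filters = {
        "op": "in",
        "content": {"field": "project.project_id", "value": [project_id]},
    }
    return {
        "filters": json.dumps(filters),
        "fields": "submitter_id,diagnoses.primary_diagnosis",
        "format": "JSON",
        "size": str(size),
    }


def unique_diagnoses(hits):
    """Collect the distinct primary_diagnosis values from a list of case hits."""
    found = set()
    for case in hits:
        for diagnosis in case.get("diagnoses", []):
            value = diagnosis.get("primary_diagnosis")
            if value:
                found.add(value)
    return sorted(found)


# Made-up hits shaped like the assumed response["data"]["hits"] list.
sample_hits = [
    {"submitter_id": "case_1",
     "diagnoses": [{"primary_diagnosis": "Infiltrating duct carcinoma, NOS"}]},
    {"submitter_id": "case_2",
     "diagnoses": [{"primary_diagnosis": "Lobular carcinoma, NOS"}]},
    {"submitter_id": "case_3", "diagnoses": []},
]
params = build_cases_query()
diagnoses = unique_diagnoses(sample_hits)

# With network access, something like the following would fetch real data:
# import requests
# response = requests.get(GDC_CASES_ENDPOINT, params=params)
# diagnoses = unique_diagnoses(response.json()["data"]["hits"])
```

None of these fields is an OncoTree code, which is consistent with the observation that oncotree_code does not appear in the GDC API output.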

juanigp commented 2 years ago

@Richarizardd I wasn't using the brca .zip file in HIPT/tree/master/2-Weakly-Supervised-Subtyping/dataset_csv because the file seemed broken when trying to uncompress it.

However, I could parse the .zip file with pandas, and see the oncotree_code column. Weird behaviour for a .zip file!

I would still like to ask you, out of curiosity, where these labels come from, since the GDC seems to have a different convention for naming the subtypes.
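For anyone else who has trouble uncompressing the archive, a zipped .csv can be read without extracting it first. pandas.read_csv accepts compression="zip" for this, which is presumably why pandas could parse it. The sketch below does the same with the standard library only. The file name example_subset.csv.zip and its contents are placeholders, not the repository's data.

```python
import csv
import io
import os
import tempfile
import zipfile


def read_zipped_csv(zip_path):
    """Return the rows of the first member of a zip archive, parsed as CSV."""
    with zipfile.ZipFile(zip_path) as archive:
        first_member = archive.namelist()[0]
        with archive.open(first_member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


# Build a tiny stand-in archive so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "example_subset.csv.zip")
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.writestr(
        "example_subset.csv",
        "slide_id,oncotree_code\nslide_a,IDC\nslide_b,ILC\n",
    )

rows = read_zipped_csv(zip_path)
codes = sorted({row["oncotree_code"] for row in rows})
```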

clemsgrs commented 2 years ago

@Richarizardd I would also be interested in knowing how the labels in the .csv files under HIPT/tree/master/2-Weakly-Supervised-Subtyping/dataset_csv were generated.

Just like @juanigp, I also retrieved clinical data from the GDC portal. I assume you mapped the primary_diagnosis values to IDC, ILC, MDLC, PD, ACBC, IMMC, BRCNOS, BRCA, SPC, MBC, MPT. It would be helpful if you could provide the mapping you used (e.g. if we want to train/tune/test on different TCGA BRCA slides, we'd need to infer IDC/ILC labels like you did). Thanks!

nauyan commented 1 year ago

@clemsgrs have you got an idea of how the labels are mapped?

clemsgrs commented 1 year ago

unfortunately, no

weiaicunzai commented 10 months ago

I believe they map primary_diagnosis values to IDC, ILC, etc. through OncoTree, which is introduced in the paper OncoTree: A Cancer Classification System for Precision Oncology. OncoTree is similar to the WordNet used by ImageNet. For example, the primary_diagnosis value Infiltrating duct carcinoma, NOS is mapped in OncoTree to Breast Invasive Ductal Carcinoma (IDC). However, I think it takes a doctor to know exactly which OncoTree code each value should map to. There is also a mapping tool on GitHub: https://github.com/cBioPortal/oncotree?tab=readme-ov-file
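To make the idea concrete, here is a minimal sketch of a lookup from primary_diagnosis values to OncoTree codes. This is not the authors' mapping, which was never posted in this thread. Only the IDC entry is mentioned above. The ILC entry is an assumption, and every other diagnosis is deliberately left unmapped so that it can be reviewed by someone with clinical knowledge. The names DIAGNOSIS_TO_ONCOTREE and split_mapped are hypothetical.

```python
# Hypothetical lookup table, not the mapping used for the HIPT splits.
DIAGNOSIS_TO_ONCOTREE = {
    "Infiltrating duct carcinoma, NOS": "IDC",  # pairing mentioned in this thread
    "Lobular carcinoma, NOS": "ILC",            # assumption, not confirmed here
}


def split_mapped(diagnoses, table=DIAGNOSIS_TO_ONCOTREE):
    """Separate diagnoses with a known code from those needing manual review."""
    mapped, unmapped = {}, []
    for diagnosis in diagnoses:
        code = table.get(diagnosis.strip())
        if code is None:
            unmapped.append(diagnosis)
        else:
            mapped[diagnosis] = code
    return mapped, unmapped


mapped, unmapped = split_mapped([
    "Infiltrating duct carcinoma, NOS",
    "Lobular carcinoma, NOS",
    "Mucinous adenocarcinoma",
])
```

Returning the unmapped values explicitly keeps ambiguous diagnoses visible instead of silently assigning them to a class.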

syy-create commented 7 months ago

May I ask how you ultimately resolved the mapping issue and obtained IDC and ILC labels from the BRCA data downloaded from GDC?