Closed juanigp closed 2 years ago
Are these the files with the labels: HIPT/tree/master/2-Weakly-Supervised-Subtyping/dataset_csv ?
In that case only the kidney file seems to be working. It would be awesome if you could provide the .csvs for the brca and lung dataset.
Thanks!
Hi @juanigp - which columns are you using for getting the diagnoses? Are you using the oncotree_code column? See also the mapping of how I am discretizing these labels.
Hi @Richarizardd, thank you for the assistance. I retrieved the diagnoses using the GDC API. Those are the unique values of the field diagnoses.primary_diagnosis across all the cases in TCGA-BRCA.
I cannot seem to find the oncotree_code column in the GDC API.
I appreciate your help with this.
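For anyone trying to reproduce the list of diagnoses, here is a minimal sketch of one way to query the GDC API for the unique diagnoses.primary_diagnosis values of a project. It is not necessarily how the values above were retrieved. The helper names (build_case_query, unique_diagnoses, fetch_unique_diagnoses) and the page size are assumptions made for illustration. The sketch also assumes the public /cases endpoint and its usual data.hits response layout.

```python
import json
import urllib.parse
import urllib.request

# Public GDC cases endpoint.
GDC_CASES_URL = "https://api.gdc.cancer.gov/cases"


def build_case_query(project_id="TCGA-BRCA", size=2000):
    """Build query parameters asking for the primary diagnosis of each case."""
    filters = {
        "op": "in",
        "content": {"field": "project.project_id", "value": [project_id]},
    }
    return {
        "filters": json.dumps(filters),
        "fields": "submitter_id,diagnoses.primary_diagnosis",
        "format": "JSON",
        "size": str(size),  # assumed large enough to cover the cohort in one page
    }


def unique_diagnoses(response):
    """Collect the distinct primary_diagnosis strings from a parsed response."""
    found = set()
    for hit in response["data"]["hits"]:
        # Some cases may carry no diagnoses entry at all.
        for diagnosis in hit.get("diagnoses", []):
            value = diagnosis.get("primary_diagnosis")
            if value:
                found.add(value)
    return sorted(found)


def fetch_unique_diagnoses(project_id="TCGA-BRCA"):
    """Query the API over the network and return the sorted unique diagnoses."""
    query = urllib.parse.urlencode(build_case_query(project_id))
    with urllib.request.urlopen(GDC_CASES_URL + "?" + query) as resp:
        return unique_diagnoses(json.load(resp))
```

Calling fetch_unique_diagnoses("TCGA-BRCA") needs network access. The other two functions can be checked offline against a saved response.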
@Richarizardd I wasn't using the brca .zip file in HIPT/tree/master/2-Weakly-Supervised-Subtyping/dataset_csv because the file seemed broken when trying to uncompress it.
However, I could parse the .zip file with pandas, and see the oncotree_code column. Weird behaviour for a .zip file!
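In case it helps others hitting the same problem, the pandas workaround can be sketched roughly as follows. The function name load_label_csv is made up for illustration. The sketch assumes the archive holds exactly one CSV file, which pandas requires when reading a zip directly.

```python
import pandas as pd


def load_label_csv(path):
    """Read a zipped CSV straight into a DataFrame without unpacking it first.

    Assumes the archive contains a single CSV member.
    """
    return pd.read_csv(path, compression="zip")
```

After loading, df["oncotree_code"].value_counts() gives a quick view of which subtype codes are present.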
I would still like to ask you, out of curiosity, where these labels come from, since the GDC seems to have a different convention for naming the subtypes.
@Richarizardd I would also be interested in knowing how the labels in the .csv files under HIPT/tree/master/2-Weakly-Supervised-Subtyping/dataset_csv were generated.
Just like @juanigp, I also retrieved clinical data from the GDC portal. I assume you mapped the primary_diagnosis values to IDC, ILC, MDLC, PD, ACBC, IMMC, BRCNOS, BRCA, SPC, MBC, MPT. It would be helpful if you could provide the mapping you used (e.g. if we want to train/tune/test on different TCGA BRCA slides, we'd need to infer IDC/ILC labels like you did). Thanks!
@clemsgrs have you got an idea of how the labels are mapped?
unfortunately, no
I believe they map primary_diagnosis values to IDC, ILC, etc. through OncoTree, which was introduced in the paper OncoTree: A Cancer Classification System for Precision Oncology. OncoTree plays a role similar to the one WordNet plays for ImageNet. For example, the primary_diagnosis value Infiltrating duct carcinoma, NOS is mapped in OncoTree to Breast Invasive Ductal Carcinoma (IDC). However, I think it takes a doctor to know exactly which OncoTree code each value should map to. There is also a mapping tool on GitHub: https://github.com/cBioPortal/oncotree?tab=readme-ov-file
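To make the idea concrete, here is a minimal sketch of a lookup from primary_diagnosis strings to OncoTree codes. This is not the authors' mapping. Only the IDC entry comes from the example above. The ILC entry is an unconfirmed assumption and should be checked by someone with the right expertise. The names DIAGNOSIS_TO_ONCOTREE and to_oncotree are made up for illustration.

```python
# Partial, illustrative mapping; NOT the mapping used to build the HIPT labels.
DIAGNOSIS_TO_ONCOTREE = {
    # From the example in this thread.
    "Infiltrating duct carcinoma, NOS": "IDC",
    # ASSUMPTION: plausible but unconfirmed; verify before using for training.
    "Lobular carcinoma, NOS": "ILC",
}


def to_oncotree(primary_diagnosis, mapping=DIAGNOSIS_TO_ONCOTREE):
    """Return the OncoTree code for a diagnosis, or None if it is unmapped."""
    return mapping.get(primary_diagnosis.strip())


def unmapped(diagnoses, mapping=DIAGNOSIS_TO_ONCOTREE):
    """List the diagnoses that still need a manually reviewed code."""
    return sorted({d for d in diagnoses if to_oncotree(d, mapping) is None})
```

Returning None for unknown values, rather than guessing, keeps the ambiguous diagnoses (for example the mixed ductal and lobular ones) visible so they can be reviewed by hand.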
May I ask how you ultimately resolved the mapping issue by obtaining IDC and ILC labels from the BRCA data downloaded from gdc?
Hi, first of all thank you for the amazing work and codebase. I see that the data folds for the subtype classification task are provided in HIPT/2-Weakly-Supervised-Subtyping/splits/10foldcv_subtype.
However in the .csv files, the labels are not provided. I tried retrieving the diagnoses for all the .svs in the brca cohort using the gdc api and these are the different diagnoses:
'Adenoid cystic carcinoma', 'Apocrine adenocarcinoma', 'Basal cell carcinoma, NOS', 'Carcinoma, NOS', 'Cribriform carcinoma, NOS', 'Infiltrating duct and lobular carcinoma', 'Infiltrating duct carcinoma, NOS', 'Infiltrating duct mixed with other types of carcinoma', 'Infiltrating lobular mixed with other types of carcinoma', 'Intraductal micropapillary carcinoma', 'Intraductal papillary adenocarcinoma with invasion', 'Large cell neuroendocrine carcinoma', 'Lobular carcinoma, NOS', 'Medullary carcinoma, NOS', 'Metaplastic carcinoma, NOS', 'Mucinous adenocarcinoma', 'Paget disease and infiltrating duct carcinoma of breast', 'Papillary carcinoma, NOS', 'Phyllodes tumor, malignant', 'Pleomorphic carcinoma', 'Secretory carcinoma of breast', 'Tubular adenocarcinoma'
I would like to ask how you derived the ILC and IDC labels that you use, or whether you could provide the labels for the .svs files.
I am asking particularly about the brca cohort since it is the one I started inspecting, but probably a similar question would arise from the other cohorts, and their labels would be useful as well (: