waldronlab / TCGAutils

Toolbox package for organizing and working with TCGA data
https://bioconductor.org/packages/TCGAutils

getSubtypeMap Returns Invalid Column Names #32

Closed DarioS closed 2 years ago

DarioS commented 2 years ago

The column names returned by getSubtypeMap are not found in the colData of the data set. An example is

library(curatedTCGAData)
library(TCGAutils)
headNeck <- curatedTCGAData("HNSC", c("RNASeq2Gene", "Mutation"), "2.0.1", FALSE)
> getSubtypeMap(headNeck)
      HNSC_annotations HNSC_subtype
1           Patient_ID      Barcode
2        mrna_subtypes          RNA
3 methylation_subtypes  Methylation
4     protein_subtypes         RPPA
5    microrna_subtypes        miRNA
6        scna_subtypes  Copy Number
7 integrative_subtypes     PARADIGM
> colData(headNeck)[, "mrna_subtypes"]
Error: subscript contains invalid names

Also, Genomic Classification of Cutaneous Melanoma, Cell, 2015 has

BRAF Subtype: The largest genomic subtype is defined by the presence of BRAF hot-spot mutations.

RAS Subtype: The second major subtype is defined by the presence of RAS hot-spot mutations, including known amino acid changes with functional consequences, in all three RAS family members (N-, K- and H-RAS).

NF1 Subtype: The third most frequently observed SMG in the MAPK pathway was NF1, which was mutated in 14% of samples.

Triple Wild-Type Subtype: We defined the Triple-WT subtype (n = 46) as a heterogeneous subgroup characterized by a lack of hot-spot BRAF, N/H/K-RAS, or NF1 mutations.

and I find that I can't access those, either, although they seem to be curated.

> getSubtypeMap(cutaneousMelanoma)
      SKCM_annotations              SKCM_subtype
1           Patient_ID                      Name
2    mutation_subtypes          MUTATIONSUBTYPES
3        mrna_subtypes RNASEQ-CLUSTER_CONSENHIER
4 methylation_subtypes          MethTypes.201408
5    microrna_subtypes                MIRCluster
6     protein_subtypes            ProteinCluster
7      pathway_cluster           OncoSignCluster
> colData(cutaneousMelanoma)[, "mutation_subtypes"]
Error: subscript contains invalid names

Judging by the patient ID column, the naming style might have changed over time. Note that the first column name is actually patientID and not Patient_ID, as the getSubtypeMap function reports it. I wonder if other column names are similarly incorrect.

> colData(cutaneousMelanoma)[1:5, 1:2]
DataFrame with 5 rows and 2 columns
                patientID years_to_birth
              <character>      <integer>
TCGA-BF-A1PU TCGA-BF-A1PU             46
TCGA-BF-A1PV TCGA-BF-A1PV             74
TCGA-BF-A1PX TCGA-BF-A1PX             56
TCGA-BF-A1PZ TCGA-BF-A1PZ             71
TCGA-BF-A1Q0 TCGA-BF-A1Q0             80
LiNk-NY commented 2 years ago

Hi Dario (@DarioS), I can take a closer look, but it seems that you should be looking up the column names in the *_subtype column of the map (e.g., HNSC_subtype) rather than the *_annotations column, e.g.,

colData(headNeck)[, getSubtypeMap(headNeck)$HNSC_subtype[getSubtypeMap(headNeck)$HNSC_subtype %in% names(colData(headNeck))]]
DataFrame with 521 rows and 5 columns
                     RNA Methylation      RPPA     miRNA  PARADIGM
             <character> <character> <integer> <integer> <integer>
TCGA-4P-AA8J          NA          NA        NA        NA        NA
TCGA-BA-4074          NA          NA        NA        NA        NA
TCGA-BA-4075          NA          NA        NA        NA        NA
TCGA-BA-4076          NA          NA        NA        NA        NA

The subtypes seem to be all NA in both version 1.1.38 and version 2.0.1 for HNSC.

Best, Marcel
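The lookup above can be sketched a little more readably by computing the map once and keeping only the names that actually exist in the clinical data. This is a minimal sketch in base R, not the package's own method: `subtypeMap` is copied from the getSubtypeMap(headNeck) output earlier in this thread, `clinical` is a hypothetical plain data.frame standing in for colData(headNeck), and `availableSubtypeColumns` is a hypothetical helper name.

```r
# Stand-in for getSubtypeMap(headNeck), copied from the output shown in this thread.
subtypeMap <- data.frame(
  HNSC_annotations = c("Patient_ID", "mrna_subtypes", "methylation_subtypes",
                       "protein_subtypes", "microrna_subtypes", "scna_subtypes",
                       "integrative_subtypes"),
  HNSC_subtype = c("Barcode", "RNA", "Methylation", "RPPA", "miRNA",
                   "Copy Number", "PARADIGM"),
  stringsAsFactors = FALSE
)

# ASSUMPTION: a small plain data.frame standing in for colData(headNeck).
# The real object is an S4Vectors DataFrame with many more rows and columns.
clinical <- data.frame(
  patientID = c("TCGA-4P-AA8J", "TCGA-BA-4074"),
  RNA = NA_character_, Methylation = NA_character_,
  RPPA = NA_integer_, miRNA = NA_integer_, PARADIGM = NA_integer_,
  row.names = c("TCGA-4P-AA8J", "TCGA-BA-4074"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the entries of the map's second column
# (the *_subtype column) that are present as column names in `data`.
availableSubtypeColumns <- function(map, data) {
  candidates <- map[[2L]]
  candidates[candidates %in% names(data)]
}

present <- availableSubtypeColumns(subtypeMap, clinical)

# drop = FALSE keeps a table even if only one column matches.
subtypes <- clinical[, present, drop = FALSE]
print(present)
```

With the real objects, the same idea would presumably read `colData(headNeck)[, present, drop = FALSE]`, where `present` is computed from `getSubtypeMap(headNeck)` and `names(colData(headNeck))`. Names such as "Barcode" and "Copy Number" are silently skipped here because they do not appear among the column names.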

DarioS commented 2 years ago

Ah, I see. Perhaps that could be explicitly stated in the documentation.

The getSubtypeMap function provides a 2 column data.frame with in-data variable names and an interpreted names.

It is unclear from this statement which of the two columns should be used to subset the colData DataFrame.
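Until the documentation spells this out, one way to check which column of the map is usable is to count how many entries of each column occur among the clinical column names. The sketch below is only an illustration in base R: `skcmMap` is copied from the getSubtypeMap(cutaneousMelanoma) output above, while `colDataNames` is an assumed stand-in for names(colData(cutaneousMelanoma)). Only patientID and years_to_birth are confirmed by the output in this thread; the other two names are assumptions made so the example has something to match.

```r
# Copied from the getSubtypeMap(cutaneousMelanoma) output shown in this thread.
skcmMap <- data.frame(
  SKCM_annotations = c("Patient_ID", "mutation_subtypes", "mrna_subtypes",
                       "methylation_subtypes", "microrna_subtypes",
                       "protein_subtypes", "pathway_cluster"),
  SKCM_subtype = c("Name", "MUTATIONSUBTYPES", "RNASEQ-CLUSTER_CONSENHIER",
                   "MethTypes.201408", "MIRCluster", "ProteinCluster",
                   "OncoSignCluster"),
  stringsAsFactors = FALSE
)

# ASSUMPTION: stand-in for names(colData(cutaneousMelanoma)).
# "patientID" and "years_to_birth" appear in the thread; the last two
# names are assumed to be present purely for illustration.
colDataNames <- c("patientID", "years_to_birth", "MUTATIONSUBTYPES", "MIRCluster")

# For each column of the map, count the entries found among the column names.
matchCounts <- vapply(
  skcmMap,
  function(column) sum(column %in% colDataNames),
  integer(1)
)

# The map column with the most matches is the one to subset with.
usableColumn <- names(which.max(matchCounts))
print(matchCounts)
print(usableColumn)
```

Replacing `colDataNames` with the names of the real colData would show directly whether the *_annotations or the *_subtype column holds the in-data variable names for a given cancer type.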