waldronlab / cBioPortalData

Integrate the cancer genomics portal, cBioPortal, using MultiAssayExperiment
https://waldronlab.io/cBioPortalData/

migrate from cgdsr to cBioPortalData #52

Closed kmezhoud closed 2 years ago

kmezhoud commented 2 years ago

Dear all, I suppose that all packages depending on cgdsr will switch to cBioPortalData. Concretely: 1- Is there an equivalent to these 6 commands? 2- Does the structure of the data in cgdsr remain the same in cBioPortalData?

At a quick glance I saw that getCancerStudies has been replaced by getStudies ...

Thanks, Karim

cgds <- CGDS("http://cbioportal.org/public-portal/")
Studies <- getCancerStudies(cgds)
GenProf <- getGeneticProfiles(cgds, "gbm_tcga_pub")
Cases <- getCaseLists(cgds, "gbm_tcga_pub")
ClinData <- getClinicalData(cgds, "gbm_tcga_pub_all")
ProfData <- getProfileData(cgds, "NF1", "gbm_tcga_pub_mrna", "gbm_tcga_pub_all")

LiNk-NY commented 2 years ago

Hi Karim! @kmezhoud I hope you're well.

cBioPortalData is not a direct migration from cgdsr. It is mainly an implementation to facilitate data download from the bulk tarballs and the API via cBioDataPack and cBioPortalData, respectively.

Please see the vignette for developers here: https://waldronlab.io/cBioPortalData/articles/cBioPortalRClient.html Feel free to post any further questions.

Best regards, Marcel

kmezhoud commented 2 years ago

Dear Ramos, Thanks!

Here I will try to compare these two packages and understand their different approaches.

To summarize, the main notes are:

I tried to get mutation data for some genes using Entrez IDs or Hugo symbols, but without success. How can I get molecular data if I know the sampleListId, the molecularProfileId and the gene list? The studyId should be optional, since sampleListId and molecularProfileId are unique.

Thanks, Say hello to Levis :-). Karim

get Studies

library(cgdsr)
library(dplyr)  # provides %>% and pull(), used below
cgds<-CGDS("http://www.cbioportal.org/")
getCancerStudies.CGDS(cgds) %>%
    pull(cancer_study_id) %>%
    sort() %>%
    head()

[1] "acbc_mskcc_2015"             "acc_2019"                    "acc_tcga"                    "acc_tcga_pan_can_atlas_2018"
[5] "acyc_fmi_2014"               "acyc_jhu_2016"

library(dplyr)
library(cBioPortalData)
cbio <- cBioPortal(
  hostname = "www.cbioportal.org",
  protocol = "https",
  api. = "/api/api-docs"
)

getStudies(cbio) %>% 
    pull(studyId) %>%
    sort() %>%
    head()

[1] "acbc_mskcc_2015"             "acc_2019"                    "acc_tcga"                    "acc_tcga_pan_can_atlas_2018"
[5] "acyc_fmi_2014"               "acyc_jhu_2016"  

As you can see, the two packages use the same hostname but different protocols (http, insecure, versus https, secure).

They return the same list of studies, as a data frame in cgdsr and as a tibble in cBioPortalData.

get Cases & Clinical Data

mycase <- getCaseLists.CGDS(cgds,cancerStudy = "gbm_tcga_pub") %>%
         pull(case_list_id) %>%
         first()

chr [1:15] "gbm_tcga_pub_all" "gbm_tcga_pub_expr_classical" "gbm_tcga_pub_expr_mesenchymal" "gbm_tcga_pub_expr_neural" ...

## get Clinical Data, we need to specify the case ID or Sample list ID
getClinicalData.CGDS(x = cgds, caseList =   mycase) %>%
    str()

[1] "gbm_tcga_pub_all"                  "gbm_tcga_pub_expr_classical"       "gbm_tcga_pub_expr_mesenchymal"    
 [4] "gbm_tcga_pub_expr_neural"          "gbm_tcga_pub_expr_proneural"       "gbm_tcga_pub_cna"                 
 [7] "gbm_tcga_pub_methylation_all"      "gbm_tcga_pub_methylation_hm27"     "gbm_tcga_pub_microrna"            
[10] "gbm_tcga_pub_mrna"                 "gbm_tcga_pub_cnaseq"               "gbm_tcga_pub_sequenced"           
[13] "gbm_tcga_pub_sequenced_nohyper"    "gbm_tcga_pub_sequenced_nottreated" "gbm_tcga_pub_sequenced_treated"   

In cgdsr, the user has to specify a case list ID (the equivalent of a sampleListId) to get clinical data.

#getSampleInfo(api = cbio, studyId = "gbm_tcga_pub", projection = c("SUMMARY", "ID", "DETAILED", "META"))

# get Cases or Sample list ID
myCase_cbio <- sampleLists(api = cbio, studyId = "gbm_tcga_pub") %>% 
         pull(sampleListId)
str(myCase_cbio)  # str() returns NULL, so call it separately from the assignment

chr [1:15] "gbm_tcga_pub_all" "gbm_tcga_pub_expr_classical" "gbm_tcga_pub_expr_mesenchymal" "gbm_tcga_pub_expr_neural" ...

## get Clinical data
clinicalData(api = cbio, studyId = "gbm_tcga_pub") %>%
      str()

tibble [206 × 24] (S3: tbl_df/tbl/data.frame)
 $ patientId                  : chr [1:206] "TCGA-02-0001" "TCGA-02-0003" "TCGA-02-0004" "TCGA-02-0006" ...
 $ DFS_MONTHS                 : chr [1:206] "4.504109589" "1.315068493" "10.32328767" "9.928767123" ...
 $ DFS_STATUS                 : chr [1:206] "1:Recurred" "1:Recurred" "1:Recurred" "1:Recurred" ...
 $ KARNOFSKY_PERFORMANCE_SCORE: chr [1:206] "80.0" "100.0" "80.0" "80.0" ...
 $ OS_MONTHS                  : chr [1:206] "11.60547945" "4.734246575" "11.34246575" "18.34520548" ...
 $ OS_STATUS                  : chr [1:206] "1:DECEASED" "1:DECEASED" "1:DECEASED" "1:DECEASED" ...
 $ PRETREATMENT_HISTORY       : chr [1:206] "YES" "NO" "NO" "NO" ...
 $ PRIOR_GLIOMA               : chr [1:206] "NO" "NO" "NO" "NO" ...
 $ SAMPLE_COUNT               : chr [1:206] "1" "1" "1" "1" ...
 $ SEX                        : chr [1:206] "Female" "Male" "Male" "Female" ...
 $ sampleId                   : chr [1:206] "TCGA-02-0001-01" "TCGA-02-0003-01" "TCGA-02-0004-01" "TCGA-02-0006-01" ...
 $ ACGH_DATA                  : chr [1:206] "YES" "YES" "NO" "YES" ...
 $ CANCER_TYPE                : chr [1:206] "Glioblastoma Multiforme" "Glioblastoma Multiforme" "Glioblastoma Multiforme" "Glioblastoma Multiforme" ...
 $ CANCER_TYPE_DETAILED       : chr [1:206] "Glioblastoma Multiforme" "Glioblastoma Multiforme" "Glioblastoma Multiforme" "Glioblastoma Multiforme" ...
 $ COMPLETE_DATA              : chr [1:206] "YES" "YES" "NO" "YES" ...
 $ FRACTION_GENOME_ALTERED    : chr [1:206] "0.2459" "0.1480" NA "0.2391" ...
 $ MRNA_DATA                  : chr [1:206] "YES" "YES" "YES" "YES" ...
 $ MUTATION_COUNT             : chr [1:206] "3" "5" NA NA ...
 $ ONCOTREE_CODE              : chr [1:206] "GBM" "GBM" "GBM" "GBM" ...
 $ SAMPLE_TYPE                : chr [1:206] "Primary" "Primary" "Primary" "Primary" ...
 $ SEQUENCED                  : chr [1:206] "YES" "YES" "YES" "YES" ...
 $ SOMATIC_STATUS             : chr [1:206] "Matched" "Matched" "Matched" "Matched" ...
 $ TMB_NONSYNONYMOUS          : chr [1:206] "2.36904510899" "3.94840851498" NA "0.0" ...
 $ TREATMENT_STATUS           : chr [1:206] "Untreated" "Untreated" "Untreated" "Untreated" ...

In cBioPortalData we can get clinical data without specifying a sampleListId. In this case we get the clinical data for the whole study (206 rows here).
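Note that every column in the str() output above is of type chr, including numeric fields such as OS_MONTHS and MUTATION_COUNT. A minimal sketch of converting them with base R's type.convert(); the small data frame below is a stand-in built from the first values shown above, not a live API result, and the names clin and clin_typed are only illustrative:

```r
# Stand-in for the first rows of clinicalData(api = cbio, studyId = "gbm_tcga_pub");
# the values are copied from the str() output above.
clin <- data.frame(
  patientId      = c("TCGA-02-0001", "TCGA-02-0003", "TCGA-02-0004", "TCGA-02-0006"),
  OS_MONTHS      = c("11.60547945", "4.734246575", "11.34246575", "18.34520548"),
  MUTATION_COUNT = c("3", "5", NA, NA),
  stringsAsFactors = FALSE
)

# Convert each column to its most specific type; as.is = TRUE keeps
# non-numeric columns as character instead of turning them into factors.
clin_typed <- type.convert(clin, as.is = TRUE)
str(clin_typed)
```

The same call should also work on the tibble returned by clinicalData(), since a tibble is a data frame.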

get Genetic Profiles or Molecular Profiles

getGeneticProfiles.CGDS(cgds,cancerStudy = "gbm_tcga_pub" ) %>%
    select(genetic_profile_id, genetic_profile_name, everything()) %>%
    str()

'data.frame':   10 obs. of  6 variables:
 $ genetic_profile_id          : chr  "gbm_tcga_pub_cna_rae" "gbm_tcga_pub_cna_consensus" "gbm_tcga_pub_mutations" "gbm_tcga_pub_methylation_hm27" ...
 $ genetic_profile_name        : chr  "Putative copy-number alterations (RAE)" "Putative copy-number alterations (Consensus)" "Mutations" "Methylation (HM27)" ...
 $ genetic_profile_description : chr  "Putative copy-number calls for all genes in 203 GBM cases. Copy number calls were determined from the Agilent 2"| __truncated__ "Putative copy-number calls for genes implicated in glioblastoma (206 cases). These calls were used for the path"| __truncated__ "Mutation data for targeted sequencing in 91 primary glioblastoma tumor/normal pairs (Phases I/II of the TCGA gl"| __truncated__ "Methylation beta-values (Infinium HumanMethylation27 platform). For genes with multiple methylation probes, the"| __truncated__ ...
 $ cancer_study_id             : int  100 100 100 100 100 100 100 100 100 100
 $ genetic_alteration_type     : chr  "COPY_NUMBER_ALTERATION" "COPY_NUMBER_ALTERATION" "MUTATION_EXTENDED" "METHYLATION" ...
 $ show_profile_in_analysis_tab: chr  "true" "true" "true" "false" ...

molecularProfiles(api = cbio, studyId = "gbm_tcga_pub") %>%
    select(molecularProfileId, name, everything()) %>%
    str()

tibble [10 × 8] (S3: tbl_df/tbl/data.frame)
 $ molecularAlterationType : chr [1:10] "COPY_NUMBER_ALTERATION" "COPY_NUMBER_ALTERATION" "MUTATION_EXTENDED" "METHYLATION" ...
 $ datatype                : chr [1:10] "DISCRETE" "DISCRETE" "MAF" "CONTINUOUS" ...
 $ name                    : chr [1:10] "Putative copy-number alterations (RAE)" "Putative copy-number alterations (Consensus)" "Mutations" "Methylation (HM27)" ...
 $ description             : chr [1:10] "Putative copy-number calls for all genes in 203 GBM cases. Copy number calls were determined from the Agilent 2"| __truncated__ "Putative copy-number calls for genes implicated in glioblastoma (206 cases). These calls were used for the path"| __truncated__ "Mutation data for targeted sequencing in 91 primary glioblastoma tumor/normal pairs (Phases I/II of the TCGA gl"| __truncated__ "Methylation beta-values (Infinium HumanMethylation27 platform). For genes with multiple methylation probes, the"| __truncated__ ...
 $ showProfileInAnalysisTab: logi [1:10] TRUE TRUE TRUE FALSE FALSE TRUE ...
 $ patientLevel            : logi [1:10] FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ molecularProfileId      : chr [1:10] "gbm_tcga_pub_cna_rae" "gbm_tcga_pub_cna_consensus" "gbm_tcga_pub_mutations" "gbm_tcga_pub_methylation_hm27" ...
 $ studyId                 : chr [1:10] "gbm_tcga_pub" "gbm_tcga_pub" "gbm_tcga_pub" "gbm_tcga_pub" ...

get Profile Data or molecular Data (mRNA expression) for specific gene list Entrez/Hugo Symbol

library(tictoc)
tic("cgdsr:")
getProfileData.CGDS(x = cgds, 
                    genes = c("NF1", "TP53", "ABL1"),
                    geneticProfiles = "gbm_tcga_pub_mrna", 
                    caseList = "gbm_tcga_pub_all") %>%
                    head()
toc()

cgdsr:: 0.515 sec elapsed


get gene Entrez ID from gene Hugo Symbol


# get all genePanelId values
all_genePanelId <- genePanels(api = cbio) %>% pull(genePanelId)

## get all genes (Entrez/symbol) from all genePanelId values, remove duplicates
all_genes_tbl <- lapply(X = all_genePanelId, function(x) getGenePanel(api = cbio, genePanelId = x)) %>%
                 bind_rows() %>%
                 distinct()
#    group_by(entrezGeneId, hugoGeneSymbol) %>%
#    filter(n()>1) %>%
#    summarize(n=n(), .groups = "rowwise")

Our_gene_entrez <- all_genes_tbl %>%
                    filter(hugoGeneSymbol %in% c("NF1", "TP53", "ABL1")) %>%
                     pull(entrezGeneId)

##  [1] 7157   25 4763

tic("cBioPortalData")
molecularData(api = cbio, 
              molecularProfileIds = "gbm_tcga_pub_mrna",
              entrezGeneIds = Our_gene_entrez,
              sampleIds = "gbm_tcga_pub_all")
toc()

named list()
cBioPortalData: 0.178 sec elapsed

The output is empty.

Try cBioPortalData, as mentioned in issue #30.

## with Entrez
cBioPortalData(
  api = cbio,
  studyId = "gbm_tcga",
  #genePanelId = "AmpliSeq",
  genes = Our_gene_entrez, #c("NF1", "P53", "BRCA1", "BRCA2"),
  molecularProfileIds = "gbm_tcga_pub_mrna",
  #sampleListId = "gbm_tcga_pub_all",
  sampleIds = "gbm_tcga_pub_all",
  by = "entrezGeneId" #, "hugoGeneSymbol"
)

Error in split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) : 
group length is 0 but data length > 0

## with Symbol
cBioPortalData(
  api = cbio,
  studyId = "gbm_tcga",
  #genePanelId = "AmpliSeq",
  genes = c("NF1", "P53", "BRCA1", "BRCA2"),
  molecularProfileIds = "gbm_tcga_pub_mrna",
  #sampleListId = "gbm_tcga_pub_all",
  sampleIds = "gbm_tcga_pub_all",
  by = "hugoGeneSymbol"
)

Error in split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) : 
group length is 0 but data length > 0

get mRNA expression with cBioPortalData function and existing genePanelId

gbm <- cBioPortalData(
  api = cbio,
  by = "hugoGeneSymbol",
  studyId = "gbm_tcga",
  genePanelId = "IMPACT341",
  molecularProfileIds = "gbm_tcga_pub_mrna" #c("gbm_tcga_rppa", "gbm_tcga_mrna")
)

gbm@ExperimentList@listData$gbm_tcga_pub_mrna@assays@data@listData[[1]] %>%
    as.data.frame() %>%
    head


gbm@ExperimentList@listData$gbm_tcga_pub_mrna@assays@data@listData[[1]] %>%
    rownames() %>%
    grepl(.,c("NF1", "TP53","ABL1"))

[1] FALSE FALSE  TRUE

ABL1 exists, but NF1 and TP53 do not exist.
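A caveat on the check above: with the row names piped into the first argument, grepl() receives them as the pattern, and grepl() uses only the first element of a multi-element pattern (with a warning). The result may therefore say less about NF1 and TP53 than it appears to. A minimal sketch of the difference, using made-up row names (gene_rows is a hypothetical stand-in, not the real assay):

```r
# Stand-in row names; the real ones would come from the assay matrix.
gene_rows <- c("ABL1", "AKT1", "TP53")  # hypothetical subset
wanted <- c("NF1", "TP53", "ABL1")

# grepl() uses only the first pattern element ("ABL1") and warns about it,
# so this really asks: which elements of `wanted` contain "ABL1"?
by_grepl <- suppressWarnings(grepl(gene_rows, wanted))
by_grepl
# [1] FALSE FALSE  TRUE

# %in% tests each wanted symbol for exact membership in the row names.
by_in <- wanted %in% gene_rows
by_in
# [1] FALSE  TRUE  TRUE
```

In this toy case TP53 is present in the row names, yet the grepl() form reports FALSE for it; the %in% form is the safer membership test.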

Try with getDataByGenes


## with Symbol
getDataByGenes(
  api = cbio,
  studyId = "gbm_tcga",
  genes =c("NF1", "P53", "ABL1"),
  #genePanelId = NA_character_,
  by = "hugoGeneSymbol",
  molecularProfileIds = "gbm_tcga_pub_mrna",
  #sampleListId = "gbm_tcga_pub_all",
  sampleIds = "gbm_tcga_pub_all"
)

# named list()

## With Entrez
getDataByGenes(
  api = cbio,
  studyId = "gbm_tcga",
  genes = Our_gene_entrez,
  #genePanelId = NA_character_,
  by = "entrezGeneId",
  molecularProfileIds = "gbm_tcga_pub_mrna",
  #sampleListId = "gbm_tcga_pub_all",
  sampleIds = "gbm_tcga_pub_all"
)

named list()

ABL1 is not returned!

get mutation

getMutationData.CGDS(x=cgds, 
                     caseList = "getMutationData",
                     geneticProfile = "gbm_tcga_pub_mutations",
                     genes = c("NF1", "TP53", "ABL1")) %>%
    select(entrez_gene_id, gene_symbol, amino_acid_change, everything()) %>%
    head()


getDataByGenes(
  api = cbio,
  studyId = "gbm_tcga",
  genes = Our_gene_entrez,
  #genePanelId = NA_character_,
  by = "entrezGeneId",
  molecularProfileIds = "gbm_tcga_pub_mutations",
  #sampleListId = "gbm_tcga_pub_all",
  sampleIds = "gbm_tcga_pub_all"
)

Error in byGeneList[mutation] <- mutationData(api, molecularProfileIds[mutation], : 
replacement has length zero

cBioPortalData(
  api = cbio,
  studyId = "gbm_tcga",
  #genePanelId = "AmpliSeq",
  genes = c("NF1", "P53", "BRCA1", "ABL1"),
  molecularProfileIds = "gbm_tcga_pub_mutations",
  #sampleListId = "gbm_tcga_pub_all",
  sampleIds = "gbm_tcga_pub_all",
  by = "hugoGeneSymbol"
)

Error in byGeneList[mutation] <- mutationData(api, molecularProfileIds[mutation], : 
replacement has length zero

LiNk-NY commented 2 years ago

Hi Karim, @kmezhoud

Thank you for this comprehensive comparison! I can add this to the package as a vignette (with attribution, of course) for those looking to migrate their code from cgdsr to cBioPortalData.

The examples you provided mixed the use of gbm_tcga and gbm_tcga_pub and that's why you were seeing empty responses.
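One way to catch this kind of mix-up early is to compare the requested IDs against the IDs the study actually offers before querying. A minimal sketch in base R; missing_ids is a hypothetical helper, and the two ID vectors are illustrative values taken from this thread rather than a live API response:

```r
# Hypothetical helper: return the requested IDs that the study does not offer.
missing_ids <- function(requested, available) {
  setdiff(requested, available)
}

# Illustrative ID sets. In a live session they would come from, for example,
# molecularProfiles(api = cbio, studyId = "gbm_tcga")$molecularProfileId
profiles_gbm_tcga     <- c("gbm_tcga_rppa", "gbm_tcga_mrna")
profiles_gbm_tcga_pub <- c("gbm_tcga_pub_cna_rae", "gbm_tcga_pub_mutations",
                           "gbm_tcga_pub_mrna")

# studyId "gbm_tcga" combined with a "gbm_tcga_pub" profile: flagged as missing
missing_ids("gbm_tcga_pub_mrna", profiles_gbm_tcga)

# studyId "gbm_tcga_pub" with its own profile: nothing missing
missing_ids("gbm_tcga_pub_mrna", profiles_gbm_tcga_pub)
```

A simple prefix test would not be enough here, because "gbm_tcga_pub_mrna" also starts with "gbm_tcga".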

The molecularData operation could use a bit more flexibility in terms of inputs. I will work on a hugoGeneSymbol input. These are lower level functions and are not very user friendly. If you're looking to get to the data straightaway, you can simply do:

cbio <- cBioPortal()
gbm_pub <- cBioPortalData(
    cbio,
    "gbm_tcga_pub",
    genes = c("NF1", "TP53", "ABL1"),
    by = "hugoGeneSymbol",
    molecularProfileIds = "gbm_tcga_pub_mrna"
)
assay(gbm_pub[["gbm_tcga_pub_mrna"]])

Best regards, Marcel

LiNk-NY commented 2 years ago

Update: I've added the ability to query the API for a table of gene symbols:

cbio <- cBioPortal()
queryGeneTable(cbio,
    by = "hugoGeneSymbol",
    genes = c("NF1", "TP53", "ABL1")
)

and a vignette to allow developers to migrate from cgdsr to cBioPortalData at https://github.com/waldronlab/cBioPortalData/blob/devel/vignettes/cgdsMigration.Rmd

Your feedback is welcome. Thanks!
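
For quick reference, the function pairings that came up in this thread can be collected in a named vector. This is only a sketch of the correspondences as they appear in the comparison above, not an official mapping: the arguments and return types differ between the two packages, and mutationData is listed only because it appears in the error messages quoted earlier.

```r
# cgdsr function -> closest cBioPortalData counterpart seen in this thread.
# The vector name cgdsr_to_cbio is illustrative.
cgdsr_to_cbio <- c(
  CGDS               = "cBioPortal",
  getCancerStudies   = "getStudies",
  getCaseLists       = "sampleLists",
  getClinicalData    = "clinicalData",
  getGeneticProfiles = "molecularProfiles",
  getProfileData     = "molecularData",
  getMutationData    = "mutationData"
)

# Look up the counterpart of a single cgdsr function.
cgdsr_to_cbio[["getCaseLists"]]
# [1] "sampleLists"
```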