kmezhoud closed this issue 2 years ago
Hi Karim! @kmezhoud I hope you're well.
cBioPortalData is not a direct migration from cgdsr.
It is mainly an implementation to facilitate data download from the bulk tarballs and the API via cBioDataPack and cBioPortalData, respectively.
Please see the vignette for developers here: https://waldronlab.io/cBioPortalData/articles/cBioPortalRClient.html
Feel free to post any further questions.
Best regards, Marcel
Dear Ramos, Thanks!
Here I will try to compare these two packages and understand their different approaches.
To summarize, the main notes are:
cgdsr and cBioPortalData use the same hostname, http://www.cbioportal.org/.
cBioPortalData returns empty data compared to cgdsr.
sampleListId: cBioPortalData does not use sampleListId to query clinical data.
Is genePanelId associated with studyId, sampleListId, or molecularProfileId?
I tried to get mutation data for some genes using Entrez IDs or symbols, but without success.
How can we get molecularData if we know sampleListId, molecularProfileId, and the gene list? studyId remains optional, since sampleListId and molecularProfileId are unique.
Thanks, Say hello to Levis :-). Karim
library(dplyr) # provides %>% and pull(), used below
library(cgdsr)
cgds <- CGDS("http://www.cbioportal.org/")
getCancerStudies.CGDS(cgds) %>%
pull(cancer_study_id) %>%
sort() %>%
head()
[1] "acbc_mskcc_2015" "acc_2019" "acc_tcga" "acc_tcga_pan_can_atlas_2018"
[5] "acyc_fmi_2014" "acyc_jhu_2016"
library(dplyr)
library(cBioPortalData)
cbio <- cBioPortal(
hostname = "www.cbioportal.org",
protocol = "https",
api. = "/api/api-docs"
)
getStudies(cbio) %>%
pull(studyId) %>%
sort() %>%
head()
[1] "acbc_mskcc_2015" "acc_2019" "acc_tcga" "acc_tcga_pan_can_atlas_2018"
[5] "acyc_fmi_2014" "acyc_jhu_2016"
As you can see, the two packages use the same hostname but different protocols (HTTP for cgdsr, HTTPS for cBioPortalData).
They return the same list of studies, as a data frame in cgdsr and as a tibble in cBioPortalData.
mycase <- getCaseLists.CGDS(cgds,cancerStudy = "gbm_tcga_pub") %>%
pull(case_list_id) %>%
first()
chr [1:15] "gbm_tcga_pub_all" "gbm_tcga_pub_expr_classical" "gbm_tcga_pub_expr_mesenchymal" "gbm_tcga_pub_expr_neural" ...
## get Clinical Data, we need to specify the case ID or Sample list ID
getClinicalData.CGDS(x = cgds, caseList = mycase) %>%
str()
[1] "gbm_tcga_pub_all" "gbm_tcga_pub_expr_classical" "gbm_tcga_pub_expr_mesenchymal"
[4] "gbm_tcga_pub_expr_neural" "gbm_tcga_pub_expr_proneural" "gbm_tcga_pub_cna"
[7] "gbm_tcga_pub_methylation_all" "gbm_tcga_pub_methylation_hm27" "gbm_tcga_pub_microrna"
[10] "gbm_tcga_pub_mrna" "gbm_tcga_pub_cnaseq" "gbm_tcga_pub_sequenced"
[13] "gbm_tcga_pub_sequenced_nohyper" "gbm_tcga_pub_sequenced_nottreated" "gbm_tcga_pub_sequenced_treated"
In cgdsr, the user has to specify casesId or sampleListId to get clinical data.
#getSampleInfo(api = cbio, studyId = "gbm_tcga_pub", projection = c("SUMMARY", "ID", "DETAILED", "META"))
# get Cases or Sample list ID
myCase_cbio <- sampleLists(api = cbio, studyId = "gbm_tcga_pub") %>%
  pull(sampleListId)
# inspect separately: str() returns NULL, so it must not end the assignment pipeline
str(myCase_cbio)
chr [1:15] "gbm_tcga_pub_all" "gbm_tcga_pub_expr_classical" "gbm_tcga_pub_expr_mesenchymal" "gbm_tcga_pub_expr_neural" ...
## get Clinical data
clinicalData(api = cbio, studyId = "gbm_tcga_pub") %>%
str()
tibble [206 × 24] (S3: tbl_df/tbl/data.frame)
$ patientId : chr [1:206] "TCGA-02-0001" "TCGA-02-0003" "TCGA-02-0004" "TCGA-02-0006" ...
$ DFS_MONTHS : chr [1:206] "4.504109589" "1.315068493" "10.32328767" "9.928767123" ...
$ DFS_STATUS : chr [1:206] "1:Recurred" "1:Recurred" "1:Recurred" "1:Recurred" ...
$ KARNOFSKY_PERFORMANCE_SCORE: chr [1:206] "80.0" "100.0" "80.0" "80.0" ...
$ OS_MONTHS : chr [1:206] "11.60547945" "4.734246575" "11.34246575" "18.34520548" ...
$ OS_STATUS : chr [1:206] "1:DECEASED" "1:DECEASED" "1:DECEASED" "1:DECEASED" ...
$ PRETREATMENT_HISTORY : chr [1:206] "YES" "NO" "NO" "NO" ...
$ PRIOR_GLIOMA : chr [1:206] "NO" "NO" "NO" "NO" ...
$ SAMPLE_COUNT : chr [1:206] "1" "1" "1" "1" ...
$ SEX : chr [1:206] "Female" "Male" "Male" "Female" ...
$ sampleId : chr [1:206] "TCGA-02-0001-01" "TCGA-02-0003-01" "TCGA-02-0004-01" "TCGA-02-0006-01" ...
$ ACGH_DATA : chr [1:206] "YES" "YES" "NO" "YES" ...
$ CANCER_TYPE : chr [1:206] "Glioblastoma Multiforme" "Glioblastoma Multiforme" "Glioblastoma Multiforme" "Glioblastoma Multiforme" ...
$ CANCER_TYPE_DETAILED : chr [1:206] "Glioblastoma Multiforme" "Glioblastoma Multiforme" "Glioblastoma Multiforme" "Glioblastoma Multiforme" ...
$ COMPLETE_DATA : chr [1:206] "YES" "YES" "NO" "YES" ...
$ FRACTION_GENOME_ALTERED : chr [1:206] "0.2459" "0.1480" NA "0.2391" ...
$ MRNA_DATA : chr [1:206] "YES" "YES" "YES" "YES" ...
$ MUTATION_COUNT : chr [1:206] "3" "5" NA NA ...
$ ONCOTREE_CODE : chr [1:206] "GBM" "GBM" "GBM" "GBM" ...
$ SAMPLE_TYPE : chr [1:206] "Primary" "Primary" "Primary" "Primary" ...
$ SEQUENCED : chr [1:206] "YES" "YES" "YES" "YES" ...
$ SOMATIC_STATUS : chr [1:206] "Matched" "Matched" "Matched" "Matched" ...
$ TMB_NONSYNONYMOUS : chr [1:206] "2.36904510899" "3.94840851498" NA "0.0" ...
$ TREATMENT_STATUS : chr [1:206] "Untreated" "Untreated" "Untreated" "Untreated" ...
In cBioPortalData, we can get clinical data without specifying sampleListId. In this case, we get all clinical data for all molecularProfileIds.
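If only the samples of one sample list are needed, the study-wide table could be subset afterwards. A minimal base R sketch, assuming the sample IDs of the list are already known; the small data frame is a hand-made stand-in built from the first rows of the output above (not a live API result), and samples_in_list is a hypothetical selection:

```r
# Stand-in for the tibble returned by clinicalData(); values copied from the
# first rows of the str() output above, with only three columns kept.
clinical <- data.frame(
  patientId = c("TCGA-02-0001", "TCGA-02-0003", "TCGA-02-0004", "TCGA-02-0006"),
  sampleId  = c("TCGA-02-0001-01", "TCGA-02-0003-01", "TCGA-02-0004-01", "TCGA-02-0006-01"),
  SEX       = c("Female", "Male", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Hypothetical sample IDs belonging to the sample list of interest.
samples_in_list <- c("TCGA-02-0001-01", "TCGA-02-0006-01")

# Keep only the rows whose sampleId belongs to that sample list.
clinical_subset <- clinical[clinical$sampleId %in% samples_in_list, ]
print(clinical_subset)
```

This keeps the single study-level request and moves the sample-list restriction to the client side.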
getGeneticProfiles.CGDS(cgds,cancerStudy = "gbm_tcga_pub" ) %>%
select(genetic_profile_id, genetic_profile_name, everything()) %>%
str()
'data.frame': 10 obs. of 6 variables:
$ genetic_profile_id : chr "gbm_tcga_pub_cna_rae" "gbm_tcga_pub_cna_consensus" "gbm_tcga_pub_mutations" "gbm_tcga_pub_methylation_hm27" ...
$ genetic_profile_name : chr "Putative copy-number alterations (RAE)" "Putative copy-number alterations (Consensus)" "Mutations" "Methylation (HM27)" ...
$ genetic_profile_description : chr "Putative copy-number calls for all genes in 203 GBM cases. Copy number calls were determined from the Agilent 2"| __truncated__ "Putative copy-number calls for genes implicated in glioblastoma (206 cases). These calls were used for the path"| __truncated__ "Mutation data for targeted sequencing in 91 primary glioblastoma tumor/normal pairs (Phases I/II of the TCGA gl"| __truncated__ "Methylation beta-values (Infinium HumanMethylation27 platform). For genes with multiple methylation probes, the"| __truncated__ ...
$ cancer_study_id : int 100 100 100 100 100 100 100 100 100 100
$ genetic_alteration_type : chr "COPY_NUMBER_ALTERATION" "COPY_NUMBER_ALTERATION" "MUTATION_EXTENDED" "METHYLATION" ...
$ show_profile_in_analysis_tab: chr "true" "true" "true" "false" ...
molecularProfiles(api = cbio, studyId = "gbm_tcga_pub") %>%
select(molecularProfileId, name, everything()) %>%
str()
tibble [10 × 8] (S3: tbl_df/tbl/data.frame)
$ molecularAlterationType : chr [1:10] "COPY_NUMBER_ALTERATION" "COPY_NUMBER_ALTERATION" "MUTATION_EXTENDED" "METHYLATION" ...
$ datatype : chr [1:10] "DISCRETE" "DISCRETE" "MAF" "CONTINUOUS" ...
$ name : chr [1:10] "Putative copy-number alterations (RAE)" "Putative copy-number alterations (Consensus)" "Mutations" "Methylation (HM27)" ...
$ description : chr [1:10] "Putative copy-number calls for all genes in 203 GBM cases. Copy number calls were determined from the Agilent 2"| __truncated__ "Putative copy-number calls for genes implicated in glioblastoma (206 cases). These calls were used for the path"| __truncated__ "Mutation data for targeted sequencing in 91 primary glioblastoma tumor/normal pairs (Phases I/II of the TCGA gl"| __truncated__ "Methylation beta-values (Infinium HumanMethylation27 platform). For genes with multiple methylation probes, the"| __truncated__ ...
$ showProfileInAnalysisTab: logi [1:10] TRUE TRUE TRUE FALSE FALSE TRUE ...
$ patientLevel : logi [1:10] FALSE FALSE FALSE FALSE FALSE FALSE ...
$ molecularProfileId : chr [1:10] "gbm_tcga_pub_cna_rae" "gbm_tcga_pub_cna_consensus" "gbm_tcga_pub_mutations" "gbm_tcga_pub_methylation_hm27" ...
$ studyId : chr [1:10] "gbm_tcga_pub" "gbm_tcga_pub" "gbm_tcga_pub" "gbm_tcga_pub" ...
library(tictoc)
tic("cgdsr:")
getProfileData.CGDS(x = cgds,
genes = c("NF1", "TP53", "ABL1"),
geneticProfiles = "gbm_tcga_pub_mrna",
caseList = "gbm_tcga_pub_all") %>%
head()
toc()
cgdsr:: 0.515 sec elapsed
# get all genePanelIds
all_genePanelId <- genePanels(api = cbio) %>% pull(genePanelId)
## get all genes (Entrez ID and symbol) from every gene panel, removing duplicates
all_genes_tbl <- lapply(X = all_genePanelId, function(x) getGenePanel(api = cbio, genePanelId = x)) %>%
  bind_rows() %>%
  distinct()
# group_by(entrezGeneId, hugoGeneSymbol) %>%
# filter(n()>1) %>%
# summarize(n=n(), .groups = "rowwise")
Our_gene_entrez <- all_genes_tbl %>%
filter(hugoGeneSymbol %in% c("NF1", "TP53", "ABL1")) %>%
pull(entrezGeneId)
## [1] 7157 25 4763
tic("cBioPortalData")
molecularData(api = cbio,
molecularProfileIds = "gbm_tcga_pub_mrna",
entrezGeneIds = Our_gene_entrez,
sampleIds = "gbm_tcga_pub_all")
toc()
named list()
cBioPortalData: 0.178 sec elapsed
The output is empty.
Try the cBioPortalData function, as mentioned in issue #30.
## with Entrez
cBioPortalData(
api = cbio,
studyId = "gbm_tcga",
#genePanelId = "AmpliSeq",
genes = Our_gene_entrez, #c("NF1", "P53", "BRCA1", "BRCA2"),
molecularProfileIds = "gbm_tcga_pub_mrna",
#sampleListId = "gbm_tcga_pub_all",
sampleIds = "gbm_tcga_pub_all",
by = "entrezGeneId" #, "hugoGeneSymbol"
)
Error in split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
group length is 0 but data length > 0
## with Symbol
cBioPortalData(
api = cbio,
studyId = "gbm_tcga",
#genePanelId = "AmpliSeq",
genes = c("NF1", "P53", "BRCA1", "BRCA2"),
molecularProfileIds = "gbm_tcga_pub_mrna",
#sampleListId = "gbm_tcga_pub_all",
sampleIds = "gbm_tcga_pub_all",
by = "hugoGeneSymbol"
)
Error in split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
group length is 0 but data length > 0
cBioPortalData function with an existing genePanelId
gbm <-cBioPortalData(api = cbio,
by = "hugoGeneSymbol",
studyId = "gbm_tcga",
genePanelId = "IMPACT341",
molecularProfileIds = "gbm_tcga_pub_mrna", #c("gbm_tcga_rppa", "gbm_tcga_mrna")
)
gbm@ExperimentList@listData$gbm_tcga_pub_mrna@assays@data@listData[[1]] %>%
as.data.frame() %>%
head
gbm@ExperimentList@listData$gbm_tcga_pub_mrna@assays@data@listData[[1]] %>%
rownames() %>%
grepl(.,c("NF1", "TP53","ABL1"))
[1] FALSE FALSE TRUE
ABL1 exists, but NF1 and TP53 do not exist.
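Note that grepl() is documented to use only the first element of a pattern vector of length 2 or more, so a membership test with %in% may be a more reliable way to check which genes are present. A minimal sketch; the row names below are hypothetical and do not come from the gbm object:

```r
# Hypothetical row names; in practice use rownames() of the assay matrix.
assay_genes <- c("ABL1", "AKT1", "NF1")
query_genes <- c("NF1", "TP53", "ABL1")

# One TRUE/FALSE per queried gene, named for readability.
present <- setNames(query_genes %in% assay_genes, query_genes)
print(present)

# Genes that were asked for but are absent from the assay.
missing_genes <- setdiff(query_genes, assay_genes)
print(missing_genes)
```

With %in%, every queried symbol is compared against every row name, so the result does not depend on which gene happens to be in the first row.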
getDataByGenes
## with Symbol
getDataByGenes(
api = cbio,
studyId = "gbm_tcga",
genes =c("NF1", "P53", "ABL1"),
#genePanelId = NA_character_,
by = "hugoGeneSymbol",
molecularProfileIds = "gbm_tcga_pub_mrna",
#sampleListId = "gbm_tcga_pub_all",
sampleIds = "gbm_tcga_pub_all"
)
# named list()
## With Entrez
getDataByGenes(
api = cbio,
studyId = "gbm_tcga",
genes = Our_gene_entrez,
#genePanelId = NA_character_,
by = "entrezGeneId",
molecularProfileIds = "gbm_tcga_pub_mrna",
#sampleListId = "gbm_tcga_pub_all",
sampleIds = "gbm_tcga_pub_all"
)
named list()
ABL1 is not returned!
getMutationData.CGDS(x=cgds,
caseList = "gbm_tcga_pub_all",
geneticProfile = "gbm_tcga_pub_mutations",
genes = c("NF1", "TP53", "ABL1")) %>%
select(entrez_gene_id, gene_symbol, amino_acid_change, everything()) %>%
head()
getDataByGenes(
api = cbio,
studyId = "gbm_tcga",
genes = Our_gene_entrez,
#genePanelId = NA_character_,
by = "entrezGeneId",
molecularProfileIds = "gbm_tcga_pub_mutations",
#sampleListId = "gbm_tcga_pub_all",
sampleIds = "gbm_tcga_pub_all"
)
Error in byGeneList[mutation] <- mutationData(api, molecularProfileIds[mutation], :
replacement has length zero
cBioPortalData(
api = cbio,
studyId = "gbm_tcga",
#genePanelId = "AmpliSeq",
genes = c("NF1", "P53", "BRCA1", "ABL1"),
molecularProfileIds = "gbm_tcga_pub_mutations",
#sampleListId = "gbm_tcga_pub_all",
sampleIds = "gbm_tcga_pub_all",
by = "hugoGeneSymbol"
)
Error in byGeneList[mutation] <- mutationData(api, molecularProfileIds[mutation], :
replacement has length zero
Hi Karim, @kmezhoud
Thank you for this comprehensive comparison!
I can add this to the package as a vignette (with attribution, of course) for those looking to migrate their code from cgdsr to cBioPortalData.
The examples you provided mixed the use of gbm_tcga and gbm_tcga_pub, and that's why you were seeing empty responses.
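One way to catch this kind of mismatch early is to check that the requested molecular profile belongs to the requested study before querying. A minimal base R sketch: the lookup list is hand-written from IDs quoted in this thread and is incomplete, and check_profile is a hypothetical helper, not part of cBioPortalData. In practice, the valid IDs for a study could be taken from the molecularProfileId column returned by molecularProfiles(api = cbio, studyId = ...).

```r
# Hand-written, incomplete lookup built from IDs quoted in this thread.
profiles_by_study <- list(
  gbm_tcga_pub = c("gbm_tcga_pub_cna_rae", "gbm_tcga_pub_cna_consensus",
                   "gbm_tcga_pub_mutations", "gbm_tcga_pub_mrna"),
  gbm_tcga     = c("gbm_tcga_rppa", "gbm_tcga_mrna")
)

# Hypothetical helper: TRUE when the profile is listed under the study.
check_profile <- function(studyId, molecularProfileId, lookup) {
  molecularProfileId %in% lookup[[studyId]]
}

mixed      <- check_profile("gbm_tcga", "gbm_tcga_pub_mrna", profiles_by_study)
consistent <- check_profile("gbm_tcga_pub", "gbm_tcga_pub_mrna", profiles_by_study)
print(c(mixed = mixed, consistent = consistent))
```

A simple prefix comparison would not be enough here, because "gbm_tcga_pub_mrna" also starts with "gbm_tcga_"; an explicit membership check avoids that trap.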
The molecularData operation could use a bit more flexibility in terms of inputs. I will work on a hugoGeneSymbol input.
These are lower-level functions and are not very user friendly. If you're looking to get to the data straightaway, you can simply do:
cbio <- cBioPortal()
gbm_pub <- cBioPortalData(api = cbio, studyId = "gbm_tcga_pub",
  genes = c("NF1", "TP53", "ABL1"), by = "hugoGeneSymbol",
  molecularProfileIds = "gbm_tcga_pub_mrna")
assay(gbm_pub[["gbm_tcga_pub_mrna"]])
Best regards, Marcel
Update: I've added the ability to query the API for a table of gene symbols:
cbio <- cBioPortal()
queryGeneTable(cbio,
by = "hugoGeneSymbol",
genes = c("NF1", "TP53", "ABL1")
)
and a vignette to allow developers to migrate from cgdsr to cBioPortalData at https://github.com/waldronlab/cBioPortalData/blob/devel/vignettes/cgdsMigration.Rmd
Your feedback is welcome. Thanks!
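When converting symbols to Entrez IDs from such a table, the rows are not necessarily in the order of the query: the earlier output 7157 25 4763 corresponds to TP53, ABL1, NF1. A minimal sketch that uses match() to return the IDs in query order; gene_table is a hand-made stand-in with the two column names used earlier in this thread, not the actual return value of queryGeneTable:

```r
# Hand-made stand-in for a gene table with symbol and Entrez ID columns.
gene_table <- data.frame(
  hugoGeneSymbol = c("TP53", "ABL1", "NF1"),
  entrezGeneId   = c(7157L, 25L, 4763L),
  stringsAsFactors = FALSE
)

query_genes <- c("NF1", "TP53", "ABL1")

# match() gives, for each queried symbol, its row position in the table
# (NA if the symbol is not found), so the result follows the query order.
idx <- match(query_genes, gene_table$hugoGeneSymbol)
entrez_in_query_order <- setNames(gene_table$entrezGeneId[idx], query_genes)
print(entrez_in_query_order)
```

Keeping the symbols as names makes it easier to spot a symbol that did not resolve, since it would show up as NA under its own name.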
Dear all, I suppose that all packages depending on cgdsr will use cBioPortalData. Concretely:
1- Is there an equivalent to these 6 commands?
2- Does the structure of the data in cgdsr remain the same in cBioPortalData with:
At a quick glance, I saw that getCancerStudies is replaced by getStudies...
Thanks, Karim