waldronlab / cBioPortalData

Integrate the cancer genomics portal, cBioPortal, using MultiAssayExperiment
https://waldronlab.io/cBioPortalData/

How to get molecularData without entrezGeneIds #57

Closed limbo1996 closed 2 years ago

limbo1996 commented 2 years ago

Hello. When I use molecularData(), I find that entrezGeneIds is required. For example:

library(cBioPortalData)

cbio <- cBioPortal()  # connect to the public cBioPortal API
test <- molecularData(cbio, 
                      molecularProfileId = "acc_tcga_rna_seq_v2_mrna",
                      entrezGeneIds = 1:1000, # a range of entrezGeneIds
                      sampleIds = c("TCGA-OR-A5J1-01",  "TCGA-OR-A5J2-01")
                      )

But this returns only part of the data for these two samples. How do I get all the data for these two samples in acc_tcga_rna_seq_v2_mrna without having to specify a range of entrezGeneIds? Thanks a lot.

LiNk-NY commented 2 years ago

Hi @limbo1996, the API was designed to take slices of the data, so either entrezGeneIds or Hugo symbols are required. If you'd like to get all the data, you can try the bulk method:

acc <- cBioDataPack("acc_tcga")
acc
#' A MultiAssayExperiment object of 11 listed
#'  experiments with user-defined names and respective classes.
#'  Containing an ExperimentList class object of length 11:
#'  [1] cna_hg19.seg: RaggedExperiment with 16080 rows and 90 columns
#'  [2] CNA: SummarizedExperiment with 24776 rows and 90 columns
#'  [3] linear_CNA: SummarizedExperiment with 24776 rows and 90 columns
#'  [4] methylation_hm450: SummarizedExperiment with 15755 rows and 80 columns
#'  [5] mutations_extended: RaggedExperiment with 20166 rows and 90 columns
#'  [6] mutations_mskcc: RaggedExperiment with 20166 rows and 90 columns
#'  [7] RNA_Seq_v2_expression_median: SummarizedExperiment with 20531 rows and 79 columns
#'  [8] RNA_Seq_v2_mRNA_median_all_sample_Zscores: SummarizedExperiment with 20531 rows and 79 columns
#'  [9] RNA_Seq_v2_mRNA_median_Zscores: SummarizedExperiment with 20440 rows and 79 columns
#'  [10] rppa_Zscores: SummarizedExperiment with 191 rows and 46 columns
#'  [11] rppa: SummarizedExperiment with 192 rows and 46 columns
#' Functionality:
#'  experiments() - obtain the ExperimentList instance
#'  colData() - the primary/phenotype DataFrame
#'  sampleMap() - the sample coordination DataFrame
#'  `$`, `[`, `[[` - extract colData columns, subset, or experiment
#'  *Format() - convert into a long or wide DataFrame
#'  assays() - convert ExperimentList to a SimpleList of matrices
#'  exportClass() - save data to flat files

Then filter the result by sample ID.
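One way to do that filtering is sketched below. It assumes the experiment name `RNA_Seq_v2_expression_median` from the listing above (names can differ between package and data versions) and that the column names of that experiment are the sample IDs used in the original question. The variable names `samples`, `rna`, and `rna_sub` are only illustrative.

```r
library(cBioPortalData)
library(SummarizedExperiment)

# Bulk download of the whole study as a MultiAssayExperiment
acc <- cBioDataPack("acc_tcga")

# Sample IDs from the original question
samples <- c("TCGA-OR-A5J1-01", "TCGA-OR-A5J2-01")

# Experiment name taken from the printed listing; check it exists first,
# because it may be named differently in other versions of the data
assay_name <- "RNA_Seq_v2_expression_median"
stopifnot(assay_name %in% names(acc))

# `[[` extracts a single experiment (here a SummarizedExperiment)
rna <- acc[[assay_name]]

# Keep only the columns whose names match the requested sample IDs
# (assumes column names are sample IDs)
rna_sub <- rna[, colnames(rna) %in% samples]

# Expression matrix with all genes for the selected samples
expr <- assay(rna_sub)
dim(expr)
```

If none of the IDs match, `rna_sub` has zero columns, so it is worth comparing `colnames(rna)` against the expected IDs before going further.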

limbo1996 commented 2 years ago

@LiNk-NY Thanks for your reply! But when I use cBioDataPack, it returns:

Warning messages:
1: Unable to import: mrna_seq_v2_rsem
Reason: missing value where TRUE/FALSE needed 
2: Unable to import: mrna_seq_v2_rsem_zscores_ref_all_samples
Reason: missing value where TRUE/FALSE needed 
3: In .find_with_xfix(df_colnames, get(paste0(fix, 1)), get(paste0(fix,  :
   Multiple prefixes found, using keyword 'region' or taking first one
4: In .find_with_xfix(df_colnames, get(paste0(fix, 1)), get(paste0(fix,  :
   Multiple prefixes found, using keyword 'region' or taking first one

So what can I do to import all the mRNA files? Thanks again!

LiNk-NY commented 2 years ago

Hi @limbo1996

Thanks for pointing this out. It seems to be an issue with missing rownames in the data and the way the SummarizedExperiment constructor function handles name checks. If interested, you can follow the issue here:

https://github.com/Bioconductor/SummarizedExperiment/issues/64

As far as I can tell, there is also an issue on the curation side: when reading the data manually, there are NA values in the Hugo_Symbol column. You can use downloadStudy and then untarStudy to inspect the contents of the tarball.
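A rough sketch of that inspection follows. The file name pattern `data_mrna_seq_v2_rsem.txt` is an assumption inferred from the profile name in the warning message, and the `#` comment handling is a guess at the file layout; adjust both after looking at the actual file listing.

```r
library(cBioPortalData)

# Download the study tarball and unpack it into a temporary directory
acc_file <- downloadStudy("acc_tcga")
acc_dir <- untarStudy(acc_file, tempdir())

# List everything that was extracted
files <- list.files(acc_dir, recursive = TRUE, full.names = TRUE)
basename(files)

# Assumed file name for the profile that failed to import
rsem_file <- grep("data_mrna_seq_v2_rsem\\.txt$", files, value = TRUE)
stopifnot(length(rsem_file) >= 1)

# Read the tab-delimited table, keeping sample IDs as column names
rsem <- read.delim(rsem_file[1], check.names = FALSE, comment.char = "#")

# Count rows with a missing gene symbol
sum(is.na(rsem$Hugo_Symbol))
```

A non-zero count here would match the missing row names described above.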

Best, Marcel