waldronlab / cBioPortalData

Integrate the cancer genomics portal, cBioPortal, using MultiAssayExperiment
https://waldronlab.io/cBioPortalData/

pancan_pcawg_2020 and pan_origimed_2020 #62

Open mjsteinbaugh opened 1 year ago

mjsteinbaugh commented 1 year ago

Hi Waldron Lab,

I'm working on migrating my cBioPortal workflow code to use cBioPortalData, and the package is really excellent. Great work. One thing that I've noticed is that pancan_pcawg_2020 doesn't appear to be supported by the main cBioPortalData() or cBioDataPack() functions.

I checked using this code:

# Load the internal build report once, then pull out each build table
report <- cBioPortalData:::.loadReportData()
api <- report[["api_build"]]
pack <- report[["pack_build"]]
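
A rough sketch of the same check through the exported interface rather than the internal helper. It assumes that getStudies() accepts buildReport = TRUE and returns api_build and pack_build columns, as recent releases appear to do. build_status() is a hypothetical helper written for this example, not part of the package.

```r
# Hypothetical helper (not part of cBioPortalData): keep only the build
# flags for the requested study IDs from a getStudies()-style table.
build_status <- function(studies, ids) {
  cols <- intersect(c("studyId", "api_build", "pack_build"), names(studies))
  as.data.frame(studies[studies$studyId %in% ids, cols, drop = FALSE])
}

# The live query needs network access and the package, so it is guarded.
if (interactive() && requireNamespace("cBioPortalData", quietly = TRUE)) {
  cbio <- cBioPortalData::cBioPortal()
  # Assumption: buildReport = TRUE appends api_build / pack_build columns
  studies <- cBioPortalData::getStudies(cbio, buildReport = TRUE)
  print(build_status(studies, c("pancan_pcawg_2020", "pan_origimed_2020")))
}
```

A study that is missing from the result, or flagged FALSE, would be one that the corresponding build path does not currently handle.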

See related dataset: https://www.cbioportal.org/study/summary?id=pancan_pcawg_2020

I'm happy to help add support for this dataset if you can walk me through it. One other question: could the package provide downloads of pre-processed MultiAssayExperiment objects instead of relying on the pack file approach? Is that doable?

Best, Mike

mjsteinbaugh commented 1 year ago

Also, the pan_origimed_2020 dataset would be a really helpful addition.

https://www.cbioportal.org/study/summary?id=pan_origimed_2020

mjsteinbaugh commented 1 year ago

Found a minor bug with the BiocFileCache call: cBioDataPack() with ask = FALSE still prompts the user to create the cBioPortalData BiocFileCache directory if it doesn't exist.

LiNk-NY commented 1 year ago

Hi Michael (@mjsteinbaugh), so far I've tested with pancan_pcawg_2020, and it looks like only the mutation data can be represented.

cBioDataPack("pancan_pcawg_2020", check_build = FALSE)
Study file in cache: pancan_pcawg_2020
Working on: /tmp/RtmpO7UP1B/bb3743edba3_pancan_pcawg_2020/pancan_pcawg_2020/data_cna.txt
Working on: /tmp/RtmpO7UP1B/bb3743edba3_pancan_pcawg_2020/pancan_pcawg_2020/data_mirna_zscores.txt
Working on: /tmp/RtmpO7UP1B/bb3743edba3_pancan_pcawg_2020/pancan_pcawg_2020/data_mirna.txt
Working on: /tmp/RtmpO7UP1B/bb3743edba3_pancan_pcawg_2020/pancan_pcawg_2020/data_mrna_seq_fpkm_zscores_ref_all_samples.txt
Working on: /tmp/RtmpO7UP1B/bb3743edba3_pancan_pcawg_2020/pancan_pcawg_2020/data_mrna_seq_fpkm.txt
Working on: /tmp/RtmpO7UP1B/bb3743edba3_pancan_pcawg_2020/pancan_pcawg_2020/data_mutations.txt
Working on: /tmp/RtmpO7UP1B/bb3743edba3_pancan_pcawg_2020/pancan_pcawg_2020/data_timeline_status.txt
harmonizing input:
  removing 18 colData rownames not in sampleMap 'primary'
A MultiAssayExperiment object of 1 listed
 experiment with a user-defined name and respective class.
 Containing an ExperimentList class object of length 1:
 [1] mutations: RaggedExperiment with 382937 rows and 2683 columns
Functionality:
 experiments() - obtain the ExperimentList instance
 colData() - the primary/phenotype DataFrame
 sampleMap() - the sample coordination DataFrame
 `$`, `[`, `[[` - extract colData columns, subset, or experiment
 *Format() - convert into a long or wide DataFrame
 assays() - convert ExperimentList to a SimpleList of matrices
 exportClass() - save data to flat files

You can download the data manually using downloadStudy() and take a look at the contents. I am not sure why the CNA and the other datasets are not being built. I will take a closer look later.
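
A minimal sketch of that manual inspection, assuming downloadStudy() returns the path to the downloaded tarball and untarStudy() returns the extraction directory. list_study_tables() is a hypothetical helper written for this example, and the five-row preview is only an illustrative choice.

```r
# Hypothetical helper (not part of cBioPortalData): find the data_*.txt
# tables anywhere under an extracted study directory.
list_study_tables <- function(study_dir) {
  list.files(study_dir, pattern = "^data_.*\\.txt$",
             recursive = TRUE, full.names = TRUE)
}

# The download needs network access and the package, so it is guarded.
if (interactive() && requireNamespace("cBioPortalData", quietly = TRUE)) {
  tarball <- cBioPortalData::downloadStudy("pancan_pcawg_2020")
  study_dir <- cBioPortalData::untarStudy(tarball, exdir = tempdir())

  tables <- list_study_tables(study_dir)
  print(basename(tables))

  # Preview the CNA table, if present, to see how it is laid out
  cna_file <- tables[basename(tables) == "data_cna.txt"]
  if (length(cna_file) == 1L) {
    print(utils::read.delim(cna_file, nrows = 5, comment.char = "#",
                            check.names = FALSE))
  }
}
```

Comparing the column layout of data_cna.txt with that of a study that builds cleanly may help narrow down why the CNA assay is dropped.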

Use cBioPortalData version 2.9.11 or greater.

Best, Marcel

mjsteinbaugh commented 1 year ago

Thanks, @LiNk-NY. I'll take a look and get back to you. I'm primarily interested in the CNA data for both datasets, which I can query via the API but not via the package's main recommended functions.