waldronlab / curatedTCGAData

Curated Data From The Cancer Genome Atlas (TCGA) as MultiAssayExperiment Objects
https://bioconductor.org/packages/curatedTCGAData

seqinfo for GRanges elements? #16

Closed vjcitn closed 3 years ago

vjcitn commented 6 years ago

Note the unspecified genome and missing seqlengths in the seqinfo line at the end:

> gbmMAE
A MultiAssayExperiment object of 4 listed
 experiments with user-defined names and respective classes. 
 Containing an ExperimentList class object of length 4: 
 [1] GBM_CNASNP-20160128: RaggedExperiment with 602338 rows and 1104 columns 
 [2] GBM_mRNAArray_huex-20160128: SummarizedExperiment with 18632 rows and 431 columns 
 [3] GBM_mRNAArray_TX_g4502a-20160128: SummarizedExperiment with 17814 rows and 502 columns 
 [4] GBM_mRNAArray_TX_ht_hg_u133a-20160128: SummarizedExperiment with 12042 rows and 528 columns 
Features: 
 experiments() - obtain the ExperimentList instance 
 colData() - the primary/phenotype DataFrame 
 sampleMap() - the sample availability DataFrame 
 `$`, `[`, `[[` - extract colData columns, subset, or experiment 
 *Format() - convert into a long or wide DataFrame 
 assays() - convert ExperimentList to a SimpleList of matrices
> rowRanges(experiments(gbmMAE)[[1]])
GRanges object with 602338 ranges and 0 metadata columns:
           seqnames              ranges strand
              <Rle>           <IRanges>  <Rle>
       [1]        1      61735-25418699      *
       [2]        1   25423401-25424322      *
       [3]        1   25424889-25583341      *
       [4]        1   25593128-25662212      *
       [5]        1   25663310-72750353      *
       ...      ...                 ...    ...
  [602334]       23 148888090-148888542      *
  [602335]       23 148888898-152528086      *
  [602336]       23 152528150-152531276      *
  [602337]       23 152532889-155182354      *
  [602338]       24    2650438-59018259      *
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths
LiNk-NY commented 6 years ago

Hi Vince, @vjcitn

Thanks for pointing this out. I will see if I can modify the code in WaldronLab/MultiAssayExperiment-TCGA to update datasets with genome info.

I remember doing this in the past but I'm not sure if it worked for all datasets. Essentially, it tries to provide build information from either the file names or a column in the data.

Regards, Marcel

vjcitn commented 6 years ago

Thanks. If it can't be done reliably upstream, we should have some tools that make it simple for the user to bind the appropriate seqinfo. This would also be helpful for adding range information to SEs whose rownames are gene symbols.
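A minimal sketch of what binding seqinfo on the user side could look like, using a toy GRanges that mimics the numeric chromosome labels in the GBM output above. It rests on two assumptions that are not confirmed in this thread: that labels 23 and 24 stand for X and Y, and that hg19 is the correct build. The names `chromMap` and `hg19Lengths` are made up for illustration.

```r
library(GenomicRanges)  # also attaches GenomeInfoDb

# Toy ranges copied from the first and last rows of the GBM output
gr <- GRanges(
  seqnames = c("1", "23", "24"),
  ranges = IRanges(start = c(61735, 148888090, 2650438),
                   end = c(25418699, 148888542, 59018259))
)

# Assumption: 1..22 are autosomes, 23 is X and 24 is Y
chromMap <- setNames(paste0("chr", c(1:22, "X", "Y")), as.character(1:24))
gr <- renameSeqlevels(gr, chromMap[seqlevels(gr)])

# Assumption: the data are on hg19
genome(gr) <- "hg19"

# hg19 lengths for the three chromosomes used in this toy example
hg19Lengths <- c(chr1 = 249250621L, chrX = 155270560L, chrY = 59373566L)
seqlengths(gr) <- hg19Lengths[seqlevels(gr)]

seqinfo(gr)
```

After this, the seqinfo line should no longer report an unspecified genome or missing seqlengths. For a real dataset the lengths would cover all chromosomes rather than three hand-typed values.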

lwaldron commented 6 years ago

It is increasingly difficult to find documentation on the Broad Firehose Pipeline, and I find contradictory information. Even this FAQ seems to indicate uncertainty:

From https://confluence.broadinstitute.org/display/GDAC/FAQ#FAQ-Q%C2%A0Whatreferencegenomebuildareyouusing

Q: What reference genome build are you using?

A: We match the reference genome used in our analyses to the reference used to generate the data as appropriate. Our understanding is that TCGA standards stipulate that OV, COAD/READ, and LAML data are hg18, and all else is hg19. caveat: SNP6 copy number data is available in both hg18 and hg19 for all tumor cohorts, so we use hg19 for copy number analyses in all cases.
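To make the rule in that FAQ answer concrete, here is a small sketch that encodes only what the answer states. It is not verified against the actual files, and given the contradictions noted in this comment it should be treated as a starting guess. The function name `guessBuild`, the `copyNumber` argument, and the inclusion of the merged `COADREAD` cohort code are assumptions made for illustration.

```r
# Cohorts the FAQ answer lists as hg18; "COADREAD" is added on the
# assumption that the merged cohort follows COAD and READ.
hg18Cohorts <- c("OV", "COAD", "READ", "COADREAD", "LAML")

# Returns the build the FAQ answer implies for a cohort.
# copyNumber = TRUE reflects the caveat that hg19 is used for
# SNP6 copy number analyses in all cohorts.
guessBuild <- function(cohort, copyNumber = FALSE) {
  if (copyNumber) {
    return("hg19")
  }
  if (toupper(cohort) %in% hg18Cohorts) "hg18" else "hg19"
}

guessBuild("GBM")
guessBuild("OV")
guessBuild("OV", copyNumber = TRUE)
```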

The entry at https://confluence.broadinstitute.org/display/GDAC/FAQ#FAQ-EndOfTCGAQIunderstandthatTCGAdatahasmigratedtotheGDCbutwhydoIseediscrepanciesbetweenGDCandFireBrowse states (incorrectly, I believe) that the GDAC Firehose & FireBrowse portals ONLY serve HG19 data. Note that we are using Firehose legacy data, not data obtained through GDC.

Q: I understand that TCGA data has migrated to the GDC, but why do I see discrepancies between GDC and FireBrowse?

A: Note that the GDC serves both HG38 and HG19 data. The HG19 data are considered “legacy” and represent the original calls as made by each of the sequencing centers in TCGA; they ARE NOT the default data served by the GDC, and instead are served from the (slightly hidden) legacy archive section of the GDC portal. By default the public GDC interface serves HG38 data; these are newly generated at the GDC itself, with the intent to smooth over differences across the entire set of TCGA samples by “harmonizing” them with common variant callers and reference data. It is important to understand that these HG38 data are not the original HG19 legacy data that is discussed in most of the current TCGA publications. Lastly, note that the public GDAC Firehose & FireBrowse portals ONLY serve HG19 data; we’ve been reluctant to release HG38 data (and analyses of them) to the general public until they have gone through more in-depth QC/vetting. This QC has not been fully completed yet, but is an active area of investigation (with an analysis working group, or AWG) within the nascent GDAN. We are aiming to have a first release of HG38 GDAC pipelines in FireBrowse by Q1 of 2018, after the QC group completes its assessment to the satisfaction of the NCI.

lwaldron commented 6 years ago

I also drafted a function that adds ranges (using hg19) to the SummarizedExperiments in curatedTCGAData whose rownames are gene symbols. It's pretty specific to curatedTCGAData and contains a hack (as with the other gist I recently posted) to work around a problem with concatenating to a MultiAssayExperiment under the desired name. It would need some testing and cleaning before going into the package, but let me know if it seems useful:

https://gist.github.com/lwaldron/63b403803e91b3a3ce72592fa6e85f79

> symbolsToRanges(miniACC)
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
A MultiAssayExperiment object of 7 listed
 experiments with user-defined names and respective classes. 
 Containing an ExperimentList class object of length 7: 
 [1] Mutations: matrix with 97 rows and 90 columns 
 [2] miRNASeqGene: SummarizedExperiment with 471 rows and 80 columns 
 [3] RNASeq2GeneNorm_ranged: RangedSummarizedExperiment with 195 rows and 79 columns 
 [4] RNASeq2GeneNorm_unranged: SummarizedExperiment with 3 rows and 79 columns 
 [5] gistict_ranged: RangedSummarizedExperiment with 195 rows and 90 columns 
 [6] gistict_unranged: SummarizedExperiment with 3 rows and 90 columns 
 [7] RPPAArray_ranged: RangedSummarizedExperiment with 33 rows and 46 columns 
Features: 
 experiments() - obtain the ExperimentList instance 
 colData() - the primary/phenotype DataFrame 
 sampleMap() - the sample availability DataFrame 
 `$`, `[`, `[[` - extract colData columns, subset, or experiment 
 *Format() - convert into a long or wide DataFrame 
 assays() - convert ExperimentList to a SimpleList of matrices
> 
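
The ranged/unranged split visible in the output above can be sketched roughly as follows. This is not the code from the gist: it is a toy illustration for a single SummarizedExperiment, with a made-up lookup table standing in for a real hg19 gene annotation. The names `geneRanges` and `splitByRanges`, the gene labels, and all coordinates are hypothetical.

```r
library(SummarizedExperiment)

# Hypothetical lookup: gene ranges named by symbol (coordinates are made up)
geneRanges <- GRanges(
  seqnames = c("chr1", "chr2"),
  ranges = IRanges(start = c(100, 500), width = 50),
  strand = c("+", "-")
)
names(geneRanges) <- c("geneA", "geneB")
genome(geneRanges) <- "hg19"

# Toy experiment with symbol rownames; geneC has no entry in the lookup
se <- SummarizedExperiment(
  assays = list(counts = matrix(
    1:9, nrow = 3,
    dimnames = list(c("geneA", "geneB", "geneC"), paste0("s", 1:3))
  ))
)

# Split rows into those that can be given ranges and those that cannot
splitByRanges <- function(se, lookup) {
  found <- rownames(se) %in% names(lookup)
  ranged <- se[found, ]
  # Assigning GRanges turns the object into a RangedSummarizedExperiment
  rowRanges(ranged) <- lookup[rownames(ranged)]
  list(ranged = ranged, unranged = se[!found, ])
}

res <- splitByRanges(se, geneRanges)
class(res$ranged)
rownames(res$unranged)
```

Keeping the unmapped rows in a separate object, rather than dropping them, mirrors the `_ranged` and `_unranged` experiment names in the output above.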
vjcitn commented 6 years ago

Thanks for this information and the gists. I will try to look at them this weekend.

lwaldron commented 6 years ago

Skip the gists now and just try the conveniencefuns branch. The functions are documented there, and the branch adds an "all-in-one" simplifyTCGA function (demo in issue #18).

LiNk-NY commented 3 years ago

Moved original issue to #40. simplifyTCGA is now in TCGAutils.