waldronlab / curatedTCGAData

Curated Data From The Cancer Genome Atlas (TCGA) as MultiAssayExperiment Objects
https://bioconductor.org/packages/curatedTCGAData

genome tag for mutation RaggedExperiment seems peculiar #40

Closed vjcitn closed 3 years ago

vjcitn commented 3 years ago

This follows from running:

acc <- curatedTCGAData("ACC", "Mutation", dry.run = FALSE)
ra <- experiments(acc)[[1]]
> rowRanges(ra)
GRanges object with 20166 ranges and 0 metadata columns:
          seqnames              ranges strand
             <Rle>           <IRanges>  <Rle>
      [1]        1            11561526      +
      [2]        1            12309384      +
      [3]        1            33820015      +
      [4]        1 152785074-152785097      +
      [5]        1           152800122      +
      ...      ...                 ...    ...
  [20162]        5 131007363-131007364      +
  [20163]        7   90894459-90894460      +
  [20164]        9 139581758-139581759      +
  [20165]       16   90095596-90095597      +
  [20166]       19   58385798-58385799      +
  -------
  seqinfo: 24 sequences from 37 genome; no seqlengths

Should this be b37 or GRCh37?

LiNk-NY commented 3 years ago

Hi Vince (@vjcitn), sorry for the late response. The data from the GDAC Firehose pipeline come with an NCBI_Build column whose values are all 37, which is where the bare "37" genome tag originates. I think I have a helper function to convert this to GRCh37. I will update the pipeline to do this.

suppressPackageStartupMessages({
    library(RTCGAToolbox)
    library(SummarizedExperiment)
})
acc <- getFirehoseData("ACC", Mutation = TRUE)
#> gdac.broadinstitute.org_ACC.Clinical_Pick_Tier1.Level_4.2016012800.0.0
rag <- biocExtract(acc, "Mutation")
#> working on: Mutation
rowRanges(rag)
#> GRanges object with 20166 ranges and 0 metadata columns:
#>           seqnames              ranges strand
#>              <Rle>           <IRanges>  <Rle>
#>       [1]        1            11561526      +
#>       [2]        1            12309384      +
#>       [3]        1            33820015      +
#>       [4]        1 152785074-152785097      +
#>       [5]        1           152800122      +
#>       ...      ...                 ...    ...
#>   [20162]        5 131007363-131007364      +
#>   [20163]        7   90894459-90894460      +
#>   [20164]        9 139581758-139581759      +
#>   [20165]       16   90095596-90095597      +
#>   [20166]       19   58385798-58385799      +
#>   -------
#>   seqinfo: 24 sequences from 37 genome; no seqlengths
head(acc@Mutation)[1:5]
#>   Hugo_Symbol Entrez_Gene_Id                 Center NCBI_Build Chromosome
#> 1       ACAP3         116983 broad.mit.edu;bcgsc.ca         37          1
#> 2        NOL9          79707           hgsc.bcm.edu         37          1
#> 3        NOL9          79707           hgsc.bcm.edu         37          1
#> 4         SRM           6723           hgsc.bcm.edu         37          1
#> 5       DHRS3           9249               bcgsc.ca         37          1
#> 6       OPRD1           4985           hgsc.bcm.edu         37          1

Created on 2020-09-30 by the reprex package (v0.3.0)
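
For illustration, the conversion from a bare build number to a GRCh label could be sketched as below. This is a minimal base R sketch under stated assumptions, not the TCGAutils helper referred to above: the function name normalize_ncbi_build and the ncbi_build vector are hypothetical, and only builds 37 and 38 are mapped because those are the ones that carry a GRCh name.

```r
# Hypothetical helper (not the TCGAutils implementation): map a bare NCBI
# build number such as 37 or "37" to its "GRCh"-prefixed label.
normalize_ncbi_build <- function(build) {
    build <- trimws(as.character(build))
    # Only 37 and 38 get the prefix; anything else (e.g. "GRCh37", "hg19",
    # NA) is returned unchanged.
    is_grc <- build %in% c("37", "38")
    build[is_grc] <- paste0("GRCh", build[is_grc])
    build
}

# Stand-in for the NCBI_Build column, which holds 37 in every row shown above.
ncbi_build <- c(37, 37, 37)
label <- unique(normalize_ncbi_build(ncbi_build))
print(label)
```

On a GRanges object gr, the resulting label could then be applied with GenomeInfoDb::genome(gr) <- label, assuming all rows share a single build.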

LiNk-NY commented 3 years ago

Fixed in https://github.com/waldronlab/TCGAutils/commit/ba3d11b49c766d513a31d2aac001def6afa52d66. The fix will be ported to curatedTCGAData data version 2.0.0.