waldronlab / cBioPortalData

Integrate the cancer genomics portal, cBioPortal, using MultiAssayExperiment
https://waldronlab.io/cBioPortalData/

cBioDataPack Error for cesc_tcga_pan_can_atlas_2018 #28

Closed jebard closed 4 years ago

jebard commented 4 years ago

Hello! I was wondering if anyone has encountered this problem when loading the study "cesc_tcga_pan_can_atlas_2018".

Is there a known workaround?

cesc_pan_2018 <- cBioDataPack("cesc_tcga_pan_can_atlas_2018")

Parsed with column specification:
cols(
  .default = col_character()
)
See spec(...) for full column specifications.
Error in seqlevels[rankSeqlevels(seqlevels)] <- seqlevels :
  NAs are not allowed in subscripted assignments.

lwaldron commented 4 years ago

Thanks for the report, @jebard. It's been a bit busy prepping for BioC2020, but we'll get to this as soon as possible. As a workaround in the meantime, you might try the API method; here's some example code. I have to admit I waited a while on the last call, but not long enough to get a result back from the API. This is a big study, and that call might be requesting too much at once.

library(cBioPortalData)
cbio <- cBioPortal()
View(genePanels(cbio))
View(geneTable(cbio))

(mp <- molecularProfiles(cbio, studyId = "cesc_tcga_pan_can_atlas_2018"))
(sl <- sampleLists(cbio, studyId = "cesc_tcga_pan_can_atlas_2018"))
samples <- samplesInSampleLists(cbio, sampleListIds = sl$sampleListId)

molecularData(
  api = cbio,
  molecularProfileId = "cesc_tcga_pan_can_atlas_2018_rna_seq_v2_mrna",
  sampleIds = samples$cesc_tcga_pan_can_atlas_2018_all,
  entrezGeneIds = c(1, 2)
)

clinicalData(cbio, studyId = "cesc_tcga_pan_can_atlas_2018")

res <- cBioPortalData(cbio, 
                      genePanelId = "grail_cfdna_508",
                      studyId = "cesc_tcga_pan_can_atlas_2018")

jebard commented 4 years ago

Thanks very much for following up. I'll give the code snippet a try. I traced the issue a little on my end and think it's isolated to the way the data tables are being read in, specifically this line within the function cBioDataPack():

dat <- as.data.frame(readr::read_tsv(fname, comment = "#"), check.names = FALSE)

To get around it temporarily, I swapped this line out for: dat <- read.table(fname, header = TRUE, fill = TRUE)

That seems to return reasonable results, though it's hard to validate whether all of the tables are processed appropriately when I do this. I have a feeling it's some sort of oddity with that particular study.

Thanks!
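One way to check the read.table swap mentioned above is to compare it against a tab-aware reader on a small file. The sketch below uses a made-up two-row table (the column names and values are hypothetical, not taken from the study). It relies on the documented defaults: read.table() uses sep = "", which splits on any whitespace, while read.delim() uses sep = "\t". Whether the study's files actually contain spaces inside fields is an assumption.

```r
# Hypothetical tab-delimited table with a space inside one field.
lines <- c(
  "Hugo_Symbol\tConsequence\tChromosome",
  "TP53\tmissense variant\t17",
  "ATRX\tstop gained\tX"
)
fname <- tempfile(fileext = ".txt")
writeLines(lines, fname)

# read.table() defaults to sep = "": any run of whitespace separates fields,
# so "missense variant" is split into two fields.
by_space <- read.table(fname, header = TRUE, fill = TRUE,
                       stringsAsFactors = FALSE)

# read.delim() defaults to sep = "\t", so each tab-separated field stays whole.
by_tab <- read.delim(fname, stringsAsFactors = FALSE)

# A mismatch between the two suggests that fields were shifted.
identical(by_space$Hugo_Symbol, by_tab$Hugo_Symbol)
by_tab$Hugo_Symbol
```

If the two reads disagree on a real file, the tab-delimited read is the one that matches the file layout.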

LiNk-NY commented 4 years ago

Hi Jonathan, @jebard

I've looked into this and pushed a fix to the underlying code: https://github.com/waldronlab/cBioPortalData/commit/80fc587158b9ff52096f122fd41ae634132ef5e7

The commit makes two changes:

  1. switched to using read.delim(sep = "\t")
  2. removed rows that have an NA chromosome value after reading the file in

The source of the issue was that readr::read_tsv converts the chromosome "X" values into NA.
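A minimal, self-contained sketch of the behaviour and of the two changes, using a made-up three-row file rather than the real study (all gene names and positions here are hypothetical). One plausible mechanism, not confirmed in this thread, is that the Chromosome column type is inferred as numeric from all-numeric leading rows. The sketch forces that type explicitly so the result does not depend on type guessing.

```r
library(readr)

# Hypothetical miniature of a mutations file; the real one has many more columns.
fname <- tempfile(fileext = ".txt")
writeLines(c(
  "#version 2.4",
  "Hugo_Symbol\tChromosome\tStart_Position",
  "TP53\t17\t7577120",
  "ATRX\tX\t76937012",
  "GENE3\tNA\t100"
), fname)

# Assumption: force the numeric type that a guess from numeric-only rows would give.
as_number <- suppressWarnings(read_tsv(
  fname, comment = "#",
  col_types = cols(Chromosome = col_double(), .default = col_character())
))
is.na(as_number$Chromosome)  # "X" cannot be parsed as a number and becomes NA

# Change 1: read.delim() converts the column as a whole, so it stays text.
dat <- read.delim(fname, sep = "\t", comment.char = "#",
                  stringsAsFactors = FALSE)

# Change 2: drop rows whose chromosome is NA after reading.
dat <- dat[!is.na(dat$Chromosome), , drop = FALSE]
dat$Chromosome
```

With the text column, "X" survives the read and only the genuinely missing row is removed.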

library(cBioPortalData)
cesc_pan_2018 <- cBioDataPack("cesc_tcga_pan_can_atlas_2018")

tarloc <- downloadStudy("cesc_tcga_pan_can_atlas_2018")

outdir <- file.path(tempdir(), "cesc")
dir.create(outdir)
studyloc <- untarStudy(tarloc, exdir = outdir)

## currently in use
a <- readr::read_delim(file.path(studyloc, "data_mutations_extended.txt"),
    comment = "#", delim = "\t")
table(a$Chromosome, useNA="always")

## possible alternative
b <- readr::read_tsv(file.path(studyloc, "data_mutations_extended.txt"),
    comment = "#", col_types = readr::cols(.default = readr::col_character()))
## necessary step to convert actual numeric columns to numeric
bb <- readr::type_convert(b)
table(bb$Chromosome, useNA="always")

## proposed change
c <- read.delim(file.path(studyloc, "data_mutations_extended.txt"),
    comment.char = "#")
table(c$Chromosome, useNA="always")

system.time({
    a <- readr::read_tsv(file.path(studyloc, "data_mutations_extended.txt"),
        comment = "#", col_types = readr::cols(.default = readr::col_character()))
    readr::type_convert(a)
})

## faster
system.time({
    c <- read.delim(file.path(studyloc, "data_mutations_extended.txt"),
        comment.char = "#")
})

jebard commented 4 years ago

Perfect! Thanks for taking a look, appreciate it!