DarioS closed this issue 4 years ago
Hi Dario, @DarioS Thanks for reporting this. I'll look into it. Best, Marcel
Hi Dario, @DarioS This is fixed in https://github.com/waldronlab/TCGAutils/commit/d6481be0bf3bd60c96e60f4d5768674722b0695c.
Hi, I am still seeing the same issue running TCGAutils_1.7.11: one filename vs. tens of filenames outputs different results.
There is also an error thrown when using thousands of them: Error in data.frame(file_name = info[["file_name"]], file_id = info[["file_id"]], : arguments imply differing number of rows: 11038, 22075
Inputting one filename usually returns a data.frame with 2 rows not 1 row, which seems likely related to this issue (11038*2=22076).
Some files, like vcf files, correspond to more than one aliquot (here 2 aliquots = normal+tumour), which explains this. There is one vcf in my list for which only the aliquot of the germline is returned.
Now using this in the function: aliquots.submitter_id = sapply(info$cases, function(x) paste(unlist(x),collapse=","))
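For readers following along without GDC access, the length mismatch can be reproduced offline. The sketch below uses a hand-made `info` list that only loosely imitates what `results_all()` returns: the file names and barcodes are invented placeholders, and the real nested structure is deeper. It shows that `unlist()` over per-file aliquot lists of unequal length gives a vector that no longer lines up with `file_name`, while collapsing per file gives exactly one string per file.

```r
# Mock of the query result; all values are made-up placeholders.
info <- list(
  file_name = c("file-a.vcf.gz", "file-b.vcf.gz"),
  file_id   = c("id-a", "id-b"),
  cases     = list(
    list("BARCODE-A-NORMAL", "BARCODE-A-TUMOUR"),  # two aliquots
    list("BARCODE-B-NORMAL")                       # only one aliquot
  )
)

# Flattening everything gives 3 values for 2 files.
flat <- unlist(info[["cases"]])

# data.frame() refuses columns of length 2 and 3; capture the message.
err <- tryCatch({
  data.frame(file_name = info[["file_name"]], aliquots.submitter_id = flat)
  NA_character_
}, error = function(e) conditionMessage(e))

# Collapsing within each file keeps one element per file.
collapsed <- vapply(
  info[["cases"]],
  function(x) paste(unlist(x), collapse = ","),
  character(1)
)

res <- data.frame(
  file_name = info[["file_name"]],
  file_id = info[["file_id"]],
  aliquots.submitter_id = collapsed,
  stringsAsFactors = FALSE
)
```

Here `vapply()` is used instead of `sapply()` only so that the result is guaranteed to be a character vector; the collapsing logic is the same as in the line quoted above.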
Hi @galder-max Do you have a small reproducible example? That would help me to get this moving. Thanks so much! -Marcel
Hi Marcel, many thanks for looking into this so quickly! Sure, here is a minimal R example, hopefully it illustrates the issue well:
library(TCGAutils)  # provides filenameToBarcode()
# minimal example with three vcf filenames
vcf_filenames <- c("0000fac0-cd56-457d-bab9-2ae9bdd9a93c.vcf.gz",
                   "7c7e28e5-39e2-4f29-8cdf-7af776614a42.vcf.gz",
                   "000541c3-705e-495f-ac48-148449f40e10.vcf.gz")
# getting aliquots separately: second vcf returns only one aliquot (weirdly)
# problem1: only the first of the aliquots is returned twice (can be either normal or tumour)
lapply(vcf_filenames, function(vcf_filename)
    filenameToBarcode(vcf_filename, legacy = FALSE))
# problem2: this crashes when the files are queried together, because the current unlist yields a vector whose length differs from (and is not a multiple of) the length of the other elements of info:
filenameToBarcode(vcf_filenames[c(1, 2)], legacy = FALSE)
# runs without error with more than one vcf if both filenames return the same number of aliquots
filenameToBarcode(vcf_filenames[c(1, 3)], legacy = FALSE)
# this is solved when pasting the aliquots within each element of info[["cases"]] before unlisting/simplifying to character vector:
library(GenomicDataCommons)
filenameToBarcode <- function (filenames, legacy = FALSE)
{
    filesres <- files(legacy = legacy)
    info <- results_all(select(filter(filesres, ~ file_name %in% filenames),
        c("file_name", "cases.samples.portions.analytes.aliquots.submitter_id")))
    res <- data.frame(file_name = info[["file_name"]],
        file_id = info[["file_id"]],
        # collapse all aliquots of one file into a single comma-separated string
        aliquots.submitter_id = sapply(info[["cases"]],
            function(cases) paste(unlist(cases), collapse = ",")),
        row.names = NULL, stringsAsFactors = FALSE)
    res[na.omit(match(res[["file_name"]], filenames)), ]
}
# no more crash
filenameToBarcode(vcf_filenames[c(1, 2)], legacy = FALSE)
# both aliquots are properly returned
lapply(vcf_filenames, function(vcf_filename)
    filenameToBarcode(vcf_filename, legacy = FALSE))
You might want to go for a different option than collapsing, but it does the job for my purpose. Hope this helps.
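One possible alternative to collapsing, sketched here purely as an illustration and not as what the package does, is a "long" table with one row per file/aliquot pair. The mock `file_name` and `cases` objects below are invented placeholders standing in for the fields of the query result.

```r
# Mock inputs; all values are made-up placeholders.
file_name <- c("file-a.vcf.gz", "file-b.vcf.gz")
cases <- list(
  list("BARCODE-A-NORMAL", "BARCODE-A-TUMOUR"),  # two aliquots
  list("BARCODE-B-NORMAL")                       # only one aliquot
)

# Number of aliquots found for each file.
n_per_file <- lengths(lapply(cases, unlist))

# Repeat each file name once per aliquot so both columns have equal length.
long <- data.frame(
  file_name = rep(file_name, times = n_per_file),
  aliquots.submitter_id = unlist(cases),
  stringsAsFactors = FALSE
)
```

The long layout avoids having to split comma-separated strings later, at the cost of returning more rows than input filenames.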
Thanks @galder-max I am looking at this now.
Fixed in cc02730e0f7a8558472ae289e93b6f277db95449. Thanks for the report.
The incorrect reporting of Submitter ID happens if I use hundreds of filenames, not just one filename.
gdc_manifest_20190212_042717.txt
This is the incorrect Submitter ID, as the data portal shows.
If I provide one file name, then the function determines the correct Submitter ID (it's the same as shown in the data portal).
There's a vector sorting bug somewhere.
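One way an ordering problem of this kind can arise, offered as an assumption and not as a confirmed diagnosis of the package code, is the direction of a `match()` call when query results come back in a different order than the input. The offline sketch below uses invented file names and barcodes: indexing with `match(res$file_name, filenames)` applies the inverse permutation and does not restore input order, whereas `match(filenames, res$file_name)` does.

```r
# Input order requested by the user; values are made-up placeholders.
filenames <- c("file-a.vcf.gz", "file-b.vcf.gz", "file-c.vcf.gz")

# Pretend the service returned the rows in a different order.
res <- data.frame(
  file_name = c("file-c.vcf.gz", "file-a.vcf.gz", "file-b.vcf.gz"),
  barcode   = c("BARCODE-C", "BARCODE-A", "BARCODE-B"),
  stringsAsFactors = FALSE
)

# Positions of the result rows within the input: not a reordering index for res.
inverse <- res[match(res[["file_name"]], filenames), ]

# Positions of the inputs within the result: rows now follow the input order.
aligned <- res[match(filenames, res[["file_name"]]), ]
```

Each row of `inverse` still pairs a file name with its own barcode, so the mistake only shows up when the output is assumed to be in input order, for example when a column is later bound to the original vector by position.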