waldronlab / TCGAutils

Toolbox package for organizing and working with TCGA data
https://bioconductor.org/packages/TCGAutils

Filename Conversion to TCGA ID Incorrect #22

Status: Closed (DarioS closed this issue 4 years ago)

DarioS commented 5 years ago

The incorrect reporting of Submitter ID happens if I use hundreds of filenames, not just one filename.

gdc_manifest_20190212_042717.txt

manifest <- read.delim("gdc_manifest_20190212_042717.txt")
conversionTable <- filenameToBarcode(manifest[, "filename"])
> conversionTable[95, ]
                                                   file_name                              file_id        aliquots.submitter_id
95 33098d4a-c424-4d7a-ba13-3c5c50d9d6ac_gdc_realn_rehead.bam 70253fd8-5f0c-4826-a605-1195afd6d4c6 TCGA-CN-4737-01A-01R-1436-07

This is the incorrect Submitter ID, as the data portal shows.

[Screenshot of the data portal entry]

If I provide one file name, then the function determines the correct Submitter ID (it's the same as shown in the data portal).

> filenameToBarcode("33098d4a-c424-4d7a-ba13-3c5c50d9d6ac_gdc_realn_rehead.bam")
                                                  file_name                              file_id        aliquots.submitter_id
1 33098d4a-c424-4d7a-ba13-3c5c50d9d6ac_gdc_realn_rehead.bam 98fdc192-ad6d-4a92-8827-fa2ea8c61615 TCGA-CV-5977-01A-11R-1686-07

There's a vector sorting bug somewhere.
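To illustrate the kind of ordering problem suspected here, below is a minimal sketch in base R. It assumes a lookup service that returns records in a different order than requested. All file names and barcodes are made up, and the thread does not confirm that this is the actual cause inside filenameToBarcode.

```r
# Hypothetical data: what was requested vs. what a service returned (different order)
requested <- c("a.bam", "b.bam", "c.bam")
returned <- data.frame(
    file_name = c("c.bam", "a.bam", "b.bam"),
    barcode   = c("BC-C", "BC-A", "BC-B"),
    stringsAsFactors = FALSE
)

# Safe: reorder whole rows, so file_name and barcode move together
aligned <- returned[match(requested, returned$file_name), ]
rownames(aligned) <- NULL

# Unsafe: sorting one column on its own breaks the pairing
broken <- data.frame(
    file_name = sort(returned$file_name),
    barcode   = returned$barcode,
    stringsAsFactors = FALSE
)

aligned  # each file keeps its own barcode
broken   # a.bam is now paired with BC-C
```

With a single input there is nothing to reorder, which would be consistent with the single-filename call giving the right answer.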

LiNk-NY commented 5 years ago

Hi Dario, @DarioS Thanks for reporting this. I'll look into it. Best, Marcel

LiNk-NY commented 5 years ago

Hi Dario, @DarioS This is fixed in https://github.com/waldronlab/TCGAutils/commit/d6481be0bf3bd60c96e60f4d5768674722b0695c.

galder-max commented 4 years ago

Hi, I am still seeing the same issue running TCGAutils_1.7.11: one filename vs. tens of filenames outputs different results.

There is also an error thrown when using thousands of them: Error in data.frame(file_name = info[["file_name"]], file_id = info[["file_id"]], : arguments imply differing number of rows: 11038, 22075

Inputting one filename usually returns a data.frame with 2 rows not 1 row, which seems likely related to this issue (11038*2=22076).

galder-max commented 4 years ago

Some files, like vcf files, correspond to more than one aliquot (here 2 aliquots = normal+tumour), which explains this. There is one vcf in my list for which only the aliquot of the germline is returned.

As a workaround, I am now using this in the function: aliquots.submitter_id = sapply(info$cases, function(x) paste(unlist(x),collapse=","))
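The length mismatch and the collapsing workaround can be reproduced offline with a toy list. This is a sketch only: the `cases` object below is a hypothetical stand-in for info[["cases"]], with made-up barcodes, and its exact nesting in the real GDC response may differ.

```r
# Hypothetical stand-in for info[["cases"]]: one element per file,
# each holding that file's aliquot barcodes (made-up values)
cases <- list(
    file1 = list(c("TUMOUR-1", "NORMAL-1")),
    file2 = list("NORMAL-2")
)
file_name <- c("file1.vcf.gz", "file2.vcf.gz")

# Flattening everything yields 3 values for 2 files
flat <- unlist(cases, use.names = FALSE)

# data.frame() rejects the 2-vs-3 length mismatch; capture the error message
err <- tryCatch(
    data.frame(file_name = file_name, aliquots.submitter_id = flat),
    error = function(e) conditionMessage(e)
)

# Collapsing per file keeps exactly one string per file
collapsed <- vapply(cases, function(x) paste(unlist(x), collapse = ","),
                    character(1))
res <- data.frame(file_name = file_name, aliquots.submitter_id = collapsed,
                  row.names = NULL, stringsAsFactors = FALSE)
res
```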

LiNk-NY commented 4 years ago

Hi @galder-max Do you have a small reproducible example? That would help me to get this moving. Thanks so much! -Marcel

galder-max commented 4 years ago

Hi Marcel, many thanks for looking into this so quickly! Sure, here is a minimal R example, hopefully it illustrates the issue well:

# Minimal example with three VCF filenames
library(TCGAutils)

vcf_filenames <- c("0000fac0-cd56-457d-bab9-2ae9bdd9a93c.vcf.gz",
                   "7c7e28e5-39e2-4f29-8cdf-7af776614a42.vcf.gz",
                   "000541c3-705e-495f-ac48-148449f40e10.vcf.gz")

# Getting aliquots separately: the second VCF returns only one aliquot (weirdly).
# Problem 1: only the first of the aliquots is returned, twice
# (it can be either the normal or the tumour).
lapply(vcf_filenames, function(vcf_filename)
    filenameToBarcode(vcf_filename, legacy = FALSE))

# Problem 2: this therefore crashes when the files are queried together, because
# the current unlist() leads to a vector of a different size (and not a multiple
# of the size) compared to the other elements of info:
filenameToBarcode(vcf_filenames[c(1, 2)], legacy = FALSE)

# Runs without error with more than one VCF if both filenames return
# the same number of aliquots
filenameToBarcode(vcf_filenames[c(1, 3)], legacy = FALSE)

# This is solved by pasting the aliquots within each element of info[["cases"]]
# before unlisting/simplifying to a character vector:

library(GenomicDataCommons)

filenameToBarcode <- function(filenames, legacy = FALSE) {
    filesres <- files(legacy = legacy)
    info <- results_all(
        select(filter(filesres, ~ file_name %in% filenames),
               c("file_name", "cases.samples.portions.analytes.aliquots.submitter_id"))
    )
    res <- data.frame(
        file_name = info[["file_name"]],
        file_id = info[["file_id"]],
        # collapse all aliquots of one file into a single comma-separated string
        aliquots.submitter_id = vapply(info[["cases"]],
            function(cases) paste(unlist(cases), collapse = ","), character(1)),
        row.names = NULL, stringsAsFactors = FALSE
    )
    res[na.omit(match(res[["file_name"]], filenames)), ]
}

# No more crash
filenameToBarcode(vcf_filenames[c(1, 2)], legacy = FALSE)

# Both aliquots are properly returned
lapply(vcf_filenames, function(vcf_filename)
    filenameToBarcode(vcf_filename, legacy = FALSE))

You might want to go for a different option than collapsing, but it does the job for my purposes. Hope this helps.
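One possible alternative to collapsing is a long format with one row per file/aliquot pair. The sketch below uses a hypothetical stand-in for info[["cases"]] with made-up barcodes. It is not necessarily what the package adopted.

```r
# Hypothetical stand-in for info[["cases"]] (made-up barcodes)
cases <- list(
    file1 = list(c("TUMOUR-1", "NORMAL-1")),
    file2 = list("NORMAL-2")
)
file_name <- c("file1.vcf.gz", "file2.vcf.gz")

# Flatten each file's aliquots separately and count them
aliquots <- lapply(cases, function(x) unlist(x, use.names = FALSE))
n_per_file <- lengths(aliquots)

# Repeat each file name once per aliquot so all columns have equal length
long <- data.frame(
    file_name = rep(file_name, times = n_per_file),
    aliquots.submitter_id = unlist(aliquots, use.names = FALSE),
    row.names = NULL, stringsAsFactors = FALSE
)
long
```

The trade-off is that file names are no longer unique in the result, and a file with zero aliquots would drop out entirely, so callers that expect one row per input file would need to handle that.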

LiNk-NY commented 4 years ago

Thanks @galder-max I am looking at this now.

LiNk-NY commented 4 years ago

Fixed in cc02730e0f7a8558472ae289e93b6f277db95449 Thanks for the report.