rli012 / GDCRNATools

GDCRNATools: an R/Bioconductor package for integrative analysis of lncRNA, miRNA and mRNA data in GDC
Apache License 2.0

Error Downloading RNAseq Data with gdcRNADownload() #20

Open Josuerinho opened 2 years ago

Josuerinho commented 2 years ago

Hi all!!

I've been trying to use the function gdcRNADownload() to download RNAseq data from TCGA but no matter what RNAseq type I try, I always get the same error:

Successfully downloaded: 0
Warning message:
In read.table(paste(url, "&return_type=manifest", sep = ""), header = TRUE, :
  incomplete final line found by readTableHeader on 'https://api.gdc.cancer.gov/files?filters=%7B%22op%22:%22and%22,%22content%22:[%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22cases.project.project_id%22,%22value%22:[%22TCGA-CHOL%22]%7D%7D,%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22files.data_category%22,%22value%22:%22Transcriptome%20Profiling%22%7D%7D,%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22files.data_type%22,%22value%22:%22Gene%20Expression%20Quantification%22%7D%7D,%7B%22op%22:%22in%22,%22content%22:%7B%22field%22:%22files.analysis.workflow_type%22,%22value%22:%22HTSeq%20-%20Counts%22%7D%7D]%7D&pretty=true&format=JSON&size=10000&expand=analysis,analysis.input_files,associated_entities,cases,cases.diagnoses,cases.diagnoses.treatments,cases.demographic,cases.project,cases.samples,cases.samples.portions,cases.samples.portions.analytes,cases.samples.portions.analytes.aliquots,cases.samples.portions.slides&return_type=manifest'

It only happens with RNAseq type of data. I can download miRNAs data without problems. Initially I was working on a Macbook air with M1 chip:

sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6

Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages: [1] stringr_1.4.0 readxl_1.4.0 tibble_3.1.6 oligo_1.56.0
[5] Biostrings_2.60.2 GenomeInfoDb_1.28.4 XVector_0.32.0 IRanges_2.26.0
[9] S4Vectors_0.30.2 oligoClasses_1.54.0 GEOquery_2.60.0 Biobase_2.52.0
[13] BiocGenerics_0.38.0 edgeR_3.34.1 limma_3.48.3 GDCRNATools_1.13.1

loaded via a namespace (and not attached): [1] utf8_1.2.2 tidyselect_1.1.2 RSQLite_2.2.12
[4] AnnotationDbi_1.54.1 htmlwidgets_1.5.4 grid_4.1.1
[7] BiocParallel_1.26.2 scatterpie_0.1.7 munsell_0.5.0
[10] codetools_0.2-18 preprocessCore_1.54.0 DT_0.22
[13] colorspace_2.0-3 GOSemSim_2.18.1 filelock_1.0.2
[16] knitr_1.38 rstudioapi_0.13 ggsignif_0.6.3
[19] DOSE_3.18.3 pathview_1.32.0 MatrixGenerics_1.4.3
[22] KEGGgraph_1.52.0 GenomeInfoDbData_1.2.6 KMsurv_0.1-5
[25] polyclip_1.10-0 bit64_4.0.5 farver_2.1.0
[28] downloader_0.4 vctrs_0.4.0 treeio_1.16.2
[31] generics_0.1.2 xfun_0.30 BiocFileCache_2.0.0
[34] affxparser_1.64.1 R6_2.5.1 graphlayouts_0.8.0
[37] locfit_1.5-9.5 bitops_1.0-7 cachem_1.0.6
[40] fgsea_1.18.0 gridGraphics_0.5-1 DelayedArray_0.18.0
[43] assertthat_0.2.1 promises_1.2.0.1 scales_1.1.1
[46] ggraph_2.0.5 enrichplot_1.12.3 gtable_0.3.0
[49] tidygraph_1.2.1 rlang_1.0.2 genefilter_1.74.1
[52] splines_4.1.1 rstatix_0.7.0 lazyeval_0.2.2
[55] broom_0.7.12 BiocManager_1.30.16 reshape2_1.4.4
[58] abind_1.4-5 backports_1.4.1 httpuv_1.6.5
[61] qvalue_2.24.0 clusterProfiler_4.0.5 tools_4.1.1
[64] ggplotify_0.1.0 ggplot2_3.3.5 affyio_1.62.0
[67] ellipsis_0.3.2 gplots_3.1.1 ff_4.0.5
[70] RColorBrewer_1.1-3 Rcpp_1.0.8.3 plyr_1.8.7
[73] progress_1.2.2 zlibbioc_1.38.0 purrr_0.3.4
[76] RCurl_1.98-1.6 prettyunits_1.1.1 ggpubr_0.4.0
[79] viridis_0.6.2 cowplot_1.1.1 zoo_1.8-9
[82] SummarizedExperiment_1.22.0 ggrepel_0.9.1 magrittr_2.0.3
[85] data.table_1.14.2 DO.db_2.9 survminer_0.4.9
[88] matrixStats_0.61.0 hms_1.1.1 patchwork_1.1.1
[91] mime_0.12 xtable_1.8-4 XML_3.99-0.9
[94] gridExtra_2.3 compiler_4.1.1 biomaRt_2.48.3
[97] KernSmooth_2.23-20 crayon_1.5.1 shadowtext_0.1.1
[100] htmltools_0.5.2 ggfun_0.0.6 later_1.3.0
[103] tzdb_0.3.0 tidyr_1.2.0 geneplotter_1.70.0
[106] aplot_0.1.3 DBI_1.1.2 tweenr_1.0.2
[109] dbplyr_2.1.1 MASS_7.3-56 rappdirs_0.3.3
[112] Matrix_1.4-1 car_3.0-12 readr_2.1.2
[115] cli_3.2.0 igraph_1.3.0 km.ci_0.5-2
[118] GenomicRanges_1.44.0 pkgconfig_2.0.3 xml2_1.3.3
[121] foreach_1.5.2 ggtree_3.0.4 annotate_1.70.0
[124] yulab.utils_0.0.4 digest_0.6.29 graph_1.70.0
[127] cellranger_1.1.0 fastmatch_1.1-3 survMisc_0.5.5
[130] tidytree_0.3.9 curl_4.3.2 shiny_1.7.1
[133] gtools_3.9.2 rjson_0.2.21 lifecycle_1.0.1
[136] nlme_3.1-157 GenomicDataCommons_1.16.0 jsonlite_1.8.0
[139] carData_3.0-5 viridisLite_0.4.0 fansi_1.0.3
[142] pillar_1.7.0 lattice_0.20-45 KEGGREST_1.32.0
[145] fastmap_1.1.0 httr_1.4.2 survival_3.3-1
[148] GO.db_3.13.0 glue_1.6.2 png_0.1-7
[151] iterators_1.0.14 bit_4.0.4 Rgraphviz_2.36.0
[154] ggforce_0.3.3 stringi_1.7.6 blob_1.2.2
[157] DESeq2_1.32.0 org.Hs.eg.db_3.13.0 caTools_1.18.2
[160] memoise_2.0.1 dplyr_1.0.8 ape_5.6-2

But I also have the same issue when I try to execute the same function in the cluster:

sessionInfo()

R version 4.1.3 (2022-03-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Springdale Linux 7.9 (Verona)

Matrix products: default BLAS/LAPACK: /ifs/data/fg2532_lab/jc5737/Conda_env/lib/libopenblasp-r0.3.18.so

locale: [1] C

attached base packages: [1] stats4 stats graphics grDevices utils datasets methods
[8] base

other attached packages: [1] stringr_1.4.0 readxl_1.4.0 tibble_3.1.6
[4] oligo_1.58.0 Biostrings_2.62.0 GenomeInfoDb_1.30.1
[7] XVector_0.34.0 IRanges_2.28.0 S4Vectors_0.32.4
[10] oligoClasses_1.56.0 GEOquery_2.62.2 Biobase_2.54.0
[13] BiocGenerics_0.40.0 edgeR_3.36.0 limma_3.50.1
[16] GDCRNATools_1.14.0

loaded via a namespace (and not attached): [1] utf8_1.2.2 tidyselect_1.1.2
[3] RSQLite_2.2.12 AnnotationDbi_1.56.2
[5] htmlwidgets_1.5.4 grid_4.1.3
[7] BiocParallel_1.28.3 scatterpie_0.1.7
[9] munsell_0.5.0 preprocessCore_1.56.0
[11] codetools_0.2-18 DT_0.22
[13] colorspace_2.0-3 GOSemSim_2.20.0
[15] filelock_1.0.2 knitr_1.38
[17] ggsignif_0.6.3 DOSE_3.20.1
[19] pathview_1.34.0 MatrixGenerics_1.6.0
[21] KEGGgraph_1.54.0 GenomeInfoDbData_1.2.7
[23] KMsurv_0.1-5 polyclip_1.10-0
[25] bit64_4.0.5 farver_2.1.0
[27] downloader_0.4 vctrs_0.4.0
[29] treeio_1.18.1 generics_0.1.2
[31] xfun_0.30 BiocFileCache_2.2.1
[33] affxparser_1.66.0 R6_2.5.1
[35] graphlayouts_0.8.0 locfit_1.5-9.5
[37] bitops_1.0-7 cachem_1.0.6
[39] fgsea_1.20.0 gridGraphics_0.5-1
[41] DelayedArray_0.20.0 assertthat_0.2.1
[43] promises_1.2.0.1 scales_1.1.1
[45] ggraph_2.0.5 enrichplot_1.14.2
[47] gtable_0.3.0 tidygraph_1.2.1
[49] rlang_1.0.2 genefilter_1.76.0
[51] splines_4.1.3 rstatix_0.7.0
[53] lazyeval_0.2.2 broom_0.7.12
[55] BiocManager_1.30.16 reshape2_1.4.4
[57] abind_1.4-5 backports_1.4.1
[59] httpuv_1.6.5 qvalue_2.26.0
[61] clusterProfiler_4.2.2 tools_4.1.3
[63] ggplotify_0.1.0 ggplot2_3.3.5
[65] affyio_1.64.0 ellipsis_0.3.2
[67] gplots_3.1.1 ff_4.0.5
[69] RColorBrewer_1.1-3 Rcpp_1.0.8.3
[71] plyr_1.8.7 progress_1.2.2
[73] zlibbioc_1.40.0 purrr_0.3.4
[75] RCurl_1.98-1.6 prettyunits_1.1.1
[77] ggpubr_0.4.0 viridis_0.6.2
[79] zoo_1.8-9 SummarizedExperiment_1.24.0
[81] ggrepel_0.9.1 magrittr_2.0.3
[83] data.table_1.14.2 DO.db_2.9
[85] survminer_0.4.9 matrixStats_0.61.0
[87] hms_1.1.1 patchwork_1.1.1
[89] mime_0.12 xtable_1.8-4
[91] XML_3.99-0.9 gridExtra_2.3
[93] compiler_4.1.3 biomaRt_2.50.3
[95] KernSmooth_2.23-20 crayon_1.5.1
[97] shadowtext_0.1.1 htmltools_0.5.2
[99] ggfun_0.0.6 later_1.3.0
[101] tzdb_0.3.0 tidyr_1.2.0
[103] geneplotter_1.72.0 aplot_0.1.3
[105] DBI_1.1.2 tweenr_1.0.2
[107] dbplyr_2.1.1 MASS_7.3-56
[109] rappdirs_0.3.3 Matrix_1.4-1
[111] car_3.0-12 readr_2.1.2
[113] cli_3.2.0 parallel_4.1.3
[115] igraph_1.3.0 GenomicRanges_1.46.1
[117] pkgconfig_2.0.3 km.ci_0.5-6
[119] xml2_1.3.3 foreach_1.5.2
[121] ggtree_3.2.1 annotate_1.72.0
[123] yulab.utils_0.0.4 digest_0.6.29
[125] graph_1.72.0 cellranger_1.1.0
[127] fastmatch_1.1-3 survMisc_0.5.6
[129] tidytree_0.3.9 curl_4.3.2
[131] shiny_1.7.1 gtools_3.9.2
[133] rjson_0.2.21 lifecycle_1.0.1
[135] nlme_3.1-157 GenomicDataCommons_1.18.0
[137] jsonlite_1.8.0 carData_3.0-5
[139] viridisLite_0.4.0 fansi_1.0.3
[141] pillar_1.7.0 lattice_0.20-45
[143] KEGGREST_1.34.0 fastmap_1.1.0
[145] httr_1.4.2 survival_3.3-1
[147] GO.db_3.14.0 glue_1.6.2
[149] png_0.1-7 iterators_1.0.14
[151] bit_4.0.4 Rgraphviz_2.38.0
[153] ggforce_0.3.3 stringi_1.7.6
[155] blob_1.2.2 DESeq2_1.34.0
[157] org.Hs.eg.db_3.14.0 caTools_1.18.2
[159] memoise_2.0.1 dplyr_1.0.8
[161] ape_5.6-2

I don't know how to solve this: when I step through gdcRNADownload() line by line to troubleshoot it, R reports that one of its inner functions, gdcGetURL(), is not found. So I can't tell where the error comes from, because I can't inspect the URL used for the RNAseq data. It might even be a format problem with the downloaded data. I know this issue was reported before, but since there was no follow-up, I thought a new thread might bring it a bit more attention. Sorry, and thanks a lot for your help!

Josu

pamonlan commented 2 years ago

I got the same issue. It looks like the query used to obtain the manifest from the GDC API has changed, and now we get an empty table. They have to change the URL query.
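A minimal sketch of how the empty table could be caught early instead of ending in "Successfully downloaded: 0". The helper name check_manifest is hypothetical and not part of GDCRNATools; it assumes the manifest has already been read into a data frame with read.table(), as gdcRNADownload() does.

```r
# Hypothetical helper (not part of GDCRNATools): stop with a clear message
# when the manifest query matched no files.
check_manifest <- function(manifest) {
  if (!is.data.frame(manifest) || nrow(manifest) == 0) {
    stop("GDC returned a manifest with no files; check the query filters ",
         "(for example the workflow type).", call. = FALSE)
  }
  invisible(manifest)
}

# An empty manifest with the columns used by the download code
empty <- data.frame(id = character(0), filename = character(0),
                    stringsAsFactors = FALSE)
msg <- tryCatch(check_manifest(empty), error = function(e) conditionMessage(e))

# A manifest with one (made-up) entry passes through unchanged
ok <- check_manifest(data.frame(id = "example-id", filename = "example.tsv",
                                stringsAsFactors = FALSE))
```

Calling such a check right after the manifest is read would point at the query filters rather than at the download step.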

pranavkatariain commented 2 years ago

There is some issue with the HTSeq - Counts data on the GDC portal; I guess it is no longer available after the new update. So we need to change the workflow.type to "STAR - Counts".
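One way to see which workflow types are currently available for a project is to ask the GDC files endpoint for facet counts only. The sketch below just builds such a URL; the helper name gdc_workflow_facet_url is hypothetical, and the use of the facets and size=0 query parameters is an assumption based on the public GDC API rather than on anything GDCRNATools does.

```r
# Hypothetical helper (not part of GDCRNATools): build a GDC "files" query
# that returns only counts per workflow type for one project.
# Assumption: the endpoint accepts facets=analysis.workflow_type and size=0.
gdc_workflow_facet_url <- function(project.id) {
  filters <- paste0(
    "{\"op\":\"and\",\"content\":[",
    "{\"op\":\"in\",\"content\":{\"field\":\"cases.project.project_id\",",
    "\"value\":[\"", project.id, "\"]}},",
    "{\"op\":\"in\",\"content\":{\"field\":\"files.data_type\",",
    "\"value\":[\"Gene Expression Quantification\"]}}",
    "]}"
  )
  paste0(
    "https://api.gdc.cancer.gov/files?",
    "filters=", utils::URLencode(filters, reserved = TRUE),
    "&facets=analysis.workflow_type&size=0&pretty=true"
  )
}

url <- gdc_workflow_facet_url("TCGA-CHOL")
# Fetching needs network access; the JSON could then be read with,
# for example, jsonlite::fromJSON(url).
```

If "HTSeq - Counts" is missing from the returned buckets while "STAR - Counts" is present, that would be consistent with the explanation above.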

Josuerinho commented 2 years ago

Hi all! I finally managed to get access to the code of some of the functions involved. As @pranavkataria978 mentioned, the issue occurs when downloading the "RNAseq" data type: gdcGetURL() looks for the workflow type "HTSeq - Counts", which no longer exists. After checking the database, the closest current workflow appears to be "STAR - Counts", as he said. So if you create your own version of gdcGetURL() with this small change, it made sense to me that it should work. The only problem is that the code calls several other functions that, for some reason, can no longer be found from outside the package (no idea why this happens...). In the end, I had to rename a few more functions and save them in my current environment to make everything work again. After this step, all of these functions can be found and called. Here is what I have right now to make the RNAseq download work:

gdcGetURL_2 <- function(project.id, data.type) {
  urlAPI <- "https://api.gdc.cancer.gov/files?"

  if (data.type == "RNAseq") {
    data.category <- "Transcriptome Profiling"
    data.type <- "Gene Expression Quantification"
    workflow.type <- "STAR - Counts"  ## previously "HTSeq - Counts"
  } else if (data.type == "miRNAs") {
    data.category <- "Transcriptome Profiling"
    data.type <- "Isoform Expression Quantification"
    workflow.type <- "BCGSC miRNA Profiling"
  } else if (data.type == "Clinical") {
    data.category <- "Clinical"
    data.type <- "Clinical Supplement"
    workflow.type <- NA
  } else if (data.type == "pre-miRNAs") {
    data.category <- "Transcriptome Profiling"
    data.type <- "miRNA Expression Quantification"
    workflow.type <- "BCGSC miRNA Profiling"
  }

  ## One {"op":"in", ...} clause per filter
  project <- paste("{\"op\":\"in\",\"content\":{\"field\":\"cases.",
                   "project.project_id\",\"value\":[\"", project.id, "\"]}}",
                   sep = "")
  dataCategory <- paste("{\"op\":\"in\",\"content\":{\"field\":\"files.",
                        "data_category\",\"value\":\"", data.category, "\"}}",
                        sep = "")
  dataType <- paste("{\"op\":\"in\",\"content\":{\"field\":\"files.data_type\",",
                    "\"value\":\"", data.type, "\"}}", sep = "")
  workflowType <- paste("{\"op\":\"in\",\"content\":{\"field\":\"files.",
                        "analysis.workflow_type\",\"value\":\"", workflow.type,
                        "\"}}", sep = "")

  ## Clinical data has no workflow type; filter on the data format instead
  if (is.na(workflow.type)) {
    dataFormat <- paste("{\"op\":\"in\",\"content\":{\"field\":\"files.",
                        "data_format\",\"value\":\"", "BCR XML", "\"}}",
                        sep = "")
    content <- paste(project, dataCategory, dataType, dataFormat, sep = ",")
  } else {
    content <- paste(project, dataCategory, dataType, workflowType, sep = ",")
  }

  filters <- paste("filters=",
                   URLencode(paste("{\"op\":\"and\",\"content\":[", content,
                                   "]}", sep = "")),
                   sep = "")
  expand <- paste("analysis", "analysis.input_files", "associated_entities",
                  "cases", "cases.diagnoses", "cases.diagnoses.treatments",
                  "cases.demographic", "cases.project", "cases.samples",
                  "cases.samples.portions", "cases.samples.portions.analytes",
                  "cases.samples.portions.analytes.aliquots",
                  "cases.samples.portions.slides", sep = ",")
  expand <- paste("expand=", expand, sep = "")
  payload <- paste(filters, "pretty=true", "format=JSON", "size=10000", expand,
                   sep = "&")
  url <- paste(urlAPI, payload, sep = "")
  return(url)
}

############# #############

And for the other functions, I just renamed them and saved them in my local environment because of the problems I mentioned before:

downloadClientFun_2 <- function(os) {
  if (os == "Linux") {
    adress <- paste("https://gdc.cancer.gov/files/public/file/",
                    "gdc-client_v1.6.0_Ubuntu_x64-py3.7_0.zip", sep = "")
    download.file(adress, destfile = "./gdc-client_v1.6.0_Ubuntu_x64-py3.7_0.zip")
    unzip("./gdc-client_v1.6.0_Ubuntu_x64-py3.7_0.zip")
  } else if (os == "Windows") {
    adress <- paste("https://gdc.cancer.gov/files/public/file/",
                    "gdc-client_v1.6.0_Windows_x64-py3.7_0.zip", sep = "")
    download.file(adress, destfile = "./gdc-client_v1.6.0_Windows_x64-py3.7_0.zip")
    unzip("./gdc-client_v1.6.0_Windows_x64-py3.7_0.zip")
  } else if (os == "Darwin") {
    adress <- paste("https://gdc.cancer.gov/files/public/file/",
                    "gdc-client_v1.6.0_OSX_x64_1.zip", sep = "")
    download.file(adress, destfile = "./gdc-client_v1.6.0_OSX_x64_1.zip")
    unzip("./gdc-client_v1.6.0_OSX_x64_1.zip")
  }
}

############# #############

file.move_2 <- function(files, directory) {
  file.copy(from = files, to = directory, recursive = TRUE)
  unlink(files, recursive = TRUE)
}

############# #############

manifestDownloadFun_2 <- function(manifest = manifest, directory) {
  if (!file.exists("gdc-client") & !file.exists("gdc-client.exe")) {
    downloadClientFun_2(Sys.info()[1])
  }
  Sys.chmod("gdc-client")

  manifestDa <- read.table(manifest, sep = "\t", header = TRUE,
                           stringsAsFactors = FALSE)
  ex <- manifestDa$filename %in% dir(paste(directory, dir(directory), sep = "/"))
  nonex <- !ex
  numFiles <- sum(ex)

  if (numFiles > 0) {
    message(paste("Already exists", numFiles, "files !", sep = " "))
    if (sum(nonex) > 0) {
      message(paste("Download the other", sum(nonex), "files !", sep = " "))
      manifestDa <- manifestDa[nonex, ]
      manifest <- paste(manifestDa$id, collapse = " ")
      system(paste("./gdc-client download ", manifest, sep = ""))
    } else {
      return(invisible())
    }
  } else {
    system(paste("./gdc-client download -m ", manifest, sep = ""))
  }

  files <- manifestDa$id
  if (directory == "Data") {
    if (!dir.exists("Data")) {
      dir.create("Data")
    }
  } else {
    if (!dir.exists(directory)) {
      dir.create(directory, recursive = TRUE)
    }
  }
  file.move_2(files, directory)
}

############# #############

gdcRNADownload_2 <- function(manifest = NULL, project.id, data.type,
                             directory = "Data", write.manifest = FALSE,
                             method = "gdc-client") {
  if (!is.null(manifest)) {
    manifestDownloadFun_2(manifest = manifest, directory = directory)
  } else {
    url <- gdcGetURL_2(project.id = project.id, data.type = data.type)
    manifest <- read.table(paste(url, "&return_type=manifest", sep = ""),
                           header = TRUE, stringsAsFactors = FALSE)
    systime <- gsub(" ", "T", Sys.time())
    systime <- gsub(":", "-", systime)
    manifile <- paste(project.id, data.type, "gdc_manifest", systime, "txt",
                      sep = ".")
    write.table(manifest, file = manifile, row.names = FALSE, sep = "\t",
                quote = FALSE)

    if (method == "GenomicDataCommons") {
      ex <- manifest$filename %in% dir(directory)
      nonex <- !ex
      numFiles <- sum(ex)
      if (numFiles > 0) {
        message(paste("Already exists", numFiles, "files !", sep = " "))
        if (sum(nonex) > 0) {
          message(paste("Download the other", sum(nonex), "files !", sep = " "))
          manifest <- manifest[nonex, ]
          fnames = lapply(manifest$id, gdcdata, destination_dir = directory,
                          overwrite = TRUE, progress = TRUE)
        } else {
          return(invisible())
        }
      } else {
        fnames = lapply(manifest$id, gdcdata, destination_dir = directory,
                        overwrite = TRUE, progress = TRUE)
      }
    } else if (method == "gdc-client") {
      manifestDownloadFun_2(manifest = manifile, directory = directory)
    }

    if (write.manifest == FALSE) {
      invisible(file.remove(manifile))
    }
  }
}

############# #############

I believe I haven't missed any of them. Now it should all work nicely. For example:

project <- 'TCGA-PRAD'
gdcRNADownload_2(project.id = project, data.type = 'RNAseq', write.manifest = FALSE, method = 'gdc-client', directory = "Your/Own/directory")

Let me know if I may have missed something!
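On the helper functions that "aren't found": a likely explanation, assuming helpers such as gdcGetURL() are internal (unexported) functions of the package, is that a copy of a function saved in the global environment no longer looks names up inside the package. The sketch below simulates this with a plain environment; pkg_env, gdc_helper_demo, caller and copied are illustrative names only.

```r
# Simulate a package: the helper lives only inside `pkg_env`.
pkg_env <- new.env()
pkg_env$gdc_helper_demo <- function() "helper found"

# A function whose enclosing environment is the simulated package
caller <- function() gdc_helper_demo()
environment(caller) <- pkg_env
in_package <- caller()            # the helper is looked up in pkg_env

# The same function, but enclosed by the global environment,
# like a copy pasted into the console
copied <- caller
environment(copied) <- globalenv()
outside <- tryCatch(copied(), error = function(e) conditionMessage(e))
```

Under the same assumption, a modified copy could be given the package namespace as its environment, for example environment(gdcRNADownload_2) <- asNamespace("GDCRNATools"), or the internals could be reached with the ::: operator, instead of copying every helper by hand.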

benchsar commented 1 year ago

Hello @Josuerinho

I wanted to test your code, but I get this error:

Error in paste(filters, "pretty=true", "format=JSON", "size=10000", expand, : object 'filters' not found
> url <- paste(urlAPI, payload, sep = "")
Error in paste(urlAPI, payload, sep = "") : object 'urlAPI' not found
> return(url)
Error: no function to return from, jumping to top level
> }
Error: unexpected '}' in "}"

Josuerinho commented 1 year ago

Hi @benchsar! Sorry for the late reply. The code I posted was just a small workaround for the original functions to get them working, but the problem has since been solved and the original functions work as expected again. Try them and let me know if that's not the case.