mksamur / RTCGAToolbox


getFirehoseData error: Error in dirname(name) : path too long #21

Closed JunzeChen closed 4 years ago

JunzeChen commented 7 years ago

Dear @mksamur,

RTCGAToolbox is a powerful and useful package. However, when I run `Data <- getFirehoseData(dataset="LUAD", runDate="20160128", Clinic=TRUE, RNAseq_Gene=TRUE, mRNA_Array=TRUE, Mutation=TRUE)` to download data, I get the following error: `Error in dirname(name) : path too long`

How can I solve this problem?

Many thanks! Junze Chen

sessionInfo()

    R version 3.4.0 (2017-04-21)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows >= 8 x64 (build 9200)

    Matrix products: default

    locale:
    [1] LC_COLLATE=Chinese (Simplified)_China.936
    [2] LC_CTYPE=Chinese (Simplified)_China.936
    [3] LC_MONETARY=Chinese (Simplified)_China.936
    [4] LC_NUMERIC=C
    [5] LC_TIME=Chinese (Simplified)_China.936

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] RTCGAToolbox_2.5.2

    loaded via a namespace (and not attached):
     [1] Rcpp_0.12.10     lattice_0.20-34  XML_3.98-1.6      bitops_1.0-6
     [5] grid_3.4.0       plyr_1.8.4       gtable_0.2.0      scales_0.4.1
     [9] ggplot2_2.2.1    lazyeval_0.2.0   data.table_1.10.4 RCircos_1.2.0
    [13] limma_3.31.21    Matrix_1.2-8     cowplot_0.7.0     splines_3.4.0
    [17] RJSONIO_1.3-0    tools_3.4.0      RCurl_1.95-4.8    munsell_0.4.3
    [21] survival_2.41-2  compiler_3.4.0   colorspace_1.3-2  tibble_1.3.0

The code is:

    library(RTCGAToolbox)
    getFirehoseDatasets()
    getFirehoseRunningDates(last=2)
    Data <- getFirehoseData(dataset = "LUAD", runDate = "20160128", Clinic = TRUE, RNAseq_Gene = TRUE, mRNA_Array = TRUE, Mutation = TRUE)
    diffGeneExprs = getDiffExpressedGenes(dataObject=Data, DrawPlots=TRUE, adj.method="BH", adj.pval=0.05, raw.pval=0.05, logFC=2, hmTopUpN=100, hmTopDownN=100)

mksamur commented 7 years ago

Hi @JunzeChen, unfortunately this is a problem with Windows operating systems. Windows imposes a character limit on file paths, and that limit is what causes the error.
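A quick way to see whether a given download location is at risk is to measure the full path before downloading. The sketch below is only an illustration: `path_too_long` is a hypothetical helper, not part of RTCGAToolbox, and it assumes the classic Windows `MAX_PATH` value of 260 characters. The thread does not say which path inside the package actually trips the limit, so paths created while unpacking the archive may be longer than the one checked here.

```r
# Hypothetical helper (not part of RTCGAToolbox): flag a destination path
# that would exceed an assumed Windows MAX_PATH of 260 characters.
path_too_long <- function(dest_dir, file_name, limit = 260L) {
  full_path <- file.path(dest_dir, file_name)
  # MAX_PATH counts the terminating NUL, so the usable length is limit - 1.
  nchar(full_path, type = "chars") > (limit - 1L)
}

# Made-up file names, for illustration only.
long_name <- paste0(strrep("a", 300), ".tar.gz")
long_flag <- path_too_long("c:/temp", long_name)        # TRUE
short_flag <- path_too_long("c:/temp", "short.tar.gz")  # FALSE
```

Working from a short directory such as `c:/temp` helps, but it cannot shorten the file names that come from the server.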

dmbergau commented 6 years ago

Has there been any further activity on this? Corporate standards constrain me to using a Windows system and my working directory is c:/temp, which is the shortest I am allowed. The name of the file I am trying to download from the Broad Institute is "gdac.broadinstitute.org_LAML.Merge_methylationhumanmethylation27jhu_usc_edu__Level_3within_bioassay_data_set_functiondata.Level_3.2016012800.0.0.tar.gz" which is triggering the following error:

Content type 'application/x-gzip' length 49624216 bytes (47.3 MB)
downloaded 47.3 MB

Error in dirname(name) : path too long

The download appears to have worked based on the messages, so I am wondering whether an updated version of the RTCGAToolbox package could shorten the file name after the download, so that it fits the Windows limit before further processing. As a thought, the original name could be retained in an object for verification purposes and/or written out in the messages for tracking in the event that multiple files are downloaded.

LiNk-NY commented 6 years ago

Hi @dmbergau, Yes, this is a known issue for Windows users, see #19 and #22. I can have a look at it but I can't guarantee a fix. If you'd like to take a stab at it, I'll be happy to review a pull request.

Best regards, Marcel

dmbergau commented 6 years ago

Thank you Marcel for your quick response.

I am a physiologist and a bit of a newbie to programming and to GitHub, so the specifics and syntax of this are well above my programming capabilities. In terms of "R-ish" pseudocode inspired by #19, (hopefully) without having to use external software like 7-zip, here is what I am thinking:

    maxchar <- 200 # or whatever the Windows path character limit is
    append <- 10   # however much room you want to leave at the end to append an increment

    # check the length of the file name
    if (length(fileList) == 1 & nchar(fileList) > maxchar){

        # save the name of the downloaded file, not the file itself, for tracking purposes, maybe in a dataframe?
        tmp1 <- fileList # not sure of correct syntax

        # truncate the file name and leave room for increments. Not sure of syntax to do it, but subtract
        # the number of characters to append from the maximum allowed by Windows
        shorter <- fileList[maxchar-append] # probably wrong syntax, maybe substring?

        # check to see if there is a file with the same name that has already been shortened
        # if not, append _1
        if (shorter does not already exist){
            shortenedName <- paste0(shorter, "_1")
            file.rename(tmp1, shortenedName)
            increment <- increment + 1
        }
        # if so, append an incremented number to the shortened name
        else if (shorter already exists){
            # get the maximum increment for this file name, not sure how to do it (substring? check dataframe?)
            # add 1 to it
            shortenedName <- paste0(shorter, "_", increment)
            file.rename(tmp1, shortenedName)
            increment <- increment + 1
        }

        # add the shortened file name to a second column of the data frame that holds the corresponding
        # incoming file name, i.e. "original name" and "shortened name"

        # untar the file with the shorter name and whatever other processing needs to happen from here
    }

There may be more to it programmatically, but this is the best I can do.
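For reference, the pseudocode above could be made runnable roughly as follows. This is a minimal sketch under stated assumptions, not how RTCGAToolbox handles downloads: `shorten_file`, `max_chars`, and `reserve` are hypothetical names, the 200-character threshold is the placeholder from the pseudocode, and the approach only works when the downloaded file's own path is still short enough for R to handle (as in the report above, where the download itself succeeded). As in the pseudocode, the file extension is dropped by the truncation.

```r
# Hypothetical helper (not part of RTCGAToolbox): rename an over-long file to
# a truncated, numbered name and return the original/shortened name pair.
shorten_file <- function(path, max_chars = 200L, reserve = 10L) {
  stopifnot(length(path) == 1L, file.exists(path))
  base <- basename(path)
  if (nchar(base) <= max_chars) {
    # Short enough already: nothing to rename.
    return(data.frame(original = base, shortened = base,
                      stringsAsFactors = FALSE))
  }
  # Truncate, leaving `reserve` characters free for the numeric suffix.
  stem <- substr(base, 1L, max_chars - reserve)
  # Use the first suffix (_1, _2, ...) not already taken in that directory.
  i <- 1L
  repeat {
    candidate <- file.path(dirname(path), paste0(stem, "_", i))
    if (!file.exists(candidate)) break
    i <- i + 1L
  }
  if (!file.rename(path, candidate)) stop("could not rename ", path)
  data.frame(original = base, shortened = basename(candidate),
             stringsAsFactors = FALSE)
}

# Usage with a made-up file in a temporary directory.
tmp_dir <- tempfile("demo")
dir.create(tmp_dir)
long_file <- file.path(tmp_dir, paste0(strrep("x", 60), ".tar.gz"))
file.create(long_file)
name_map <- shorten_file(long_file, max_chars = 30L, reserve = 10L)
nchar(name_map$shortened)  # 22: 20 kept characters plus "_1"
```

The returned data frame plays the role of the "original name" / "shortened name" table described above, and rows from several downloads can be combined with `rbind()`.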

Best regards, Dennis

lwaldron commented 6 years ago

You might also want to try curatedTCGAData, which further processes data from RTCGAToolbox to provide SummarizedExperiment and RaggedExperiment objects within MultiAssayExperiments, so that all assays are linked to patient data and to each other. It doesn't have the same issue on Windows.

dmbergau commented 6 years ago

Thank you Levi, I will give that a try and let you know.

Warm regards, Dennis



LiNk-NY commented 4 years ago

I've added a warning; see #29. There is not much that can be done (in R), since it is more of an OS issue. Best, Marcel