ncbi / sra-tools

SRA Tools

Issue with Downloading more than 1000 files #862

Closed kcmtest closed 8 months ago

kcmtest commented 9 months ago

So I'm running a pipeline with multiple steps, and sra-tools fails at the make_fastq step, where a given SRR id is passed to prefetch, then vdb-validate, and finally fasterq-dump. In one of the studies I ran we have more than 1000 single-end samples; it downloaded 343 of them but then failed on one of the samples, where the error log shows this:

```
[\d \t] PS4='[\d \t] '
[\d \t] basename /ces/docker-stagedir/stgbd9b6702-c3c5-4c36-8f11-e97a0fa61711/GSM5976817.id
[\d \t] a=GSM5976817.id
[\d \t] id=GSM5976817
[\d \t] vdb-config --interactive
2023-10-09T12:55:48 vdb-config.3.0.1 fatal: SIGNAL - Segmentation fault
[\d \t] prefetch --max-size 420G GSM5976817
2023-10-09T12:55:49 prefetch.3.0.1: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
[\d \t] ls -1
[\d \t] grep -v make_fastq.sh
[\d \t] sraID=
[\d \t] vdb-validate
2023-10-09T12:55:49 vdb-validate.3.0.1 err: param insufficient while validating path - No paths to validate
Usage: vdb-validate [options] path [ path... ]

Use option --help for more information.

[\d \t] fasterq-dump
[\d \t] echo 'GSM5976817.fastq'
[\d \t] sleep 1
[\d \t] pigz 'GSM5976817.fastq'
pigz: skipping: GSM5976817.fastq does not exist
[\d \t] wait
[\d \t] '[' -f GSM5976817_R1.fastq.gz ]
[\d \t] echo 'Space before tmp folder delete'
[\d \t] echo 'Removing tmp folder '
[\d \t] ls make_fastq.sh
[\d \t] rm -rf make_fastq.sh
[\d \t] ls ''
ls: *: No such file or directory
[\d \t] echo 'done'
```



This indicates prefetch didn't work, although it passed the validation step, and I have also checked that the file contains data in the NCBI database.
Now, in order to fix this, I tried increasing the timeout setting with vdb-config. Still no success.

Please let me know what sort of modification or setting I should make to fix this.

Any suggestion would be really helpful.
durbrow commented 9 months ago

GSM5976817 is SRR18509337. It is a small file, about a GB. I suggest updating to the latest version of the software and then trying it again.

durbrow commented 9 months ago

By the way, prefetch GSM5976817 will download SRR18509337, not anything named GSM5976817. The toolkit only uses run accessions in the SRA namespace, with names like SRR000001 and ERR123456; it doesn't use GEO accessions like GSM5976817. The toolkit queries a service at NCBI, the SRA Data Locator (SDL) service, to get URLs for accessions, and the SDL is replying with a URL for SRR18509337. So, if your script/pipeline is looking for GSM5976817*, it will not exist.

kcmtest commented 9 months ago

> By the way, prefetch GSM5976817 will download SRR18509337, not anything named GSM5976817. The toolkit only uses run accessions in the SRA namespace, with names like SRR000001 and ERR123456; it doesn't use GEO accessions like GSM5976817. The toolkit queries a service at NCBI, the SRA Data Locator (SDL) service, to get URLs for accessions, and the SDL is replying with a URL for SRR18509337. So, if your script/pipeline is looking for GSM5976817*, it will not exist.

Yes, that's true regarding your prefetch GSM5976817 point: the GSM id is what we get in the annotation table, so after downloading the SRR files and getting the fastq we rename it back to the GSM id for a particular study. My question is why it fails: I could download about 350 of the 1000 files, and when one of them fails the whole pipeline fails. Can you suggest how I can get around this issue, and what configuration change I need to bypass it?

> So, if your script/pipeline is looking for GSM5976817*, it will not exist.

Regarding this:

The first step is that I read the GSM from the annotation sheet, then I do this:

```
prefetch -t http GSM5976817

2023-10-09T19:19:14 prefetch.3.0.7: Current preference is set to retrieve SRA Lite files with simplified base quality scores.
2023-10-09T19:19:15 prefetch.3.0.7: 1) Downloading 'SRR18509337.lite'...
2023-10-09T19:19:15 prefetch.3.0.7: SRA Lite file is being retrieved, if this is different from your preference, it may be due to current file availability.
2023-10-09T19:19:15 prefetch.3.0.7:  Downloading via HTTPS...
2023-10-09T19:20:49 prefetch.3.0.7:  HTTPS download succeed
2023-10-09T19:20:51 prefetch.3.0.7:  'SRR18509337.lite' is valid
2023-10-09T19:20:51 prefetch.3.0.7: 1) 'SRR18509337.lite' was downloaded successfully
2023-10-09T19:20:51 prefetch.3.0.7: 'GSM5976817' has 0 unresolved dependencies
```

Once this passes, I run vdb-validate, then fasterq-dump, and finally rename the SRR back to the GSM id.
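
The iteration described above can be sketched as follows. This is a minimal bash sketch, not the poster's actual make_fastq.sh: the `fetch_one` name is illustrative, and it assumes prefetch materializes an SRR/ERR/DRR-named directory in the current working directory (which is why the run name is discovered by glob rather than assumed to match the GSM id).

```shell
#!/usr/bin/env bash
# Hypothetical sketch of one pipeline iteration: the GSM id from the
# annotation sheet goes in, but prefetch writes an SRR-named directory,
# so the run name is discovered instead of assumed.
fetch_one() {
  local gsm=$1 srr= d f
  prefetch --max-size 420G "$gsm" || return 1
  # prefetch creates a directory named after the resolved run accession
  # (SRR/ERR/DRR), not after the GSM id
  for d in SRR*/ ERR*/ DRR*/; do
    [ -d "$d" ] && { srr=${d%/}; break; }
  done
  [ -n "$srr" ] || return 1
  vdb-validate "$srr" || return 1
  fasterq-dump "$srr" || return 1
  # rename the fastq output back to the GSM id used in the study
  for f in "$srr"*.fastq; do
    [ -e "$f" ] && mv "$f" "${f/#$srr/$gsm}"
  done
}
```

Discovering the run name this way also avoids the `GSM5976817*` lookup failure from the original log.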
kcmtest commented 9 months ago

> The toolkit queries a service at NCBI, the SRA Data Locator (SDL) service, to get URLs for accessions, and the SDL is replying with a URL for SRR18509337. So, if your script/pipeline is looking for GSM5976817*, it will not exist.

Is it failing due to network issues?

How could I map the given GSM id (GSM5976817 here) back to SRR18509337 in case it fails?
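
One way to resolve a GSM accession to its SRR run id up front is sketched below. It assumes NCBI EDirect (`esearch`/`efetch`) is installed and that the GEO sample id is searchable in the SRA Entrez database; the `gsm_to_srr` name is illustrative, not part of sra-tools.

```shell
# Hypothetical helper: resolve a GEO sample accession to its SRA run
# accession with NCBI EDirect. The runinfo format is CSV with a header
# row whose first column is the Run accession.
gsm_to_srr() {
  esearch -db sra -query "$1" </dev/null \
    | efetch -format runinfo \
    | awk -F, 'NR == 2 { print $1 }'
}
```

With the SRR in hand, the rest of the pipeline (prefetch, vdb-validate, fasterq-dump) can operate on run accessions directly, so the downloaded file names become predictable.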

kcmtest commented 9 months ago

> GSM5976817 is SRR18509337. It is a small file, about a GB. I suggest updating to the latest version of the software and then trying it again.

Yes, my docker image is already running the latest version.

durbrow commented 9 months ago

The links out from SRA back to GEO don't seem to exist directly, but you could get there by following the link to Biosample, and then Biosample to GEO.

kcmtest commented 9 months ago

> The links out from SRA back to GEO don't seem to exist directly, but you could get there by following the link to Biosample, and then Biosample to GEO.

Okay. As I mentioned, if I have to download more than 1000 samples, do I need to change the storage setting inside my docker container? What I read here is that the default is about 100 GB for the root; how can I change or increase it? https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1623426169/sratoolkit+2.10.7+to+download+NCBI+SRA+data
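
Whatever the container limit ends up being, a per-sample free-space check before each download can fail fast instead of dying mid-study. A minimal sketch; the threshold value is an arbitrary illustration, not an sra-tools default, and a real per-sample budget would likely be larger.

```shell
# Check free space on the filesystem holding the current directory
# before starting a download. The threshold is a made-up example.
need_kb=$((100 * 1024))                      # ~100 MB, illustrative only
free_kb=$(df -Pk . | awk 'NR == 2 { print $4 }')
if [ "$free_kb" -lt "$need_kb" ]; then
  echo "only ${free_kb} KB free, refusing to start download" >&2
  exit 1
fi
```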

kcmtest commented 8 months ago

> The toolkit queries a service at NCBI, the SRA Data Locator (SDL) service, to get URLs for accessions, and the SDL is replying with a URL for SRR18509337. So, if your script/pipeline is looking for GSM5976817*, it will not exist.

Is it failing due to network issues?

Yes, most of the time it's probably the network, so we added a retry function; so far not much success: for some ids it works and for some it doesn't. As you suggested, we are using GSM ids; would it be resolved if we switched to SRR ids? And to my original question: is it possible to download 1000 files in a single run?
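
A retry wrapper of the kind described might look like the sketch below; the `retry` name, attempt count, and `RETRY_DELAY` variable are all illustrative, not sra-tools features.

```shell
# Hypothetical retry helper: run a command up to $1 times, pausing
# RETRY_DELAY seconds (default 5) between attempts.
retry() {
  local max=$1 n=1
  shift
  until "$@"; do
    if [ "$n" -ge "$max" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "${RETRY_DELAY:-5}"
  done
}
```

For example, `retry 3 prefetch --max-size 420G SRR18509337` would attempt the download up to three times before letting the pipeline see a failure.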