rvalieris / parallel-fastq-dump

parallel fastq-dump wrapper
MIT License

storage exhausted while writing file within file system module #19

Closed PhilPalmer closed 5 years ago

PhilPalmer commented 5 years ago

Hi,

I am trying to download several sequencing runs, e.g. using the following command:

parallel-fastq-dump --sra-id SRR925794 --threads 32 --gzip

But when I do, I get this error message:

fastq-dump.2.9.1 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='7'

I am running this on an AWS EC2 instance which has plenty of disk space, yet I am still getting this error. However, I am not sure if it is because parallel-fastq-dump is mounting volumes with less disk space and running there instead. It looks like these volumes are mounted on the instance, and I am not sure how else they might have been created.

Do you know if this is the case, and if so, how I can change the location where the command is run to prevent this error?

Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      394G   56G  339G  15% /
devtmpfs        121G   96K  121G   1% /dev
tmpfs           121G     0  121G   0% /dev/shm
/dev/dm-3       9.8G  4.1G  5.2G  45% /var/lib/docker/devicemapper/mnt/686a42157daedddaf1f9e187ea042311bbc553e466013fb79adc4bec8da51432
shm              64M     0   64M   0% /var/lib/docker/containers/dada83892501925da80d6abd9c27cc048c1e2c3fb90b6f587d051ca0f5e8c12c/shm

Also, do you know how I can get it to run any faster? For example, how does parallel-fastq-dump compare to fasterq-dump? I tried running that instead and it seems like it may be a bit slower. And what is the optimal value to set --threads to? I know fasterq-dump has diminishing returns as the thread count increases; does parallel-fastq-dump behave the same way?

rvalieris commented 5 years ago

Hello,

I am running this on an AWS EC2 instance which has plenty of disk space and yet I am still getting this error.

Note that by default temporary files are created under /tmp; if /tmp doesn't have much space you will get these errors. To be sure, use the --tmpdir parameter to set the temp dir explicitly.
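One way to sanity-check this before a run is to look at the free space in the temp dir directly; a minimal sketch (the /bigdisk/tmp path in the comment is just an example, substitute any filesystem with room):

```shell
#!/bin/sh
# Sketch: check free space in the temp dir before launching the dump.
# /tmp is only the default location; --tmpdir can point anywhere roomier.
TMP="${TMPDIR:-/tmp}"
avail_kb=$(df -Pk "$TMP" | awk 'NR==2 {print $4}')
echo "free space in $TMP: ${avail_kb} KB"
# If that looks too small, set the temp dir explicitly, e.g.:
# parallel-fastq-dump --sra-id SRR925794 --threads 32 --gzip --tmpdir /bigdisk/tmp
```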

I am not sure if it is because parallel-fastq-dump is mounting volumes and running it there

neither parallel-fastq-dump nor fastq-dump does anything like that.

Also do you know how I can get it to run any faster. For example do you know how parallel-fastq-dump compares to fasterq-dump

I haven't had the time to test fasterq-dump extensively, but if space is an issue, fasterq-dump will be a problem because it doesn't support compressing the output on the fly.

If you haven't downloaded your target SRA file prior to running fastq-dump, you should; it speeds things up considerably, especially using aspera for the download. The optimal number of threads depends on the number of reads in the SRA file: a small file won't benefit from 40 threads. Think in terms of number of reads per thread.
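That "reads per thread" rule of thumb can be sketched as a tiny calculation. The 2M-reads-per-thread cutoff below is my own illustrative number, not something parallel-fastq-dump computes:

```shell
#!/bin/sh
# Heuristic sketch: cap threads so each worker still gets a sizeable
# chunk of reads (numbers here are assumptions for illustration).
total_reads=100000000     # e.g. from the run's metadata
reads_per_thread=2000000  # arbitrary floor; tune for your hardware
max_threads=32
t=$(( total_reads / reads_per_thread ))
[ "$t" -gt "$max_threads" ] && t=$max_threads
[ "$t" -lt 1 ] && t=1
echo "suggested threads: $t"
```

For a small run (say 1M reads) the same arithmetic lands on a single thread, which is the point: extra threads past that just add splitting and merging overhead.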

PhilPalmer commented 5 years ago

@rvalieris that's great, thank you for your quick reply. I'm still not exactly sure what was causing the error, but I will definitely try using the --tmpdir option.

Also, I sometimes get connection issues, eg:

sys: connection failed while opening file within cryptographic module - mbedtls_ssl_handshake returned -76 ( NET - Reading information from the socket failed )

Do you know what is causing this? Could it be parallel-fastq-dump making too many requests, or is it more likely a general problem with the connection to NCBI? The latter seems more likely, as other people seem to be experiencing the same issue.

Thanks again

rvalieris commented 5 years ago

Yes, that connection error is from fastq-dump connecting to the server. It's usually benign (it will retry the connection automatically), unless you are getting too many of these errors at once; in that case it could mean you have too many connections open, so try reducing the number of threads.
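The retry behavior described above is internal to fastq-dump, but the same idea can be sketched as a generic shell wrapper for any flaky network step. Here do_fetch is a simulated stand-in (it fails twice, then succeeds), not fastq-dump's actual logic:

```shell
#!/bin/sh
# Retry-with-backoff sketch for a flaky network command.
n=0
do_fetch() {
  # Stand-in for the real download step: fails on the first
  # two calls, succeeds on the third (simulated for illustration).
  n=$((n + 1))
  [ "$n" -ge 3 ]
}
attempt=1
max_attempts=5
until do_fetch; do
  if [ "$attempt" -ge "$max_attempts" ]; then
    echo "giving up after $attempt attempts" >&2
    exit 1
  fi
  sleep "$attempt"          # back off a little between retries
  attempt=$((attempt + 1))
done
echo "succeeded after $n tries"
```

In practice the body of do_fetch would be the download invocation, and if the failures persist across many attempts, dropping the thread count (fewer simultaneous connections) is the first thing to try.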

PhilPalmer commented 5 years ago

Okay perfect, thanks for your help