rvalieris / parallel-fastq-dump

parallel fastq-dump wrapper
MIT License

Parallel-fastq-dump has been running for nearly 24 hours #52

Closed ZhangMH2000 closed 1 year ago

ZhangMH2000 commented 1 year ago

Dear developer,

Greetings! I would like to express my gratitude for developing this software. I have been using it since yesterday to convert a 10X single-cell sequencing SRA file, which is 14.8 GB in size. However, the program has been running continuously for nearly 24 hours without producing any output. When I inspected its CPU usage in htop, it never exceeded 100%. I am not sure whether this behavior is normal, so I would like to ask for your opinion. I have been running the program on my home computer with 16 threads, yet the outcome remains unchanged.

The first image shows the command I used and the corresponding output, while the second image shows the CPU utilization as displayed in htop. [Screenshot 1] [Screenshot 2]

I would highly appreciate it if you could provide me with any suggestions or guidance.

Thank you sincerely!

rvalieris commented 1 year ago

hello,

thanks for the screenshots. it looks like it's stuck on the sra-stat command; being stuck like this for hours is definitely not normal. I just tried it here with this same SRR and it finished in 8 seconds, so I'm not sure what the problem could be.

did you try other SRRs? and which version of sra-tools are you using?

try updating the sra-tools package to the latest version (3.0.5), and try running: sra-stat --meta --quick SRR12492114

ZhangMH2000 commented 1 year ago

Hello! Thank you for your reply. The version of my SRA Tools is 3.0.5. When I terminated the program with Ctrl+C, I did get some output, but I neglected to save it. If you need it, I can later upload the output from terminating the program on my home computer, as I believe it is still running there. I just deleted the entire Conda environment on my laptop and reinstalled it using the following command: mamba install parallel-fastq-dump 'sra-tools=3.0.5'

One thing worth mentioning is that I previously stored all SRA files on a mechanical hard drive, while my Ubuntu system is installed on a separate solid-state hard drive. I'm not sure if this could be the cause of the problem. As a workaround, I copied an SRA file to a folder on the Linux system and successfully executed it.

However, I'm now facing another issue. After I enter the command and receive some output, it eventually prints an error and exits. Nevertheless, when I check htop, it shows that all CPU threads are fully loaded and actively processing data. I'm puzzled as to whether the program is still running and why the error occurs in this case. I appreciate any insights you can provide. Thank you very much!

[Screenshot 2023-05-23 21-42-20] [Screenshot 2023-05-23 21-42-49]

rvalieris commented 1 year ago

this error:

timeout exhausted while reading file within network

indicates a connection error while trying to download the data. since the other processes didn't give the same error, I think you can try downloading again and it should work.

first kill the remaining processes:

killall sra-stat
killall fastq-dump

and try running it again.

I don't think the hard drive is the problem; however, I would recommend using the --gzip option to minimize disk usage.

ZhangMH2000 commented 1 year ago

Hi! I just rebooted the system and encountered the same issue at sra-stat once again. This time, I interrupted the program with Ctrl+C, which produced the following output. I have repeated this several times, and the outcome remains unchanged. Can you provide any suggestions or advice? Thank you! [Screenshot 2023-05-23 23-19-15]

rvalieris commented 1 year ago

looks like it's stuck on sra-stat again. I think it is a network issue, but it's hard to know for sure without any logs.

try this command: sra-stat -vvv --meta --quick SRR12492113

you could also try prefetch <SRR> to pre-download the sra before executing fastq-dump

ZhangMH2000 commented 1 year ago

Hi! Thanks for your suggestions. I have tried multiple times, but the program always gets stuck either in sra-stat or in the following steps. By using -vvv, I found that the cause may be the proxy server or other network issues. The traces printed on termination always said 'timeout', which I think can only be due to the network. Following your suggestions, I then searched for how to process the files locally, as I had always assumed this command was meant for processing local files rather than downloading new ones. I'm not sure whether I found the correct usage. Should it be like this?

parallel-fastq-dump --threads 12 --split-files --outdir out/ --sra-id SRR12492113.sra

That is, with the '.sra' suffix, so that it processes the file locally? However, I found that it still tries to download files from the network, which runs into the same issues as before. Could you help me, please?

rvalieris commented 1 year ago

after downloading the sra file with prefetch you should have a directory with the sra file inside, you just need to use the full path to the sra file instead of the ID, for example: parallel-fastq-dump --threads 12 --split-files --outdir out/ --sra-id SRR23948107/SRR23948107.sra

but for this to work you need to make sure prefetch downloaded the complete file first, like this:

$ prefetch SRR23948107

2023-05-24T16:38:00 prefetch.3.0.5: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
2023-05-24T16:38:13 prefetch.3.0.5: 1) Downloading 'SRR23948107'...
2023-05-24T16:38:13 prefetch.3.0.5: SRA Normalized Format file is being retrieved, if this is different from your preference, it may be due to current file availability.
2023-05-24T16:38:13 prefetch.3.0.5:  Downloading via HTTPS...
2023-05-24T16:39:43 prefetch.3.0.5:  HTTPS download succeed
2023-05-24T16:39:43 prefetch.3.0.5:  'SRR23948107' is valid
2023-05-24T16:39:43 prefetch.3.0.5: 1) 'SRR23948107' was downloaded successfully
2023-05-24T16:39:43 prefetch.3.0.5: 'SRR23948107' has 0 unresolved dependencies
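
The prefetch-then-local-path workflow described above can be sketched in Python. This is only an illustration, not part of parallel-fastq-dump: the helper names `local_sra_path` and `parallel_fastq_dump_args` are hypothetical, and the directory layout `<download_dir>/<accession>/<accession>.sra` is assumed from the example path in this comment. The flags are the ones used in the commands quoted in this thread.

```python
from pathlib import Path


def local_sra_path(download_dir, accession):
    """Return <download_dir>/<accession>/<accession>.sra (layout assumed from the thread).

    Raises FileNotFoundError if the file is missing or empty. Note that a
    non-empty file does not prove the download is complete; the prefetch
    log lines ("is valid", "was downloaded successfully") are the real signal.
    """
    path = Path(download_dir) / accession / f"{accession}.sra"
    if not path.is_file() or path.stat().st_size == 0:
        raise FileNotFoundError(f"no usable .sra file at {path}")
    return path


def parallel_fastq_dump_args(sra_path, threads, outdir):
    """Build the argument list, passing the file path (not the ID) to --sra-id."""
    return [
        "parallel-fastq-dump",
        "--threads", str(threads),
        "--split-files",
        "--outdir", str(outdir),
        "--sra-id", str(sra_path),
    ]
```

The resulting list could be handed to `subprocess.run(args, check=True)` once prefetch has reported success.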

ZhangMH2000 commented 1 year ago

Hi, I just checked the files and used the following command:

parallel-fastq-dump --threads 16 --split-files --outdir out/ --gzip --sra-id /media/zhangmh/GSE156337/SRR12492114/SRR12492114.sra

However, the output still included sra-stat. The 'SRR12492114.sra' file was in the 'SRR12492114' folder, which was downloaded in full with prefetch and includes all of the SRA file's dependencies. So I'm confused about why this command still triggers an online lookup. Thank you!

rvalieris commented 1 year ago

is it still getting stuck on sra-stat?

yes, it will include sra-stat because it needs to query how many reads are in the sra file to split the work across threads. however, I don't think it needs the network after the sra was downloaded with prefetch, so it should work.

try this, I think it should work even without internet: sra-stat -vvv --meta --quick /media/zhangmh/GSE156337/SRR12492114/SRR12492114.sra
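
To illustrate why the read count is needed, here is a minimal Python sketch of splitting a file's spots into one contiguous range per thread and turning each range into a fastq-dump call. It is an assumption-laden illustration, not the actual parallel-fastq-dump source: the function names are hypothetical, and the only fastq-dump options relied on are -N (minimum spot ID) and -X (maximum spot ID).

```python
def split_spots(total_spots, threads):
    """Split spots 1..total_spots into contiguous, inclusive (first, last) ranges.

    At most `threads` ranges are returned; the first ranges absorb the
    remainder so sizes differ by at most one spot.
    """
    if total_spots < 1 or threads < 1:
        raise ValueError("total_spots and threads must be positive")
    threads = min(threads, total_spots)  # never create empty ranges
    base, extra = divmod(total_spots, threads)
    ranges = []
    start = 1
    for i in range(threads):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def fastq_dump_commands(sra_path, total_spots, threads, extra_args=()):
    """Build one fastq-dump argument list per spot range (commands are not run here)."""
    return [
        ["fastq-dump", "-N", str(first), "-X", str(last), *extra_args, sra_path]
        for first, last in split_spots(total_spots, threads)
    ]
```

Without the total spot count there is no way to choose the range boundaries, which is why the sra-stat step cannot be dropped from this kind of scheme.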

ZhangMH2000 commented 1 year ago

Hi! Actually, I didn't understand the correct use of this command. When I ran sra-stat -vvv --meta --quick /media/zhangmh/GSE156337/SRR12492114/SRR12492114.sra, the output did show the reads. Some of the output looked like this:

2023-05-25T00:58:34 sra-stat.3.0.5: Seeding the random number generator
2023-05-25T00:58:34 sra-stat.3.0.5: Loading CA root certificates
2023-05-25T00:58:34 sra-stat.3.0.5: Configuring ssl defaults
2023-05-25T00:58:34 sra-stat.3.0.5: KClientHttpOpen - connected from '127.0.0.1' to 127.0.0.1(127.0.0.1)
2023-05-25T06:58:34 sra-stat.3.0.5: KClientHttpOpen - connected from '127.0.0.1' to 127.0.0.1(127.0.0.1)
2023-05-25T00:58:34 sra-stat.3.0.5: KClientHttpOpen - connected from '127.0.0.1' to 127.0.0.1(127.0.0.1)
2023-05-25T00:58:34 sra-stat.3.0.5: KClientHttpOpen - connected from '127.0.0.1' to 127.0.0.1(127.0.0.1)
2023-05-25T00:58:34 sra-stat.3.0.5: Setting up SSL/TLS structure
2023-05-25T00:58:34 sra-stat.3.0.5: Performing SSL/TLS handshake...
2023-05-25T00:58:35 sra-stat.3.0.5: KClientHttpOpen - verifying CA cert
2023-05-25T00:58:35 sra-stat.3.0.5: Verifying peer X.509 certificate...
2023-05-25T00:58:35 sra-stat.3.0.5: Reading from server...
2023-05-25T00:58:35 sra-stat.3.0.5: Reading from server...
2023-05-25T00:58:35 sra-stat.3.0.5: Reading from server...
/media/zhangmh/WD 紫盘/GSE156337/SRR12492114/SRR12492114.sra|CTAGCGAG|96555767:9604609905:9664609905|:|:|:
/media/zhangmh/WD 紫盘/GSE156337/SRR12492114/SRR12492114.sra|GACTACGT|110366566:10978339709:16978339709|:|:|:
/media/zhangmh/WD 紫盘/GSE156337/SRR12492114/SRR12492114.sra|TCTATATC|74525918:7413200374:7413200374|:|:|:
/media/zhangmh/WD 紫盘/GSE156337/SRR12492114/SRR12492114.sra|AGGCGTCA|96503007:9599382455:9599382455|:|:|:
/media/zhangmh/WD 紫盘/GSE156337/SRR12492114/SRR12492114.sra|CTATCGAG|267350:26594340:26594340|:|:|:
/

However, how should I combine this command with the parallel-fastq-dump command? I can't use a command like this:

parallel-fastq-dump -vvv --threads 16 --split-files --outdir out/ --gzip --sra-id /media/zhangmh/GSE156337/SRR12492114/SRR12492114.sra 

So how should I correctly use the command to skip the online sra-stat? Thank you!

rvalieris commented 1 year ago

parallel-fastq-dump needs the data from sra-stat to work, so it's not possible to skip it.

you could try running plain fastq-dump, without the parallel wrapper, to see if that works: fastq-dump --split-files --outdir out/ --gzip /media/zhangmh/GSE156337/SRR12492114/SRR12492114.sra
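
As a rough illustration of the data taken from sra-stat, the sketch below totals the spot counts from `sra-stat --meta --quick` output. It assumes the pipe-separated layout seen in the output quoted earlier in this thread, where the third field looks like `spots:bases:bases` and its first number is the spot count; that reading of the format is an assumption, and the function name `total_spots` is hypothetical.

```python
def total_spots(sra_stat_output):
    """Sum the spot counts over all data lines of `sra-stat --meta --quick` output.

    Assumed line layout: <path>|<barcode>|<spots>:<bases>:<bases>|...
    Lines that do not match it (for example -vvv log lines) are skipped.
    """
    total = 0
    for line in sra_stat_output.splitlines():
        fields = line.split("|")
        if len(fields) < 3:
            continue  # not a data line
        spots = fields[2].split(":")[0].strip()
        if spots.isdigit():
            total += int(spots)
    return total


# Sample modeled on two lines quoted in this thread; the path is shortened.
sample = (
    "2023-05-25T00:58:35 sra-stat.3.0.5: Reading from server...\n"
    "SRR12492114.sra|AGGCGTCA|96503007:9599382455:9599382455|:|:|:\n"
    "SRR12492114.sra|CTATCGAG|267350:26594340:26594340|:|:|:\n"
)
print(total_spots(sample))  # 96770357
```

In practice the text would come from something like `subprocess.check_output(["sra-stat", "--meta", "--quick", path], text=True)`, which is the step that was hanging in this issue.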

ZhangMH2000 commented 1 year ago

Hi! Thanks for your help. It seems that the network issues cannot be solved. I turned to the fasterq-dump tool to convert the downloaded SRA files with their dependencies, and it finally worked. Since fasterq-dump takes as input the path of the directory containing the SRA file and its dependencies, rather than the path of the SRA file itself, it seems that fasterq-dump can fully convert downloaded SRA files to fastq. On my computer with 16 threads and 64 GB of RAM, fasterq-dump took nearly 100 minutes to convert a 15 GB 10X scRNA file. I believe parallel-fastq-dump could make it much quicker. Anyway, thank you sincerely for your kind patience. Best wishes!