Open Midnighter opened 2 years ago
hello, thanks for letting me know.
I don't use nextflow, so excuse me if I misunderstood, but did you make sure that prefetch is also run before parallel-fastq-dump?
also, if you are running multiple prefetch/parallel-fastq-dump/fasterq-dump jobs concurrently, this could be affecting the results.
Happy to answer your questions:
fastq-dump (which actually does the work internally in parallel-fastq-dump) is not very good at downloading, so it would be faster to run prefetch first and use parallel-fastq-dump just to dump the SRA file. I guess it depends on how you want to compare, though; you could:
it's already expected that downloading with fastq-dump is worse than downloading with prefetch.
I see, I had understood parallel-fastq-dump as a "complete package" so I hadn't considered using prefetch first.
yes, and I bet many people run parallel-fastq-dump without realizing this (despite it being explained in the readme), so I wouldn't say it's an "unfair" comparison.
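For anyone reading along, the "prefetch first, then dump" order discussed above could be sketched roughly like this. This is only an illustration: the accession `SRR000001`, the thread count, and the `out` directory are placeholders, and it assumes the tools are on your PATH and that sra-tools is configured so the dump step finds the prefetched file by accession.

```python
import shlex
import subprocess


def build_commands(accession, threads=4, outdir="out"):
    """Return the two commands in order: download first, then dump."""
    # Step 1: let prefetch handle the download.
    prefetch_cmd = ["prefetch", accession]
    # Step 2: parallel-fastq-dump only has to dump the already downloaded SRA.
    dump_cmd = [
        "parallel-fastq-dump",
        "--sra-id", accession,
        "--threads", str(threads),
        "--outdir", outdir,
        "--split-files",
    ]
    return [prefetch_cmd, dump_cmd]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Placeholder accession; the dry run only prints the commands.
commands = build_commands("SRR000001", threads=4, outdir="out")
run_all(commands)
```

Calling `run_all(commands, dry_run=False)` would actually execute both steps, one after the other.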
another interesting point is using --gzip or --bzip2 on parallel-fastq-dump. for big SRAs the size difference between compressed and uncompressed fastq can be very big, so writing compressed files could finish faster (maybe? idk). the problem is that fasterq-dump doesn't support writing compressed files, so you can't compare 1:1.
I think this will be a next step. My plan is to compare the speed of compressed output from parallel-fastq-dump with running fasterq-dump + pigz for parallel compression. Do you know what the compression level is in fastq-dump --gzip?
no idea, but I guess leaving pigz on the default would be reasonable.
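To get a feel for how much the compression level matters, here is a small sketch using Python's `zlib` (the same DEFLATE algorithm that gzip and pigz use) on made-up FASTQ-like data. It says nothing about which level fastq-dump --gzip actually uses, which is unknown in this thread; pigz, like gzip, defaults to level 6. The synthetic reads and the helper names are invented for illustration only.

```python
import random
import time
import zlib


def fake_fastq(n_reads, read_len=100, seed=0):
    """Build synthetic FASTQ-like bytes (random bases, made-up qualities)."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(rng.choice("FGHI") for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def compare_levels(data, levels=(1, 6, 9)):
    """Compress data at each level; return {level: (size_in_bytes, seconds)}."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        # Sanity check: the round trip must give back the original bytes.
        assert zlib.decompress(compressed) == data
        results[level] = (len(compressed), elapsed)
    return results


data = fake_fastq(2000)
results = compare_levels(data)
for level, (size, seconds) in results.items():
    print(f"level {level}: {size} bytes ({size / len(data):.1%} of raw), {seconds:.4f}s")
```

Real reads compress differently from random ones, so treat the numbers only as a rough indication of the size/time trade-off between levels.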
Alright, I've made a new benchmark, first using prefetch for all tools and also including compression. I have summarized the results in the readme: https://github.com/UnseenBio/sra-demo-benchmark/tree/benchmark-prefetch
very interesting, a few comments in no particular order:
don't take this as criticism, it's just things that I thought of while looking at the plots.
I think you make very valid points. I'm currently limited to one desktop computer, though 🙂 So the combinatorial increase in jobs and threads is a bit much for that. If you have more resources available, I'm happy to adjust the pipeline accordingly so that you can run it with one command.
maybe the variance of cpu/mem can be explained by the sra size
This is actually a very minor factor to me. I only made those plots since the data was there 😉 Yes, the distributions are almost bimodal because of the small and large sequences. A different set of input IDs that has a better representation over a large range could be nice indeed.
would parallel-fastq-dump+pigz be faster ?
Possibly. I guess the ideal scenario would be for each process/thread to write compressed output directly in a compatible way so that everything can be stitched together. Or maybe that is how it currently happens? I also still don't know the compression level of fastq-dump --gzip, so it might be higher than pigz's default.
@wraetz (I assume from NCBI) also talked about possible differences depending on the alignment of the SRA files. Although I don't have a relevant set of IDs to test that yet.
yeah I understand adding more comparisons increases the load significantly, I was just wondering really.
Possibly, I guess ideal scenario would be for each process/thread to write compressed output directly in a compatible way so that everything can be stitched together. Or maybe that is how it currently happens?
I'm not sure what you're getting at, but each fastq-dump --gzip writes a separate gzip file concurrently; all parallel-fastq-dump does is cat the compressed files in the correct order at the end. yes, it works as expected.
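The reason plain concatenation works is that the gzip format allows several members back to back, and decompressing such a stream yields the concatenated contents. A minimal sketch with Python's `gzip` module (the tiny FASTQ records are made up for illustration):

```python
import gzip

# Three chunks standing in for the per-process outputs.
chunks = [
    b"@read1\nACGT\n+\nIIII\n",
    b"@read2\nTTGA\n+\nIIII\n",
    b"@read3\nGGCC\n+\nIIII\n",
]

# Each chunk is compressed independently, as separate workers would do.
members = [gzip.compress(chunk) for chunk in chunks]

# "cat" step: join the compressed members in the correct order.
combined = b"".join(members)

# Decompressing the joined stream gives the chunks back, in order.
restored = gzip.decompress(combined)
assert restored == b"".join(chunks)
print(restored.decode("ascii"), end="")
```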
what I was suggesting, though, is that to test your hypothesis that fasterq-dump+pigz is faster due to pigz, you could run parallel-fastq-dump uncompressed + pigz.
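A rough sketch of how the dump and the pigz step could be timed separately, so that the contribution of pigz becomes visible. The helper names `timed` and `dump_then_pigz` are hypothetical and not part of the benchmark repo; the sketch also assumes the dump step writes `*.fastq` files into `outdir` and that pigz is on the PATH.

```python
import glob
import os
import subprocess
import sys
import time


def timed(cmd):
    """Run a command and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def dump_then_pigz(dump_cmd, outdir, threads=4):
    """Time an uncompressed dump, then pigz over the resulting FASTQ files."""
    dump_seconds = timed(dump_cmd)
    # Assumption: the dump step leaves uncompressed *.fastq files in outdir.
    fastqs = sorted(glob.glob(os.path.join(outdir, "*.fastq")))
    pigz_seconds = timed(["pigz", "-p", str(threads), *fastqs]) if fastqs else 0.0
    return dump_seconds, pigz_seconds


# Harmless demo: time a no-op Python process instead of a real dump.
noop_seconds = timed([sys.executable, "-c", "pass"])
print(f"no-op took {noop_seconds:.3f}s")
```

For a real run, `dump_cmd` could be something like `["parallel-fastq-dump", "--sra-id", "SRR000001", "--threads", "4", "--outdir", "out", "--split-files"]` or `["fasterq-dump", "SRR000001", "--threads", "4", "--outdir", "out"]` (placeholder accession), after prefetch has already been run for both.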
Hi,
This is more for your information and not an issue.
I wanted to let you know about a comparison that I ran between parallel-fastq-dump and sra-tools prefetch + fasterq-dump. You can find the code and results in this repo. This is the way that I invoke parallel-fastq-dump, so if you see some problem or tweak, or think that it is an unfair comparison, please let me know.