Benchmark comparison - Githubissues

Midnighter commented 2 years ago

Hi,

This is more for your information and not an issue.

I wanted to let you know about a comparison that I ran between parallel-fastq-dump and sra-tools prefetch + fasterq-dump. You can find the code and results in this repo.

This is the way that I invoke parallel-fastq-dump so if you see some problem, tweak, or think that it is an unfair comparison, please let me know.

rvalieris commented 2 years ago

hello, thanks for letting me know.

I don't use nextflow, so excuse me if I misunderstood, but did you make sure that prefetch is also run before parallel-fastq-dump?

also, if you are running multiple prefetch/parallel-fastq-dump/fasterq-dump processes concurrently, this could be affecting the results.

Midnighter commented 2 years ago

Happy to answer your questions:

I ran every process sequentially in order to not affect bandwidth.
The plot in the readme shows the recorded durations of parallel-fastq-dump on one axis against the total duration (sum) of running prefetch + fasterq-dump on the other axis. Nextflow creates separate workspaces for each process, so prefetch has no influence on parallel-fastq-dump.

rvalieris commented 2 years ago

fastq-dump (which actually does the work internally in parallel-fastq-dump) is not very good at downloading stuff, so it would be faster to run prefetch first and use parallel-fastq-dump just to dump the sra. But I guess it depends on how you want to compare. You could:

run both parallel-fastq-dump and fasterq-dump without prefetch, to compare download+dumping times
run both parallel-fastq-dump and fasterq-dump with prefetch first, to compare just the dumping times

it's already expected that downloading with fastq-dump is worse than downloading with prefetch.
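
A minimal sketch of the two comparison setups listed above, written as command lists rather than a definitive benchmark. The accession `SRR000001`, the thread count, and the `out` directory are placeholders, and the helper name `dump_commands` is hypothetical. It assumes both dump tools pick up the file fetched by prefetch when given the same accession, which depends on the local sra-tools configuration.

```python
import shlex
from typing import Dict, List


def dump_commands(
    sra_id: str, threads: int = 4, outdir: str = "out", use_prefetch: bool = True
) -> Dict[str, List[List[str]]]:
    """Return the command sequence for each tool, with or without prefetch."""
    # Optional download step shared by both tools.
    pre = [["prefetch", sra_id]] if use_prefetch else []
    return {
        "parallel-fastq-dump": pre + [[
            "parallel-fastq-dump", "--sra-id", sra_id,
            "--threads", str(threads), "--outdir", outdir, "--split-files",
        ]],
        "fasterq-dump": pre + [[
            "fasterq-dump", sra_id,
            "--threads", str(threads), "--outdir", outdir, "--split-files",
        ]],
    }


if __name__ == "__main__":
    # Print both setups: download+dump, then dump only after prefetch.
    for use_prefetch in (False, True):
        print(f"# use_prefetch={use_prefetch}")
        for tool, commands in dump_commands("SRR000001", use_prefetch=use_prefetch).items():
            print(f"## {tool}")
            for cmd in commands:
                print(shlex.join(cmd))
```

Timing each printed sequence as a whole gives download+dumping times without prefetch, while timing only the last command of each sequence with prefetch isolates the dumping time.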

Midnighter commented 2 years ago

I see, I had understood parallel-fastq-dump as a "complete package" so I hadn't considered using prefetch first.

rvalieris commented 2 years ago

yes, and I bet many people run parallel-fastq-dump without realizing this (despite it being explained in the readme), so I wouldn't say it's an "unfair" comparison.

another interesting point is using --gzip or --bzip2 on parallel-fastq-dump. for big SRAs the size difference between compressed and uncompressed fastq can be very big, so writing compressed files could finish faster (maybe? idk). the problem is that fasterq-dump doesn't support writing compressed files, so you can't compare 1:1

Midnighter commented 2 years ago

another interesting point is using --gzip or --bzip2 on parallel-fastq-dump. for big SRAs the size difference between compressed and uncompressed fastq can be very big, so writing compressed files could finish faster (maybe? idk). the problem is that fasterq-dump doesn't support writing compressed files, so you can't compare 1:1

I think this will be a next step. My plan is to compare the speed of compressed output from parallel-fastq-dump with running fasterq-dump + pigz for parallel compression. Do you know what the compression level is in fastq-dump --gzip?

rvalieris commented 2 years ago

Do you know what the compression level is in fastq-dump --gzip?

no idea, but I guess leaving pigz on the default would be reasonable.
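
Since the level used by fastq-dump --gzip is unknown in this thread, one rough way to gauge how much the level matters is to compress the same FASTQ-like data at several levels. The sketch below uses Python's gzip module on synthetic records. The record generator, read counts, and helper names are made up for illustration, and it says nothing about what fastq-dump does internally. For reference, gzip and pigz both default to level 6, so the levels are passed explicitly here.

```python
import gzip
import random
from typing import Dict, Iterable


def synthetic_fastq(n_reads: int = 2000, read_len: int = 100, seed: int = 0) -> bytes:
    """Build FASTQ-like records with random bases and constant quality (illustrative only)."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{'I' * read_len}\n")
    return "".join(records).encode("ascii")


def compressed_sizes(data: bytes, levels: Iterable[int] = (1, 6, 9)) -> Dict[int, int]:
    """Map each gzip compression level to the resulting size in bytes."""
    return {level: len(gzip.compress(data, compresslevel=level)) for level in levels}


raw = synthetic_fastq()
sizes = compressed_sizes(raw)
for level, size in sizes.items():
    print(f"level {level}: {size} bytes ({size / len(raw):.1%} of raw)")
```

Real reads have more varied quality strings than this synthetic data, so the ratios printed here are only indicative.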

Midnighter commented 2 years ago

Alright, I've made a new benchmark first using prefetch for all tools and also including compression. I have summarized the results in the readme https://github.com/UnseenBio/sra-demo-benchmark/tree/benchmark-prefetch

rvalieris commented 2 years ago

very interesting, a few comments in no particular order:

maybe the variance of cpu/mem can be explained by the sra size
maybe you can repeat each sra 3 times and take the mean/median to account for weird stuff
does this trend hold if you use 8 threads? 16 threads?
would parallel-fastq-dump+pigz be faster ?

don't take this as criticism, these are just things that I thought of while looking at the plots.
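
The repeat-and-summarize suggestion above could be sketched like this. The helper names `time_repeated` and `summarize` are hypothetical, and `run` stands in for whatever launches one download or dump job; nothing here reflects how the actual Nextflow pipeline records durations.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_repeated(run: Callable[[], None], repeats: int = 3) -> List[float]:
    """Run the same job several times and return the wall-clock duration of each run."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Mean and median of the durations; the median is less sensitive to one odd run."""
    return {
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
    }


if __name__ == "__main__":
    # Placeholder job: sleep briefly instead of running a real dump.
    print(summarize(time_repeated(lambda: time.sleep(0.01), repeats=3)))
```

With three runs of 1, 2, and 9 seconds, the mean is 4 seconds but the median stays at 2, which is why the median is often preferred when one run hits "weird stuff".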

Midnighter commented 2 years ago

I think you make very valid points. I'm currently limited to one desktop computer, though 🙂 So the combinatorial increase in jobs and threads is a bit much for that. If you have more resources available, I'm happy to adjust the pipeline accordingly so that you can run it with one command.

maybe the variance of cpu/mem can be explained by the sra size

This is actually a very minor factor to me. I only made those plots since the data was there 😉 Yes, the distributions are almost bimodal because of the small and large sequences. A different set of input IDs that has a better representation over a large range could be nice indeed.

would parallel-fastq-dump+pigz be faster ?

Possibly. I guess the ideal scenario would be for each process/thread to write compressed output directly in a compatible way so that everything can be stitched together. Or maybe that is how it currently happens? I also still don't know the compression level of fastq-dump --gzip, so it might be higher than pigz's default.

@wraetz (I assume from NCBI) also talked about possible differences depending on the alignment of the SRA files, although I don't have a relevant set of IDs to test that yet.

rvalieris commented 2 years ago

yeah, I understand that adding more comparisons increases the load significantly, I was just wondering really.

Possibly. I guess the ideal scenario would be for each process/thread to write compressed output directly in a compatible way so that everything can be stitched together. Or maybe that is how it currently happens?

I'm not sure what you're getting at, but each fastq-dump --gzip process writes a separate gzip file concurrently; all parallel-fastq-dump does is cat the compressed files in the correct order at the end. so yes, it works as expected.

what I was suggesting, though, is that to test your hypothesis that fasterq-dump+pigz is faster due to pigz, you could run parallel-fastq-dump uncompressed + pigz.
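
The "cat the compressed files" step mentioned above works because the gzip format allows several compressed members to be concatenated into one valid stream. A small sketch of that property using Python's gzip module follows; the two FASTQ records and the helper name `concat_gzip_members` are invented for illustration, and this is not parallel-fastq-dump's own code.

```python
import gzip
from typing import List


def concat_gzip_members(chunks: List[bytes]) -> bytes:
    """Compress each chunk independently, then join the gzip members in order."""
    return b"".join(gzip.compress(chunk) for chunk in chunks)


# Two made-up FASTQ records, as if written by two separate workers.
chunks = [
    b"@r1\nACGT\n+\nIIII\n",
    b"@r2\nTTGA\n+\nIIII\n",
]
combined = concat_gzip_members(chunks)

# Decompressing the joined stream yields the chunks in their original order.
restored = gzip.decompress(combined)
print(restored.decode("ascii"), end="")
```

Because each member is compressed independently, the joined file can be slightly larger than compressing everything in one pass, but the workers never need to coordinate.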

rvalieris / parallel-fastq-dump

Benchmark comparison #41