mbhall88 / rasusa

Randomly subsample sequencing reads or alignments
https://doi.org/10.21105/joss.03941
MIT License

Suggestion: replace needletail by noodles and niffler #25

Closed natir closed 1 day ago

natir commented 3 years ago

Hello, very nice work.

Needletail is a very nice crate, but unless I'm mistaken you use it only for FASTX parsing and none of its other functionality.

Noodles is a crate that provides many bioinformatics parsers, along with a feature system that lets you pull in only what you need. By switching to noodles you could reduce rasusa's number of dependencies and speed up compilation. I haven't run a full benchmark, but the noodles and needletail parsers have almost the same code.

If you want to keep similar functionality (compression support), you will also need to add niffler, which provides simple and transparent support for compressed files.

Again, very nice work.

mbhall88 commented 3 years ago

Hey @natir.

Yes, I only just switched the parsing to needletail, so I probably won't get around to switching it again anytime soon. Also, compile time isn't a major concern for me, especially since I distribute pre-compiled binaries and offer a bunch of other installation methods that mean users don't need to compile the project. I'm happy to review a PR with an updated benchmark though.

Regarding niffler, you've made me realise somewhere along the line I have lost the compressed output functionality of this tool... Originally rasusa would infer the desired output compression from the path. I'll have to fix that.
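
To illustrate what inferring the output compression from the path could look like, here is a minimal sketch using only the standard library. The `Compression` enum and the `infer_compression` function are hypothetical names made up for this example; they are not rasusa's or niffler's actual API, and the list of recognised extensions is an assumption.

```rust
use std::path::Path;

/// Output compression formats this sketch knows about.
/// Hypothetical type for illustration, not rasusa's or niffler's own.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Compression {
    Gzip,
    Bzip2,
    Xz,
    Zstd,
    None,
}

/// Guess the desired output compression from the final file extension.
/// Matching is case-insensitive; anything unrecognised means no compression.
fn infer_compression(path: &Path) -> Compression {
    let ext = path
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase());

    match ext.as_deref() {
        Some("gz") => Compression::Gzip,
        Some("bz2") => Compression::Bzip2,
        Some("xz") => Compression::Xz,
        Some("zst") => Compression::Zstd,
        _ => Compression::None,
    }
}

fn main() {
    // Assumed example file names; only the last extension is inspected.
    for name in ["out.fq.gz", "out.fastq.bz2", "out.fq"] {
        println!("{} -> {:?}", name, infer_compression(Path::new(name)));
    }
}
```

Only the last extension is considered, so `reads.fq.gz` maps to gzip while `reads.fq` falls through to no compression. A real tool would probably also want an explicit option to override the guess.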

mbhall88 commented 1 day ago

I have always been interested in what the speed difference is between these two, so I ran a small benchmark:

tl;dr needletail is faster: about 2.4 s vs 7.7 s per iteration, roughly 3x

Benchmarking needletail FASTQ parsing: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 49.7s, or reduce sample count to 10.
needletail FASTQ parsing
                        time:   [2.4155 s 2.4311 s 2.4522 s]
                        change: [-0.2443% +0.4751% +1.3345%] (p = 0.18 > 0.05)
                        No change in performance detected.
Found 3 outliers among 20 measurements (15.00%)
  2 (10.00%) high mild
  1 (5.00%) high severe

Benchmarking noodles-fastq FASTQ parsing: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 157.0s, or reduce sample count to 10.
noodles-fastq FASTQ parsing
                        time:   [7.6495 s 7.7057 s 7.7732 s]
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) high mild