mbhall88 / rasusa

Randomly subsample sequencing reads or alignments
https://doi.org/10.21105/joss.03941
MIT License
211 stars 17 forks source link

Speedup by replace `HashSet` by a `Vec<bool>` #28

Closed natir closed 3 years ago

natir commented 3 years ago

Hello,

According to #26 I create this pull request to replace HashSet indices storage by a Vec of boolean.

I made a simple benchmark by keeping 250x of 500x of nanopore simulated reads of E. coli k12, with a fixed seed (30 samples).

min [s] max [s] mean [s] relative
set 5.12 11.08 8.507 1
vec 3.82 11.48 7.674 0.90

I Indicate in issue #26 HashSet operation take 10 % of rasusa run time (I use same parameter), replace HashSet by Vec<bool> save this time. Output are identical.

Some points need discussion:

  1. I can remove benchmark if you want.
  2. methode indices return a Vec<bool> and the number of read selected (as usize), because we can't get the number of reads selected by get the length of Vec
  3. I change filter_reads_into signature to add nb_reads_keep parameter to keep previous behavior. But with this pull request we iterate over all input reads, so I think we can remove the final filter_reads_into test. What is your opinion about this ?

I try to change test less as possible.

natir commented 3 years ago

I would like to point out that the gain in computation time is certainly very dependent on the number of reads selected.

natir commented 3 years ago

Check failed because bitvec minimum supported rust version is 1.47.0.

So I assume this answer the first point, I must remove benchmark or at least the part with bitvec.

mbhall88 commented 3 years ago

Thank you very much @natir

Here is the benchmark from the readme using this version and the previous version against seqtk

$ hyperfine --warmup 10 --runs 50 --export-markdown results-paired.md \
     "$SEQTK_CMD_1" "$SEQTK_CMD_2" "$RASUSA_CMD" "$INDICES_CMD"
Benchmark #1: seqtk sample -s 1 r1.fq 147052 > /tmp/r1.fq; seqtk sample -s 1 r2.fq 147052 > /tmp/r2.fq;
  Time (mean ± σ):     693.8 ms ±  17.4 ms    [User: 398.0 ms, System: 293.1 ms]
  Range (min … max):   678.8 ms … 780.5 ms    50 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help t
o use the '--warmup' or '--prepare' options.

Benchmark #2: seqtk sample -2 -s 1 r1.fq 147052 > /tmp/r1.fq; seqtk sample -2 -s 1 r2.fq 147052 > /tmp/r2.fq;
  Time (mean ± σ):     739.5 ms ±  28.4 ms    [User: 502.9 ms, System: 235.1 ms]
  Range (min … max):   695.6 ms … 811.8 ms    50 runs

Benchmark #3: ./rasusav042 -i r1.fq r2.fq -c 20 -g 4411532 -s 1 -o /tmp/r1.fq -o /tmp/r2.fq
  Time (mean ± σ):     655.4 ms ±  84.9 ms    [User: 191.5 ms, System: 300.3 ms]
  Range (min … max):   473.6 ms … 824.7 ms    50 runs

Benchmark #4: ./indices -i r1.fq r2.fq -c 20 -g 4411532 -s 1 -o /tmp/r1.fq -o /tmp/r2.fq
  Time (mean ± σ):     574.7 ms ±  88.1 ms    [User: 128.2 ms, System: 284.6 ms]
  Range (min … max):   432.3 ms … 784.4 ms    50 runs

Summary
  './indices -i r1.fq r2.fq -c 20 -g 4411532 -s 1 -o /tmp/r1.fq -o /tmp/r2.fq' ran
    1.14 ± 0.23 times faster than './rasusav042 -i r1.fq r2.fq -c 20 -g 4411532 -s 1 -o /tmp/r1.fq -o /tmp/r2.fq'
    1.21 ± 0.19 times faster than 'seqtk sample -s 1 r1.fq 147052 > /tmp/r1.fq; seqtk sample -s 1 r2.fq 147052 > /tmp/r2.fq;'
    1.29 ± 0.20 times faster than 'seqtk sample -2 -s 1 r1.fq 147052 > /tmp/r1.fq; seqtk sample -2 -s 1 r2.fq 147052 > /tmp/r2.fq;'

So using a Vec<bool> is indeed a little bit fast than using a HashSet