Closed natir closed 3 years ago
I would like to point out that the gain in computation time is certainly very dependent on the number of reads selected.
Check failed because bitvec minimum supported rust version is 1.47.0.
So I assume this answer the first point, I must remove benchmark or at least the part with bitvec.
Thank you very much @natir
Here is the benchmark from the readme using this version and the previous version against seqtk
$ hyperfine --warmup 10 --runs 50 --export-markdown results-paired.md \
"$SEQTK_CMD_1" "$SEQTK_CMD_2" "$RASUSA_CMD" "$INDICES_CMD"
Benchmark #1: seqtk sample -s 1 r1.fq 147052 > /tmp/r1.fq; seqtk sample -s 1 r2.fq 147052 > /tmp/r2.fq;
Time (mean ± σ): 693.8 ms ± 17.4 ms [User: 398.0 ms, System: 293.1 ms]
Range (min … max): 678.8 ms … 780.5 ms 50 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help t
o use the '--warmup' or '--prepare' options.
Benchmark #2: seqtk sample -2 -s 1 r1.fq 147052 > /tmp/r1.fq; seqtk sample -2 -s 1 r2.fq 147052 > /tmp/r2.fq;
Time (mean ± σ): 739.5 ms ± 28.4 ms [User: 502.9 ms, System: 235.1 ms]
Range (min … max): 695.6 ms … 811.8 ms 50 runs
Benchmark #3: ./rasusav042 -i r1.fq r2.fq -c 20 -g 4411532 -s 1 -o /tmp/r1.fq -o /tmp/r2.fq
Time (mean ± σ): 655.4 ms ± 84.9 ms [User: 191.5 ms, System: 300.3 ms]
Range (min … max): 473.6 ms … 824.7 ms 50 runs
Benchmark #4: ./indices -i r1.fq r2.fq -c 20 -g 4411532 -s 1 -o /tmp/r1.fq -o /tmp/r2.fq
Time (mean ± σ): 574.7 ms ± 88.1 ms [User: 128.2 ms, System: 284.6 ms]
Range (min … max): 432.3 ms … 784.4 ms 50 runs
Summary
'./indices -i r1.fq r2.fq -c 20 -g 4411532 -s 1 -o /tmp/r1.fq -o /tmp/r2.fq' ran
1.14 ± 0.23 times faster than './rasusav042 -i r1.fq r2.fq -c 20 -g 4411532 -s 1 -o /tmp/r1.fq -o /tmp/r2.fq'
1.21 ± 0.19 times faster than 'seqtk sample -s 1 r1.fq 147052 > /tmp/r1.fq; seqtk sample -s 1 r2.fq 147052 > /tmp/r2.fq;'
1.29 ± 0.20 times faster than 'seqtk sample -2 -s 1 r1.fq 147052 > /tmp/r1.fq; seqtk sample -2 -s 1 r2.fq 147052 > /tmp/r2.fq;'
So using a Vec<bool>
is indeed a little bit fast than using a HashSet
Hello,
According to #26 I create this pull request to replace
HashSet
indices storage by a Vec of boolean.I made a simple benchmark by keeping 250x of 500x of nanopore simulated reads of E. coli k12, with a fixed seed (30 samples).
I Indicate in issue #26
HashSet
operation take 10 % of rasusa run time (I use same parameter), replaceHashSet
byVec<bool>
save this time. Output are identical.Some points need discussion:
indices
return aVec<bool>
and the number of read selected (as usize), because we can't get the number of reads selected by get the length of Vecfilter_reads_into
signature to addnb_reads_keep
parameter to keep previous behavior. But with this pull request we iterate over all input reads, so I think we can remove the finalfilter_reads_into
test. What is your opinion about this ?I try to change test less as possible.