mbhall88 / rasusa

Randomly subsample sequencing reads or alignments
https://doi.org/10.21105/joss.03941
MIT License

`HashSet` is slow #26

Closed natir closed 3 years ago

natir commented 3 years ago

Hello,

While reading through the rasusa code I noticed a possible speed-up, but before making a pull request I would prefer to discuss it, since there are multiple options with different drawbacks.

Rasusa uses a `HashSet` to store the indices of the selected reads; we could find a faster solution.

By default, Rust's hashing algorithm is resistant to HashDoS attacks. That can be useful in some situations, but I don't think it is for rasusa, and the resistance has a cost in computation time. You could replace the standard `HashSet` with the ones from the rustc_hash or fxhash crates; they provide drop-in replacement set types that are faster.
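
A drop-in swap might look roughly like this (a minimal sketch, not rasusa's actual code; the helper name is hypothetical and it assumes `rustc_hash` as a dependency):

```rust
use rustc_hash::FxHashSet; // drop-in replacement for std::collections::HashSet

// Hypothetical helper: collect the chosen read indices into a set.
// FxHashSet has the same API as HashSet but uses a fast,
// non-DoS-resistant hasher, which is fine for internal read indices.
fn collect_indices(picked: &[usize]) -> FxHashSet<usize> {
    let mut indices = FxHashSet::default(); // note: default(), not new()
    for &i in picked {
        indices.insert(i);
    }
    indices
}
```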

In fact, we could replace the `HashSet` with a `Vec<bool>` holding true or false depending on whether the read at that index is selected. This has a higher memory cost (1 byte per index), but I'm pretty sure it is faster. To reduce the memory usage to 1 bit per index we could use the bitvec crate.
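
Sketched out, the idea could look like this (hypothetical names; assumes the total number of reads is known when the storage is built):

```rust
// Build a marker vector: entry i is true if read i is part of the subsample.
fn build_selected(total_reads: usize, picked: &[usize]) -> Vec<bool> {
    let mut selected = vec![false; total_reads];
    for &i in picked {
        selected[i] = true;
    }
    selected
}

// Membership is a plain index lookup, no hashing involved.
fn is_selected(selected: &[bool], read_index: usize) -> bool {
    selected[read_index]
}
```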

What is your opinion on this issue and the possible solutions? If you want, I can write a benchmark to compare them (I like benchmarks).

Best

mbhall88 commented 3 years ago

Thanks for this @natir. I appreciate the suggestions!

I had originally intended to use a bit vector actually. The reason I decided on a set, though, was the following:
When outputting the reads we are keeping in the subsample, we remove each read's index from the set, so the set gradually shrinks. When the set is empty we exit the subsampling function https://github.com/mbhall88/rasusa/blob/1135210f5173b4cfc48f01703be78777862d88f9/src/fastx.rs#L177-L179

This way we save a small amount of time (theoretically) by not iterating through the entire file again - unless the last read in the file is part of the subsample.
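
For context, the pattern described above is roughly this (a simplified sketch, not the actual rasusa code; names and types are placeholders):

```rust
use std::collections::HashSet;

// Write out the reads whose index is in `indices`, removing each index as it
// is used, and stop scanning as soon as the set is empty.
fn write_subsample(num_reads: usize, mut indices: HashSet<usize>) {
    for i in 0..num_reads {
        if indices.remove(&i) {
            // in rasusa this is where the read would be written to the output
        }
        if indices.is_empty() {
            break; // early exit: no need to read the rest of the file
        }
    }
}
```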

However, upon reflection, a bit vector may still be faster, as lookups avoid hashing entirely and, as you've pointed out, hashing can be expensive. (Although I suspect the difference is only on the order of milli/microseconds overall.)

I would be interested to see a benchmark of the three approaches: the current one, a set with a different hasher, and a bit vector. However, I appreciate that is a bit of work. If you want to put together a benchmark I would very much appreciate it and would happily take whichever approach is fastest - provided memory usage stays low(ish).

natir commented 3 years ago

First of all, I did a simple profiling analysis with flamegraph: 10 % of rasusa's runtime is spent in `HashSet` `contains` and `remove`. So it's not where rasusa spends most of its time, but improving it is easy and would have an impact.

I wrote a small benchmark in my indices_storage branch; I tried to stay as close as possible to what rasusa actually does.

First, I compared different methods for creating the indices storage (0.2 % of rasusa runtime):

- hash: the classic `HashSet`
- rustc_hash: `rustc_hash::FxHashSet`
- vector: store booleans in a vector, true if the index is selected
- bitvec: like vector but using a bitvec (I'm not happy with this implementation)

```
create/hash             time:   [235.87 us 238.62 us 242.05 us]
create/rustc_hash       time:   [86.565 us 87.738 us 89.163 us]
create/vector           time:   [13.918 us 13.978 us 14.044 us]
create/bitvec           time:   [732.96 us 748.85 us 765.80 us]
```

The fastest method is vector, 17 times faster than hash. Regarding memory usage, hash and rustc_hash are harder to estimate but they are linear in the number of selected reads, while vector and bitvec are linear in the total number of reads. For vector and bitvec we could save some RAM by truncating after the last selected index.
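
The truncation idea could be sketched like this (a hypothetical helper, not from the benchmark branch):

```rust
// Drop the trailing run of `false` entries: no read after the last selected
// index is ever kept, so the marker vector can stop there.
fn truncate_to_last_selected(selected: &mut Vec<bool>) {
    if let Some(last) = selected.iter().rposition(|&s| s) {
        selected.truncate(last + 1);
    }
}
```

Lookups past the truncated length would then need a bounds-checked `get`, treating out-of-range indices as not selected.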

Second, I compared methods for reading the indices storage structure (10 % of rasusa runtime):

- remove: when we reach a selected index we remove it from the `HashSet`; when the set is empty we stop iterating
- count: when we reach a selected index we just decrement a counter; when the counter reaches 0 we stop iterating
- full: we iterate over the whole vector

```
read/hash_remove        time:   [2.1765 ms 2.2002 ms 2.2265 ms]
read/hash_count         time:   [2.0811 ms 2.1786 ms 2.2642 ms]
read/rustc_hash_remove  time:   [834.15 us 844.61 us 856.79 us]
read/rustc_hash_count   time:   [450.59 us 460.17 us 470.73 us]
read/vector_full        time:   [130.23 us 131.96 us 133.79 us]
read/vector_count       time:   [173.10 us 177.20 us 181.82 us]
read/bitvec_full        time:   [442.90 us 458.02 us 475.87 us]
read/bitvec_count       time:   [396.12 us 403.98 us 412.27 us]
```

Again vector is the fastest, 16 times faster than hash. I assume the full method is faster than the count method because it avoids an extra test per read.
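
As a rough illustration of the two vector read strategies (a hypothetical sketch; in rasusa the loop body would write the read to the output instead of counting):

```rust
// "full": scan every entry of the marker vector, no early exit.
fn read_full(selected: &[bool]) -> usize {
    selected.iter().filter(|&&s| s).count()
}

// "count": same scan, but keep a countdown of remaining selected reads and
// stop early; the extra branch per read is the test mentioned above.
fn read_count(selected: &[bool], mut remaining: usize) -> usize {
    let mut kept = 0;
    for &s in selected {
        if s {
            kept += 1;
            remaining -= 1;
            if remaining == 0 {
                break;
            }
        }
    }
    kept
}
```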

I think vector is a very good solution and the memory-usage drawback isn't too large (1 byte per read), but if you prefer to keep the set approach, replacing the standard set with the rustc one is easy and efficient (~5 times faster).

mbhall88 commented 3 years ago

Thank you so much for this @natir! This is amazing. It seems a plain ol' vector is definitely the best solution. Even at 1 byte per read, a file would have to have a billion reads (1GB) for the memory to really be of any concern to me.

If you would like to switch the implementation to use a vector I will gladly accept a PR, otherwise I will try and do this in the next week or two.