onecodex / finch-rs

A genomic minhashing implementation in Rust
https://www.onecodex.com
MIT License

RefSeq comparisons are slow #13

Closed bovee closed 6 years ago

bovee commented 6 years ago

It takes ~3.5 minutes to do a full comparison of a small test genome against the 315Mb k=21/n=1000 RefSeq database sketch file. While this isn't super slow, it would be nice if this were more firmly in the <1 minute range.

98.6% of this time (as determined by Instruments) is spent deserializing the sketch JSON, while only 0.8% is spent doing all the comparisons. :/ The easiest way to close this is to periodically check whether there are any speed improvements in Serde and update accordingly. For the upstream issue, see:

https://github.com/serde-rs/json/issues/160

bovee commented 6 years ago

It looks like we can speed this up a lot just by putting a BufReader in front of serde. On my machine, I see a 17x speed-up (loading that refseq_sketches_21_1000.sk file went down to 12 sec).
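For anyone wondering why buffering matters so much here, below is a rough, std-only sketch rather than the actual finch code. It assumes the parser pulls its input roughly one byte at a time, so every byte becomes a separate `read` on the underlying source unless a `BufReader` sits in between. `CountingReader`, `consume_bytewise`, `unbuffered_reads` and `buffered_reads` are hypothetical names made up for this illustration; the counter stands in for the system calls a real file would incur.

```rust
use std::io::{self, BufReader, Read};

/// Hypothetical wrapper that counts how often `read` is called on the
/// underlying source (a stand-in for system calls on a real file).
struct CountingReader<R> {
    inner: R,
    calls: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.calls += 1;
        self.inner.read(buf)
    }
}

/// Consume a reader one byte at a time, as an assumed model of how a
/// streaming parser pulls its input. Returns the number of bytes seen.
fn consume_bytewise<R: Read>(reader: R) -> io::Result<usize> {
    let mut seen = 0;
    for byte in reader.bytes() {
        byte?; // propagate any I/O error
        seen += 1;
    }
    Ok(seen)
}

/// Returns (bytes consumed, reads issued to the source) with no buffering.
fn unbuffered_reads(data: &[u8]) -> io::Result<(usize, usize)> {
    let mut source = CountingReader { inner: data, calls: 0 };
    let seen = consume_bytewise(&mut source)?;
    Ok((seen, source.calls))
}

/// Same as above, but with a `BufReader` of the given capacity in front,
/// so the source is only hit when the internal buffer runs empty.
fn buffered_reads(data: &[u8], capacity: usize) -> io::Result<(usize, usize)> {
    let mut source = CountingReader { inner: data, calls: 0 };
    let seen = consume_bytewise(BufReader::with_capacity(capacity, &mut source))?;
    Ok((seen, source.calls))
}
```

Calling both functions on the same input should show roughly one source read per byte in the unbuffered case and only a handful in the buffered case. With serde_json, the equivalent change would presumably amount to passing `BufReader::new(file)` instead of the bare `File` to `serde_json::from_reader`.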

boydgreenfield commented 6 years ago

@bovee 12 seconds feels like something we can live with. :)