sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

speeding up SQLite classes - notes on future work #1972

Open ctb opened 2 years ago

ctb commented 2 years ago

#1808 adds support for SQLite, and is well enough tested (IMO :) that we could start fearlessly refactoring.

see blog post, http://ivory.idyll.org/blog/2022-storing-ulong-in-sqlite-sourmash.html, for some more background.

ideas for future optimization -

ctb commented 2 years ago

as a side note for @luizirber, for scaled = 2, you actually have precisely one hash that is larger than MAX_SQLITE_INT, which is why I didn't push more on this in #1808!

>>> import sourmash
>>> mh = sourmash.MinHash(0, 31, scaled=2)
>>> mh._max_hash
9223372036854775808
>>> 2**63 - 1
9223372036854775807
>>> mh.add_hash(9223372036854775808)
>>> len(mh)
1
>>> mh.add_hash(9223372036854775807)
>>> len(mh)
2
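
For anyone who wants to see the same boundary from the SQLite side, here is a rough sketch using only Python's standard `sqlite3` module. It shows that the one hash above `MAX_SQLITE_INT` for scaled=2 cannot be bound as a plain integer, and that reinterpreting it as a signed 64-bit value round-trips. The `hashes` table and the `to_signed` / `to_unsigned` helpers are made up for illustration; this is one possible workaround, not necessarily what #1808 does.

```python
import sqlite3

MAX_SQLITE_INT = 2**63 - 1  # largest value a SQLite INTEGER can hold


def to_signed(h):
    """Hypothetical helper: reinterpret an unsigned 64-bit hash as signed 64-bit."""
    return h - 2**64 if h > MAX_SQLITE_INT else h


def to_unsigned(v):
    """Hypothetical helper: invert to_signed."""
    return v + 2**64 if v < 0 else v


conn = sqlite3.connect(":memory:")
# Illustrative schema only; not the schema sourmash uses.
conn.execute("CREATE TABLE hashes (hashval INTEGER NOT NULL)")

big = 2**63  # the single hash above MAX_SQLITE_INT when scaled=2

# Binding the raw unsigned value fails in Python's sqlite3 module.
try:
    conn.execute("INSERT INTO hashes (hashval) VALUES (?)", (big,))
    raw_insert_ok = True
except OverflowError:
    raw_insert_ok = False

# The signed reinterpretation fits, and converts back on the way out.
for h in (big, MAX_SQLITE_INT):
    conn.execute("INSERT INTO hashes (hashval) VALUES (?)", (to_signed(h),))

stored = sorted(to_unsigned(v) for (v,) in conn.execute("SELECT hashval FROM hashes"))
print(raw_insert_ok, stored)
```

One caveat with this kind of mapping: signed order is not the same as unsigned order, so a range query such as `hashval <= ?` against a max hash would need extra care for any value stored as a negative number.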