sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

brainstorming: alternative signature storage/loading/query formats #1262

Open ctb opened 3 years ago

ctb commented 3 years ago

from https://github.com/dib-lab/sourmash/issues/1226#issuecomment-748382043, @luizirber says:

> The challenge then becomes recalculating the reference sigs with smaller scaled values (100?), and efficiently storing it. JSON + gzip for sigs is at the limit for sizes, but not sure what would be a good format that maintains good archival/self-describing/easy to parse/small trade-offs.

a couple of thoughts here -

luizirber commented 3 years ago

> I feel like we've benefited a lot from using a really boring standard format like JSON which has lots of tools & language support

Yup, I agree.

> so binary formats are fine if they have said tool & language support, but I'm not "up" on what binary formats are good - maybe protocol buffers are an option?

I would really like to avoid protobuf (e.g. https://twitter.com/fasterthanlime/status/1340944948582113282). On the Rust side, serde has support for a bunch of formats, but performance-wise it would be better to have something that doesn't require encoding/decoding for usage (zero-copy deserialization like Cap'n Proto, also used by mash, or rkyv, which is Rust-only), but that is not as flexible as JSON...

(Tree-buf looks REALLY interesting, but doesn't have support for other languages yet)

> alt, I wonder if we could have a database in a format that supports the kind of queries we want to do? e.g. in #821 I suggest sqlite.

Mixed feelings. I think it is a good idea when compared to using Zip files for databases, but not so sure about single signatures...

Relevant read: https://www.sqlite.org/affcase1.html
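
For concreteness, here is a minimal sketch (Python stdlib `sqlite3`) of what a SQLite-backed store answering "which sketches contain this hash?" could look like; the table and column names are made up for illustration and are not an actual sourmash schema:

```python
import sqlite3

conn = sqlite3.connect("sigs.sqlite")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS sketches (
        id INTEGER PRIMARY KEY,
        name TEXT,
        ksize INTEGER,
        scaled INTEGER
    );
    CREATE TABLE IF NOT EXISTS hashes (
        sketch_id INTEGER REFERENCES sketches(id),
        hashval INTEGER   -- caveat: SQLite integers are *signed* 64-bit
    );
    CREATE INDEX IF NOT EXISTS hashes_by_val ON hashes (hashval);
""")

-- the kind of query an index wants to answer efficiently:
cur = conn.execute(
    "SELECT DISTINCT sketch_id FROM hashes WHERE hashval = ?", (12345,))
print(cur.fetchall())
```

The signed-integer caveat matters for 64-bit hash values; a real implementation would need to remap or otherwise encode them.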

ctb commented 3 years ago

what about AVRO? https://avro.apache.org/

luizirber commented 3 years ago

> what about AVRO? https://avro.apache.org/

This is probably very easy to test, considering that https://github.com/flavray/avro-rs supports serde, so it would be a drop-in replacement in the current codebase.

I was looking more into the Arrow/Parquet direction, which would also make it easier to work with more data-analysis-like workflows (loading into pandas, and so on).

Another direction to consider: in #1221 I was using the bitmagic serialization/deserialization for saving nodegraphs, but it might also be a good representation for scaled minhash sketches (save a "compressed bitmap" of the hashes, instead of a list). bitmagic is not a good portable format, but I wonder if any of the options mentioned here support something along the lines of the bitmap idea. (this can make a GIGANTIC difference for very large sketches).

luizirber commented 3 years ago

I started playing with the easy ones (the formats supported by serde) in https://github.com/luizirber/2021-02-11-sourmash-binary-format, will report when I have more results.

ctb commented 2 years ago

thoughts stemming from all the manifest work that has happened:

between the recent introduction of StandaloneManifestIndex https://github.com/sourmash-bio/sourmash/pull/1891 and the hopefully-soon merge of SQLite manifests in #1808, we have an increasingly clean separation between metadata (manifests) and sketches (things containing actual hashvals). This separation would seem to make it easier to experiment with non-JSON formats in the primary code base.

there's also the idea of storing sketches in kProcessor kDataFrames or other k-mer-specialized formats.

ctb commented 2 years ago

side note: it would be neat to find ways of avoiding even reading or adding hashes (e.g. store them in bands https://github.com/sourmash-bio/sourmash/issues/1578, or hierarchically at different scaled values).
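
To make that concrete, here is a hypothetical sketch; the band count is arbitrary and none of this corresponds to actual sourmash code. It also sketches the hierarchical-scaled idea: since a scaled=1000 FracMinHash keeps exactly the hashes below 2**64 / 1000, a denser scaled=100 sketch only needs to store the extra hashes between the two thresholds.

```python
# hypothetical banding scheme: split the 64-bit hash space into fixed-width
# bands so a query can read only the bands it needs, not the whole sketch
NUM_BANDS = 16
BAND_WIDTH = 2**64 // NUM_BANDS

def split_into_bands(mins):
    bands = {}
    for h in sorted(mins):
        bands.setdefault(h // BAND_WIDTH, []).append(h)
    return bands

# hierarchical scaled values: hashes kept at scaled=1000 are a subset of the
# hashes kept at scaled=100, so only the "extra" ones need separate storage
def extra_hashes(mins_scaled_100, coarse_scaled=1000):
    threshold = 2**64 // coarse_scaled
    return [h for h in mins_scaled_100 if h >= threshold]
```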

ctb commented 2 years ago

briefly looked into Roaring Bitmaps,

https://roaringbitmap.org/about/

which has both rust and python bindings.

however, while the roaring library and roaring-rs both seem to support 64-bit numbers, pyroaring does not yet - https://github.com/Ezibenroc/PyRoaringBitMap/issues/58

update - also see https://pypi.org/project/roroaring64/ which supports deserialization but not serialization.

and also https://pypi.org/project/pilosa-roaring/ which primarily (only?) supports serialization and deserialization. not clear if it supports 64 bits.

and also https://github.com/sunzhaoping/python-croaring/ which is a cffi wrapper? but does not support 64 bits.
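
For illustration, one possible workaround for the 32-bit limitation on the Python side is to shard 64-bit hashes by their upper 32 bits into a dict of 32-bit bitmaps, which is roughly how 64-bit roaring implementations work internally. This is a hypothetical sketch using pyroaring, not a proposal for a portable format:

```python
from pyroaring import BitMap

class Roaring64:
    """Shard 64-bit values across 32-bit roaring bitmaps, keyed by high bits."""

    def __init__(self):
        self.shards = {}  # high 32 bits -> BitMap of low 32 bits

    def add(self, hashval):
        hi, lo = hashval >> 32, hashval & 0xFFFFFFFF
        self.shards.setdefault(hi, BitMap()).add(lo)

    def __contains__(self, hashval):
        hi, lo = hashval >> 32, hashval & 0xFFFFFFFF
        return hi in self.shards and lo in self.shards[hi]

    def serialize(self):
        # toy container: a list of (high bits, serialized 32-bit bitmap) pairs
        return [(hi, bm.serialize()) for hi, bm in sorted(self.shards.items())]
```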

luizirber commented 2 years ago

> which has both rust and python bindings.

I'll do a quick check on the rust one for mastiff, I really liked the API!

At the moment #2230 is using rkyv to serialize/deserialize the list of datasets containing a hash, and while that process is fast it is using a regular BTreeMap from the Rust stdlib, which doesn't save much space.

(rkyv is fast, but it has its own binary format, which precludes using it in other languages. roaring bitmaps are well supported in many languages)

luizirber commented 2 years ago

Seems like roaring is smaller and faster than rkyv on a first test, will try more extensive benchmarks soon.

branch: https://github.com/sourmash-bio/sourmash/compare/lirber/mastiff...lirber/mastiff_roaring

luizirber commented 2 years ago

> Seems like roaring is smaller and faster than rkyv on a first test, will try more extensive benchmarks soon.
>
> branch: lirber/mastiff...lirber/mastiff_roaring

Caveat on roaring: it only stores presence/absence, so it doesn't work as a replacement for abundance. But I think we can still use a Vec for storing abundances, and call mins.rank(hash) to get the position to get/set in the abundances Vec.
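
A minimal sketch of that rank trick in Python, using pyroaring purely for illustration (its `BitMap` is 32-bit, while sourmash hashes are 64-bit); `rank(h)` returns the number of stored values <= h, so the index into the parallel abundance list is `rank(h) - 1`:

```python
from pyroaring import BitMap

mins = BitMap()   # presence/absence of hashes, compressed
abunds = []       # abundances, kept in the same sorted order as `mins`

def add_hash(h, abundance=1):
    if h in mins:
        abunds[mins.rank(h) - 1] += abundance
    else:
        mins.add(h)
        abunds.insert(mins.rank(h) - 1, abundance)

def get_abundance(h):
    return abunds[mins.rank(h) - 1] if h in mins else 0

add_hash(42); add_hash(7); add_hash(42)
assert get_abundance(42) == 2 and get_abundance(7) == 1
```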

ctb commented 1 year ago

Using the plugin architecture in #2428, I put together a small, minimally functional Avro reader/writer here.

It works! Which seems like a good start 😆

It's in Python so it's really more for demo and exploration than for production, of course, but it's a piece of the puzzle.
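
For reference, here is roughly what writing and reading Avro from Python looks like with the `fastavro` library; the schema and field names below are made up for illustration and are not the plugin's actual schema (note also that Avro's `long` is signed 64-bit, so real 64-bit hash values would need some care):

```python
from io import BytesIO
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "name": "Sketch",
    "type": "record",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "ksize", "type": "int"},
        {"name": "scaled", "type": "int"},
        {"name": "mins", "type": {"type": "array", "items": "long"}},
    ],
})

records = [{"name": "example", "ksize": 31, "scaled": 1000,
            "mins": [12345, 67890]}]

buf = BytesIO()
writer(buf, schema, records)   # self-describing binary container
buf.seek(0)
for rec in reader(buf):
    print(rec["name"], len(rec["mins"]))
```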

ctb commented 1 year ago

flatbuffers: https://google.github.io/flatbuffers/

ctb commented 1 year ago

I thought this was a nice 'splainer about parquet vs avro - https://stackoverflow.com/questions/28957291/avro-vs-parquet - tl;dr avro is row-based, like CSV (but with more complicated rows); parquet is column based.

I think that means that parquet would be a better choice for manifests, where you might want to select on a few specific columns, while avro is essentially a replacement for JSON in the way we do things internally.
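
For example (a hedged sketch with pandas + pyarrow; the manifest columns here are hypothetical), a columnar format lets a reader pull just the columns a query touches:

```python
import pandas as pd
import pyarrow.parquet as pq

# write a toy "manifest" with made-up columns
manifest = pd.DataFrame({
    "name": ["sigA", "sigB"],
    "ksize": [31, 21],
    "scaled": [1000, 1000],
    "filename": ["a.sig", "b.sig"],
})
manifest.to_parquet("manifest.parquet")

# read back only the columns needed for a selection
table = pq.read_table("manifest.parquet", columns=["name", "ksize"])
print(table.to_pandas())
```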

ctb commented 1 year ago

maybe relevant?

pandas 2.0 and the Arrow revolution (part I) - Marc Garcia

Pandas 2.0 is going to have an Apache Arrow backend for data. This is going to eventually be a pretty big deal for large or complex data analyses - and not just because it'll be faster and have better data-type and missing-value handling. It will mean the in-memory data representation is now compatible with (and can be used in place by) a wide range of other tools - databases (duckDB), analysis and plotting tools, file handling tools… Garcia goes much deeper into this.

(from RCT #158)
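
A quick illustration of what that backend looks like in practice, assuming pandas >= 2.0 and a stand-in CSV file name:

```python
import pandas as pd

# back the DataFrame with Arrow memory instead of NumPy
df = pd.read_csv("manifest.csv", dtype_backend="pyarrow")
print(df.dtypes)   # columns show up as e.g. int64[pyarrow], string[pyarrow]
```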

ctb commented 1 month ago

bincode? https://github.com/bincode-org/bincode

note: https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/455