sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

some thoughts on saving/loading/selecting `SourmashSignature` #1647

Open ctb opened 3 years ago

ctb commented 3 years ago

I've been digging into some Storage stuff, and thinking about:

and also how right now we really have no actual unique handle for either a SourmashSignature or a MinHash object, since the md5sum is calculated on the MinHash object and doesn't take the signature name into account.
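To make the handle problem concrete, here is a minimal sketch in plain Python. It does not use sourmash itself: `minhash_md5` is only loosely modeled on the idea of an md5 over sketch contents, and `signature_handle` is a hypothetical alternative that mixes in the name.

```python
import hashlib


def minhash_md5(ksize, hashes):
    """md5 over sketch contents only (ksize plus sorted hash values)."""
    m = hashlib.md5()
    m.update(str(ksize).encode("ascii"))
    for h in sorted(hashes):
        m.update(str(h).encode("ascii"))
    return m.hexdigest()


def signature_handle(name, ksize, hashes):
    """Hypothetical handle that also takes the signature name into account."""
    m = hashlib.md5()
    m.update(name.encode("utf-8"))
    m.update(b"\0")  # separator between the name and the sketch digest
    m.update(minhash_md5(ksize, hashes).encode("ascii"))
    return m.hexdigest()


# Two signatures with different names but identical sketch contents.
hashes = {11, 42, 97}
a = ("genome A", 31, hashes)
b = ("genome B", 31, hashes)

# The content-only md5 cannot tell them apart...
same_md5 = minhash_md5(a[1], a[2]) == minhash_md5(b[1], b[2])
# ...while a handle that includes the name can.
distinct_handles = signature_handle(*a) != signature_handle(*b)
```

Whether the name belongs in the handle at all is a design question; the point here is only that a content-only digest collides for differently named signatures.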

In https://github.com/sourmash-bio/sourmash/issues/616 we lay out pretty clearly how tightly signatures and MinHash objects are tied together, but the situation has not been improved by selectors, manifests, and picklists ;).

This also all gets in the way of storing related MinHash objects in a single SourmashSignature / leaf node in an SBT per https://github.com/sourmash-bio/sourmash/issues/198.

And, more generally, this also prevents us from supporting multiple different sketch types. We don't really have any yet (beyond num/scaled signatures, and maybe noabund/abund), but it would be nice to support them, which was the goal of https://github.com/sourmash-bio/sourmash/issues/1514.

So I'm thinking about slowly moving in the following direction:

One end result would be that things like MinHash and select would become much less visible at the top level in the code.

A hack I was thinking of implementing is the idea of a sequence as a sketch type, where we can store actual FASTA sequences and/or collections of k-mers as a signature. It sounds kinda stupid, but could be a good proof of concept in the current absence of different sketches.

luizirber commented 3 years ago

+100

HLL and Nodegraph are also good candidates for different sketches, but I like the idea of using the sequence as a sketch type too!

ctb commented 2 years ago

as a side note, we could totally use SqliteIndex in #1808 as a signature storage for SBTs, but this breaks my brain a little at the moment.

ctb commented 2 years ago

but what I really came here to say was that storing FASTA/FASTQ in sqlite might be one way to go in terms of providing FASTA/FASTQ as a sketch type. In particular, using storage converters (see this and this) with gzip compression could work for efficient on-disk storage and retrieval of large FASTA sequences.
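A rough sketch of what that could look like, assuming "storage converters" means the adapter/converter hooks in Python's standard `sqlite3` module. The `Sequence` wrapper, the `gzseq` declared type, and the `sequences` table are all made up for illustration and are not part of sourmash.

```python
import gzip
import sqlite3


class Sequence:
    """Hypothetical wrapper marking a string for compressed storage."""

    def __init__(self, text):
        self.text = text


def adapt_sequence(seq):
    # Python -> SQLite: store the sequence as a gzip-compressed BLOB.
    return gzip.compress(seq.text.encode("ascii"))


def convert_sequence(blob):
    # SQLite -> Python: decompress transparently on retrieval.
    return Sequence(gzip.decompress(blob).decode("ascii"))


sqlite3.register_adapter(Sequence, adapt_sequence)
sqlite3.register_converter("gzseq", convert_sequence)

# PARSE_DECLTYPES makes sqlite3 pick the converter from the declared column type.
conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
conn.execute("CREATE TABLE sequences (name TEXT PRIMARY KEY, seq gzseq)")

record = Sequence("ACGT" * 1000)
conn.execute("INSERT INTO sequences (name, seq) VALUES (?, ?)", ("contig1", record))

(loaded,) = conn.execute(
    "SELECT seq FROM sequences WHERE name = ?", ("contig1",)
).fetchone()
# length() is an expression, so no converter runs and we get the raw byte count.
(stored_bytes,) = conn.execute("SELECT length(seq) FROM sequences").fetchone()
```

Here `loaded.text` round-trips to the original 4000-character string while `stored_bytes` is the much smaller compressed size. Real sequences compress far less well than this repetitive toy input.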

And, while thinking about that, it might actually make some vague sense to support sequence storage directly in `SqliteIndex` as optional columns in the sketches table. Then you would have both the hashes and the actual sequence in there 🤯, and would only "suffer" the file size and load time penalties when you used them.

ctb commented 2 years ago

also see sqlite-zstd: https://phiresky.github.io/blog/2022/sqlite-zstd/ - for in-database compression.

ctb commented 1 year ago

Implemented a fun little hack here: https://github.com/ctb/sourmash_plugin_load_from_fasta

It supports loading FASTA/FASTQ files as Index objects, and does lazy sketching, i.e. sketches only when a signature is actually requested.
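For readers who want the gist of lazy sketching without reading the plugin, here is a toy illustration in plain Python. It is not the plugin's code and does not use the sourmash `Index` API: `LazyFastaIndex`, `toy_sketch`, and `read_fasta` are hypothetical names, and the "sketch" is just a set of k-mer hashes below a cutoff.

```python
import hashlib
import io


def toy_sketch(sequence, ksize=5, scaled=2):
    """Toy stand-in for a scaled sketch: keep k-mer hashes below a cutoff."""
    cutoff = 2**64 // scaled
    kept = set()
    for i in range(len(sequence) - ksize + 1):
        kmer = sequence[i:i + ksize].encode("ascii")
        h = int.from_bytes(hashlib.md5(kmer).digest()[:8], "big")
        if h < cutoff:
            kept.add(h)
    return kept


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


class LazyFastaIndex:
    """Hypothetical index: loads sequences up front, sketches on demand."""

    def __init__(self, handle):
        self._records = dict(read_fasta(handle))
        self._cache = {}
        self.sketch_calls = 0  # counts how often sketching actually ran

    def names(self):
        return list(self._records)

    def get_sketch(self, name):
        # Sketch only on first request, then reuse the cached result.
        if name not in self._cache:
            self.sketch_calls += 1
            self._cache[name] = toy_sketch(self._records[name])
        return self._cache[name]


fasta = io.StringIO(">seq1\nACGTACGTAC\nGGTTA\n>seq2\nTTTTTGGGGGCCCCC\n")
index = LazyFastaIndex(fasta)
calls_after_load = index.sketch_calls  # nothing has been sketched yet
first = index.get_sketch("seq1")
second = index.get_sketch("seq1")  # served from the cache
```

Loading the file costs only parsing; `seq2` is never sketched because nothing asked for it.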