Open ctb opened 3 years ago
+100
`HLL` and `Nodegraph` are also good candidates for different sketches, but I like the idea of using the sequence as a sketch type too!
as a side note, we could totally use `SqliteIndex` in #1808 as a signature storage for SBTs, but this breaks my brain a little at the moment.
but what I really came here to say was that storing FASTA/FASTQ in sqlite might be one way to go in terms of providing FASTA/FASTQ as a sketch type. In particular, using storage converters (see this and this) with gzip compression could work for efficient on-disk storage and retrieval of large FASTA sequences.
And, while thinking about that, it might actually make some vague sense to support sequence storage directly in `SqliteIndex` as optional columns in the `sketches` table. Then you would have both the hashes and the actual sequence in there 🤯, and would only "suffer" the file size and load time penalties when you used them.
also see sqlite-zstd: https://phiresky.github.io/blog/2022/sqlite-zstd/ - for in-database compression.
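A rough sketch of the gzip-compressed storage idea, using the adapter/converter hooks in Python's standard `sqlite3` module. The `Sequence` wrapper class, the `GZSEQ` declared column type, the `sequences` table, and the helper function names are all made up for this example; none of them come from sourmash or `SqliteIndex`.

```python
import gzip
import sqlite3


class Sequence:
    """Thin wrapper so sqlite3 can tell sequences apart from ordinary text."""

    def __init__(self, text):
        self.text = text


def compress_sequence(seq):
    # Adapter: Python object -> gzip-compressed bytes, stored as a BLOB.
    return gzip.compress(seq.text.encode("ascii"))


def decompress_sequence(blob):
    # Converter: BLOB bytes -> Python object, applied when the column is read.
    return Sequence(gzip.decompress(blob).decode("ascii"))


# "GZSEQ" is a hypothetical declared type name chosen for this sketch.
sqlite3.register_adapter(Sequence, compress_sequence)
sqlite3.register_converter("GZSEQ", decompress_sequence)


def open_sequence_db(path=":memory:"):
    # PARSE_DECLTYPES makes sqlite3 pick converters by declared column type.
    conn = sqlite3.connect(path, detect_types=sqlite3.PARSE_DECLTYPES)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sequences "
        "(name TEXT PRIMARY KEY, sequence GZSEQ)"
    )
    return conn


def store_sequence(conn, name, text):
    conn.execute("INSERT INTO sequences VALUES (?, ?)", (name, Sequence(text)))


def load_sequence(conn, name):
    row = conn.execute(
        "SELECT sequence FROM sequences WHERE name = ?", (name,)
    ).fetchone()
    return None if row is None else row[0].text
```

The nice part of doing it this way is that compression stays invisible to callers: they hand in and get back plain strings, and only queries that actually select the `sequence` column pay the decompression cost.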
Implemented a fun little hack here: https://github.com/ctb/sourmash_plugin_load_from_fasta
It supports loading FASTA/FASTQ files as `Index` objects, and does lazy sketching, i.e. it sketches only when a signature is actually requested.
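To make "lazy sketching" concrete, here is a toy, standard-library-only illustration. It is not the plugin's code: `LazyFastaIndex`, `read_fasta`, and `toy_sketch` are hypothetical names, the sketch is a simple bottom-k set of MD5-derived k-mer hashes rather than a sourmash `MinHash`, and only FASTA is handled.

```python
import hashlib


def read_fasta(path):
    """Yield (name, sequence) records from a FASTA file."""
    name, chunks = None, []
    with open(path) as fp:
        for line in fp:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:], []
            elif line:
                chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def toy_sketch(sequence, ksize=21, num=500):
    """Bottom-`num` k-mer hashes; a stand-in for a real MinHash sketch."""
    hashes = {
        int.from_bytes(
            hashlib.md5(sequence[i:i + ksize].encode()).digest()[:8], "big"
        )
        for i in range(len(sequence) - ksize + 1)
    }
    return sorted(hashes)[:num]


class LazyFastaIndex:
    """Index-like wrapper that defers all sketching until iteration."""

    def __init__(self, path, ksize=21):
        # Creating the index only records where the data lives.
        self.path = path
        self.ksize = ksize
        self.sketch_calls = 0  # counter, kept only to show when work happens

    def signatures(self):
        # Generator: each record is read and sketched only when requested.
        for name, sequence in read_fasta(self.path):
            self.sketch_calls += 1
            yield name, toy_sketch(sequence, self.ksize)
```

Constructing `LazyFastaIndex` costs nothing; the file is read and sketched only as `signatures()` is consumed, so a search that stops early never sketches the remaining records.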
I've been digging into some `Storage` stuff, and thinking about:
- `SourmashSignature` and how we treat signatures and `MinHash` objects as 1:1
- `MinHash` (ksize, moltype, etc.)
and also how right now we really have no actual unique handle for either a `SourmashSignature` or a `MinHash` object, since the `md5sum` is calculated on the `MinHash` object and doesn't take the signature name into account.
In https://github.com/sourmash-bio/sourmash/issues/616 we talk about how signatures and `MinHash` objects are tightly tied together pretty clearly, but the situation has not been improved by selectors and manifests and picklists ;).
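The missing-handle problem can be shown with a simplified model. The assumption here is that the sketch digest covers only the ksize and the hash values, which is a simplification of whatever sourmash actually computes; `sketch_md5` and `signature_handle` are hypothetical helpers, not sourmash functions.

```python
import hashlib


def sketch_md5(ksize, hashes):
    """Digest over sketch content only (assumed: ksize plus sorted hashes)."""
    md5 = hashlib.md5()
    md5.update(str(ksize).encode())
    for h in sorted(hashes):
        md5.update(str(h).encode())
    return md5.hexdigest()


def signature_handle(name, ksize, hashes):
    """Hypothetical handle that also folds in the signature name."""
    md5 = hashlib.md5()
    md5.update(name.encode())
    md5.update(b"\0")  # separator so name and digest cannot run together
    md5.update(sketch_md5(ksize, hashes).encode())
    return md5.hexdigest()


# Two differently named signatures with identical sketch content:
same_digest = sketch_md5(31, [10, 20, 30]) == sketch_md5(31, [30, 20, 10])
distinct_handles = (
    signature_handle("genome A", 31, [10, 20, 30])
    != signature_handle("genome B", 31, [10, 20, 30])
)
```

Under that model, two signatures with different names but identical hashes share one digest, so the digest alone cannot tell them apart; a handle that includes the name can.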
This also all gets in the way of storing related `MinHash` objects in a single `SourmashSignature` / leaf node in an SBT per https://github.com/sourmash-bio/sourmash/issues/198.
And, more generally, this also prevents us from supporting multiple different sketch types. We don't really have any yet (beyond `num`/`scaled` signatures, and maybe `noabund`/`abund`), but it would be nice to support it, which was the goal of https://github.com/sourmash-bio/sourmash/issues/1514.
So I'm thinking about slowly moving in the following direction:
- `SourmashSignature` will become a collection of different sketch types calculated from the same underlying sequence data, and the best one for a given comparison will be chosen when a comparison is requested.
- `Storage` will support saving and loading `SourmashSignatures` of this type, but a storage location will contain at most one `SourmashSignature` (and one or more sketches under that signature).
- `Storage` then becomes something that stores collections of signatures while `Index` structures like SBT and revindex move towards being a fast search index for some types of sketches in those signatures, e.g. sketches of a particular ksize/moltype. But then you can use those search indices to pull up the full `SourmashSignature` which will let you transition between different sketches on the same signature (ksize, moltype, etc.)
- `SourmashSignature` in order to find `SourmashSignature`s with compatible operations available.
One end result would be that things like `MinHash` and `select` would become much less visible at the top level in the code.
A hack I was thinking of implementing is the idea of a sequence as a sketch type, where we can store actual FASTA sequences and/or collections of k-mers as a signature. It sounds kinda stupid, but could be a good proof of concept in the current absence of different sketches.
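A minimal sketch of what "a signature as a collection of sketches, with the sequence as one sketch type" might look like. Everything here is hypothetical (`SketchKey`, `MultiSketchSignature`, the `"sequence"` and `"kmers"` type labels) and uses exact k-mer sets instead of real sourmash sketches; it only illustrates the shape of the idea.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SketchKey:
    sketch_type: str       # e.g. "kmers" or "sequence" (labels invented here)
    ksize: Optional[int]   # None for sketches not tied to one ksize
    moltype: str


def kmers(sequence, ksize):
    """Exact set of k-mers in a sequence."""
    return {sequence[i:i + ksize] for i in range(len(sequence) - ksize + 1)}


@dataclass
class MultiSketchSignature:
    name: str
    sketches: dict = field(default_factory=dict)

    def add(self, key, sketch):
        self.sketches[key] = sketch

    def kmer_set(self, ksize, moltype="DNA"):
        """Use a stored k-mer sketch if present, else derive it from sequence."""
        key = SketchKey("kmers", ksize, moltype)
        if key in self.sketches:
            return self.sketches[key]
        seq_key = SketchKey("sequence", None, moltype)
        if seq_key in self.sketches:
            return kmers(self.sketches[seq_key], ksize)
        raise KeyError(f"no sketch usable for ksize={ksize}, moltype={moltype}")

    def jaccard(self, other, ksize, moltype="DNA"):
        # The caller names the comparison; each signature picks its own sketch.
        a = self.kmer_set(ksize, moltype)
        b = other.kmer_set(ksize, moltype)
        union = a | b
        return len(a & b) / len(union) if union else 0.0


# One signature stores only the raw sequence, the other only k=4 k-mers.
sig_a = MultiSketchSignature("a")
sig_a.add(SketchKey("sequence", None, "DNA"), "ACGTACGT")
sig_b = MultiSketchSignature("b")
sig_b.add(SketchKey("kmers", 4, "DNA"), {"ACGT", "CGTA"})
similarity = sig_a.jaccard(sig_b, ksize=4)
```

Here the caller never touches a sketch object or a selector directly: it asks for a comparison at a given ksize/moltype, and the signature with the stored sequence can satisfy any ksize while the one with precomputed k-mers can only satisfy the ksize it was built with.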