subdivide sketches into equal sized chunks for better SBTs?

Over in https://github.com/ctb/2022-sourmash-chunk-sigs, I'm trying out an indexing scheme for large databases with extremely uneven size distribution of sketches (PFAM, in this case).

It's kind of janky, but the underlying idea is sound, and it's probably something that people do more robustly with clever Bloom filter indexing schemes:

- take a sketch with many hashes
- subdivide it into as many sketches of size (say) 10000 as needed, all with the same signature name and filename
- save those to a zip file as independent sketches
- use `sourmash index` to index that zip file into an SBT
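Here's a rough sketch of the subdivision step in plain Python, just to pin down the idea. It isn't the code from the repo above and doesn't use the sourmash API: `SketchChunk`, `chunk_sketch`, and the example name/filename are all made up for illustration, and a sketch is modeled as nothing more than a collection of integer hashes.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass(frozen=True)
class SketchChunk:
    """Hypothetical stand-in for one chunked sketch (not a sourmash class)."""
    name: str                 # same signature name for every chunk
    filename: str             # same source filename for every chunk
    hashes: Tuple[int, ...]   # at most chunk_size hashes


def chunk_sketch(name: str, filename: str, hashes: Iterable[int],
                 chunk_size: int = 10000) -> Iterator[SketchChunk]:
    """Split one large set of hashes into chunks of at most chunk_size."""
    # Deduplicate and sort so the chunks are disjoint and reproducible.
    ordered = sorted(set(hashes))
    for start in range(0, len(ordered), chunk_size):
        yield SketchChunk(name, filename,
                          tuple(ordered[start:start + chunk_size]))


# Toy input: 25000 fake hashes under a placeholder name and filename.
chunks = list(chunk_sketch("example-family", "example.sig", range(25000)))
print([len(c.hashes) for c in chunks])  # [10000, 10000, 5000]
```

Each chunk would then be written out as its own sketch in the zip file, so the only thing tying the chunks back together is the shared name and filename.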

You should end up with a much more balanced SBT, and as long as you don't do any thresholding and only care about the name/filename match, this technique should work. Those are big caveats, admittedly...
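To make the caveats concrete, here's a toy illustration (again plain Python, not sourmash; `aggregate_hits` and the data are hypothetical). A query can overlap several chunks of the same original sketch a little bit each, so any per-chunk threshold could throw away a match that is only apparent once the hits are summed by name/filename.

```python
from collections import defaultdict


def aggregate_hits(query_hashes, chunks):
    """Sum per-chunk overlaps by (name, filename).

    `chunks` is an iterable of (name, filename, hashes) tuples; chunks of
    the same original sketch are assumed to be disjoint.
    """
    query = set(query_hashes)
    totals = defaultdict(int)
    for name, filename, hashes in chunks:
        overlap = len(query & set(hashes))
        if overlap:
            totals[(name, filename)] += overlap
    return dict(totals)


# Toy data: one "big" sketch split into two chunks, plus an unrelated one.
chunks = [
    ("big", "big.sig", range(0, 10)),
    ("big", "big.sig", range(10, 20)),
    ("small", "small.sig", range(100, 105)),
]
query = range(5, 15)  # straddles both chunks of "big"

print(aggregate_hits(query, chunks))  # {('big', 'big.sig'): 10}
```

Each chunk of "big" only shares 5 hashes with the query, so a per-chunk cutoff of, say, 6 would drop both, even though the combined overlap is 10.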
