It's kind of janky, but the underlying idea is sound, and probably something that people do more robustly with clever Bloom filter indexing schemes:

- take a sketch with many hashes
- subdivide it into sketches of (say) 10,000 hashes each, as needed, all with the same signature name and filename
- save those to a zip file as independent sketches
- use `sourmash index` to index that zip file into an SBT
You should end up with a much more balanced SBT, and as long as you don't do any thresholding and only care about the name/filename match, this technique should work. Those are big caveats, admittedly...
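As a rough illustration of the subdivision step, here's a minimal Python sketch. It does not use the sourmash API; `chunk_hashes` and the 10,000-hash chunk size are just illustrative assumptions standing in for the real chunking code:

```python
# Illustrative sketch of the subdivision step: split one large set of
# hash values into fixed-size chunks that would all be saved under the
# same signature name and filename, so a later name/filename match works
# no matter which chunk the query hits. NOT the sourmash API, just the idea.

def chunk_hashes(hashes, chunk_size=10_000):
    """Split hash values into chunks of at most chunk_size each."""
    ordered = sorted(hashes)
    return [ordered[i:i + chunk_size]
            for i in range(0, len(ordered), chunk_size)]

# Hypothetical example: a "sketch" with 25,000 hashes becomes three
# chunks of 10,000, 10,000, and 5,000 hashes.
chunks = chunk_hashes(range(25_000))
print([len(c) for c in chunks])  # → [10000, 10000, 5000]
```

Each chunk would then be wrapped in its own signature and written to the zip file for `sourmash index` to pick up.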
For PFAM (k=10, scaled=1) with 19k sequence files, we have 427,150 sketches under 10k in size. Building the SBT takes about 60 GB and 300 minutes of compute time, with 858,851 total nodes to save.
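As a quick sanity check on that node count (assuming the SBT is a binary tree, which I'm taking as given here): a full binary tree with n leaves has 2n - 1 nodes, and for n = 427,150 leaf sketches that lands in the same ballpark as the reported total.

```python
# Sanity check on the SBT node count, assuming a binary tree:
# a full binary tree with n leaves has 2n - 1 nodes.
n_leaves = 427_150
max_nodes = 2 * n_leaves - 1
print(max_nodes)  # → 854299, close to the 858,851 nodes reported
```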
Over in https://github.com/ctb/2022-sourmash-chunk-sigs, I'm trying out this indexing scheme for large databases with an extremely uneven size distribution of sketches (PFAM, in this case).