Closed ctb closed 3 years ago
there's another issue out there about the two different rust implementations of hash storage that I can't find at the moment, that could factor into this.
Probably https://github.com/dib-lab/sourmash/pull/1045?
But overall +1000 on this. We can unlock a lot of cool optimizations if we don't need to mutate the MinHash
sketches, from reducing memory needed to store the data, to faster intersection/union calculations, and as #1045 showed, we can also optimize for change-heavy code without undoing the optimization from read-heavy code.
actually this one https://github.com/dib-lab/sourmash/issues/1055, linked from #1045.
This is another one (like #1354) that can benefit from refactoring done at the Python layer to then guide the way to improved Rust design.
this was done in #1508, which continues to be a mostly innocuous change so far! I'm going to leave this open for a bit tho, to see if there's more to be done.
closing - #1508 hasn't really caused any problems.
As I write more/deeper code around search and prefetch and so on, I am starting worry about accidentally modifying
MinHash
objects.In brief -
MinHash
objects loaded from signature files can and should not be changed in any way - my suspicion is that in most of the code, immutableMinHash
objects are probably the right thing to be used!flatten
anddownsample
, which could similarly return immutableMinHash
objects.MinHash
- in addition to preventing/discovering bugs, presumably there are potential performance improvements on the Python side, and I'm also guessing that defaulting to immutableMinHash
objects over on the rust side could yield massive performance benefits.related to the idea of diversifying
MinHash
objects to differentiate betweennum
andscaled
objects https://github.com/dib-lab/sourmash/issues/1354there's another issue out there about the two different rust implementations of hash storage that I can't find at the moment, that could factor into this.