Open ctb opened 9 months ago
note connection to suggestions at bottom of https://github.com/sourmash-bio/sourmash/issues/416
I have questions - are the dataclasses caching results, or recomputing them multiple times?? -- and the people demand answers!
OK, https://github.com/sourmash-bio/sourmash/pull/2962 tackles this for just `sourmash gather` and `multigather`.
`PrefetchResult` is immediately released, so `sourmash prefetch` doesn't suffer from this problem.
looks like `SearchResult` may run afoul of this, however. But it's used in relatively minimal ways so far.
https://github.com/sourmash-bio/sourmash/pull/2962 addresses the memory usage, but not the underlying problem. From the PR:
Ultimately, a better fix is needed - probably one that changes up the dataclasses so that they don't store MinHashes - but such a fix is beyond me at the moment.
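One possible shape for such a change, as a rough sketch only: a dataclass can take the sketches as `InitVar` arguments, compute its statistics in `__post_init__`, and never store the sketches as fields. `SlimResult` and its attributes are hypothetical names for illustration, and plain `frozenset` objects stand in for MinHashes; none of this is the sourmash API.

```python
from dataclasses import InitVar, dataclass, field, fields


@dataclass
class SlimResult:
    """Hypothetical result record: keeps summary numbers, not the sketches."""

    # InitVar arguments are handed to __post_init__ but are not stored as fields.
    query_hashes: InitVar[frozenset]
    match_hashes: InitVar[frozenset]
    name: str = ""
    intersect_size: int = field(init=False)
    containment: float = field(init=False)

    def __post_init__(self, query_hashes, match_hashes):
        # Compute everything that needs the hashes right away, then let them go.
        common = query_hashes & match_hashes
        self.intersect_size = len(common)
        self.containment = len(common) / len(query_hashes) if query_hashes else 0.0


# frozensets of integers stand in for MinHash sketches (an assumption).
query = frozenset(range(0, 1000))
match = frozenset(range(500, 1500))
result = SlimResult(query, match, name="example")

# Only the small scalar fields are part of the stored record.
stored = [f.name for f in fields(result)]
```

The trade-off is that any statistic needing the sketches has to be computed up front, since the record cannot go back to them later.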
over in https://github.com/ctb/2024-calc-full-gather/ I have implemented a simple script that takes fastgather output (from https://github.com/sourmash-bio/sourmash_plugin_branchwater/) and turns it into full gather output without redoing the searches - it literally just trusts the rank and match information from fastgather completely, and calculates all the stats.
this was easier than I expected because of the very nice `GatherResult` refactoring that @bluegenes did a while back in #1955!
however it also revealed that #1955 probably added significantly to the memory footprint of gather, because the `GatherResult` dataclasses keep sketches in memory and they are retained throughout the full gather process.
I figured this out when I noticed that my `calc-full-gather` script was running out of memory in the same way that `gather` was running out of memory, and in https://github.com/ctb/2024-calc-full-gather/commit/a09215ec5b70401e95c0348ba64ede11a1bb9b33 I fixed it by discarding the `GatherResult` objects after each result. It's now nice and low memory (if not exactly fast ;) - see https://github.com/sourmash-bio/sourmash/pull/2943.
I am also wondering if perhaps `PrefetchResult` has the same problem in `prefetch`?
We should fix the gather code in sourmash to be lower memory.
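To illustrate the "discard after each result" idea in general terms, here is a minimal sketch. `HeavyResult`, `iter_results`, and `collect_rows` are hypothetical names, and a `frozenset` stands in for a sketch; this is not the actual code from the commit above.

```python
from dataclasses import dataclass


@dataclass
class HeavyResult:
    """Hypothetical stand-in for a result object that holds a large sketch."""

    rank: int
    sketch: frozenset  # large payload, standing in for a MinHash

    def to_row(self):
        # Only small scalar values end up in the output row.
        return {"rank": self.rank, "n_hashes": len(self.sketch)}


def iter_results(n_results, sketch_size):
    """Yield one result at a time instead of building a list of them."""
    for rank in range(n_results):
        start = rank * sketch_size
        yield HeavyResult(rank, frozenset(range(start, start + sketch_size)))


def collect_rows(results):
    """Keep only the extracted rows; the result objects are not retained."""
    rows = []
    for result in results:
        rows.append(result.to_row())
        # `result` is rebound on the next iteration, so the previous object
        # and its sketch can be freed once its row has been extracted.
    return rows


rows = collect_rows(iter_results(n_results=5, sketch_size=1000))
```

The key point is that nothing appended to `rows` refers back to the result object, so at most one sketch needs to be alive at a time.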
We probably need to do some kind of regression testing that tracks memory usage and the like, too.
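As a starting point for that kind of check, the standard library's `tracemalloc` can report peak traced memory for a function call. The sketch below is an assumption about how such a test might look, with hypothetical helper names and toy workloads rather than real gather runs. Note that `tracemalloc` only sees allocations made through Python's allocator, so memory held by native extension code would need a different measurement.

```python
import tracemalloc


def peak_memory_bytes(func, *args, **kwargs):
    """Run func and return the peak memory traced by tracemalloc, in bytes."""
    tracemalloc.start()
    try:
        func(*args, **kwargs)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def keep_everything(n, size):
    # Toy workload: retains every chunk until the end.
    kept = [list(range(size)) for _ in range(n)]
    return len(kept)


def keep_one_at_a_time(n, size):
    # Toy workload: each chunk is dropped before the loop moves on.
    total = 0
    for _ in range(n):
        chunk = list(range(size))
        total += len(chunk)
    return total


peak_all = peak_memory_bytes(keep_everything, 50, 10_000)
peak_one = peak_memory_bytes(keep_one_at_a_time, 50, 10_000)
```

A regression test could then assert that the peak for a fixed input stays under a chosen budget, or under some multiple of a recorded baseline.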
viz https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/187, https://github.com/sourmash-bio/sourmash/pull/2943.