sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
446 stars 78 forks source link

update `GatherResult` to no longer store MinHashes #2950

Open ctb opened 5 months ago

ctb commented 5 months ago

over in https://github.com/ctb/2024-calc-full-gather/ I have implemented a simple script that takes fastgather output (from https://github.com/sourmash-bio/sourmash_plugin_branchwater/) and turns it into full gather output without redoing the searches - it literally just trusts the rank and match information from fastgather completely, and calculates all the stats.

this was easier than I expected because of the very nice GatherResult refactoring that @bluegenes did a while back in #1955!

however it also revealed that #1955 probably added significantly to the memory footprint of gather, because the GatherResult dataclasses keep sketches in memory and they are retained throughout the full gather process.

I figured this out when I noticed that my calc-full-gather script was running out of memory in the same way that gather was running out of memory, and in https://github.com/ctb/2024-calc-full-gather/commit/a09215ec5b70401e95c0348ba64ede11a1bb9b33 I fixed it by discarding the GatherResult objects after each result. It's now nice and low memory (if not exactly fast ;) - see https://github.com/sourmash-bio/sourmash/pull/2943.

I am also wondering if perhaps PrefetchResult has the same problem in prefetch?

We should fix the gather code in sourmash to be lower memory.

We probably need to do some kind of regression testing that tracks memory usage and the like, too.

viz https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/187, https://github.com/sourmash-bio/sourmash/pull/2943.

ctb commented 5 months ago

note connection to suggestions at bottom of https://github.com/sourmash-bio/sourmash/issues/416

ctb commented 5 months ago

I have questions - are the dataclasses caching results, or recomputing them multiple times?? -- and the people demand answers!

ctb commented 5 months ago

OK, https://github.com/sourmash-bio/sourmash/pull/2962 tackles this for just sourmash gather and multigather

ctb commented 5 months ago

PrefetchResult is immediately released, so sourmash prefetch doesn't suffer from this problem.

looks like SearchResult may run afoul of this, however. But it's used in relatively minimal ways so far.

ctb commented 5 months ago

https://github.com/sourmash-bio/sourmash/pull/2962 addresses the memory usage, but not the underlying problem. From the PR:

Ultimately, a better fix is needed - probably one that changes up the dataclasses so that they don't store MinHashes - but such a fix is beyond me at the moment.