sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

memory profiling of `sourmash gather` - FYI #2340

Open ctb opened 1 year ago

ctb commented 1 year ago
mprof run sourmash gather SRR606249.k31.sig.gz /group/ctbrowngrp/sourmash-db/gtdb-rs207/gtdb-rs207.genomic-reps.dna.k31.zip --save-prefetch-csv p2.csv -o g2.csv
mprof plot -o ~/transfer/gather.png

shows three phases:

[image: gather.png — mprof memory profile showing the three phases]

ctb commented 1 year ago

it looks like the Rust ZipStorage implementation might be the cause of the steady increase in memory. I generated the graphs below with mprof by running prefetch() over ~2000 sketches in a ZipFileLinearIndex, using a query with no matches. With the current default read-only ZipStorage implementation, I get the first graph; when I force the use of _RwZipStorage in sbt_storage.py, I get the second graph.

A quick perusal of the Rust code in src/core/src/storage.rs doesn't turn up any obvious caching or anything like it. @luizirber, any ideas?
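(For reference, one hypothetical way to force the Python storage for a comparison like this — the issue doesn't show how the switch was actually made, and the monkey-patch below assumes _RwZipStorage is interface-compatible with ZipStorage and that it runs before any index is opened:)

# hypothetical sketch: swap the pure-Python _RwZipStorage in for the
# default ZipStorage. both names come from sourmash/sbt_storage.py;
# interface compatibility is assumed, not verified.
import sourmash.sbt_storage as sbt_storage

sbt_storage.ZipStorage = sbt_storage._RwZipStorage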

rs zip storage

[image: rs-zipstorage — mprof profile with the default Rust ZipStorage]

py zip storage

[image: py-zipstorage — mprof profile with the Python _RwZipStorage]

ctb commented 1 year ago

I note the complicated discussion of memory mapping in the Rust ZipStorage PR. Could this behavior be related to that?
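(For context — my illustration, not something from the PR: pages of a memory-mapped file count toward RSS as they are touched, even though they are clean page cache the kernel can reclaim at any time, so a reader walking an mmap'd zip file can look like a steady leak under an RSS-based profiler. A toy demonstration:)

# toy demo: touching pages of a memory-mapped file inflates RSS, even
# though the pages are reclaimable page cache. file name and size are
# illustrative.
import mmap, os, psutil

FNAME = "big.bin"
with open(FNAME, "wb") as f:
    f.write(b"\0" * (100 * 1024 * 1024))   # 100 MB of real (non-sparse) data

proc = psutil.Process()
with open(FNAME, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
    before = proc.memory_info().rss
    total = 0
    for off in range(0, len(m), 4096):     # read one byte per 4 KB page
        total += m[off]
    after = proc.memory_info().rss

print(f"RSS grew by {(after - before) / 1e6:.1f} MB")
os.remove(FNAME)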

ctb commented 1 year ago

more - I did use guppy3 (on the Python side), and the total Python heap usage, at least, does not change.
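(A minimal sketch of that kind of check — mine, since the original snippet isn't shown. hpy().heap() only counts objects owned by the Python allocator, so memory held on the Rust side wouldn't appear here, which is consistent with the heap staying flat while RSS grows:)

# minimal guppy3 heap snapshot; sees only Python-allocator objects,
# so native (Rust) allocations are invisible to it.
from guppy import hpy

h = hpy()
h.setrelheap()        # measure relative to this point
# ... run one prefetch iteration here ...
print(h.heap())       # live Python objects allocated since setrelheap()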

ctb commented 1 year ago

I dug a bit more this morning, and saw this in the mprof docs -

memory_profiler supports different memory tracking backends including: ‘psutil’, ‘psutil_pss’, ‘psutil_uss’, ‘posix’, ‘tracemalloc’. If no specific backend is specified the default is to use “psutil” which measures RSS aka “Resident Set Size”. In some cases (particularly when tracking child processes) RSS may overestimate memory usage

which linked to this section in the psutil docs:

uss (Linux, macOS, Windows): aka “Unique Set Size”, this is the memory which is unique to a process and which would be freed if the process was terminated right now.
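(To make the RSS/USS distinction concrete, here's a small sketch of mine reading both numbers for the current process with psutil; memory_full_info() is the call that exposes uss:)

# compare RSS and USS for the current process. per the psutil docs,
# uss is more expensive to compute than rss.
import psutil

info = psutil.Process().memory_full_info()
print(f"rss: {info.rss / 1e6:.1f} MB, uss: {info.uss / 1e6:.1f} MB")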

I ran my benchmarking script like so:

mprof run --backend psutil_uss python myprefetch.py

and got the same chart:

[image: mprof_uss — the same memory profile using the psutil_uss backend]

here's the code, which only takes 20 seconds or so to run:

# run in ~ctbrown/scratch/2022-pymagsearch

import sourmash
from sourmash import sourmash_args

# minimal stand-in for an argparse namespace, carrying only the picklist spec
class empty:
    pass
obj = empty()
obj.picklist = 'mf.csv::manifest'
picklist = sourmash_args.load_picklist(obj)

# load a single query signature and flatten it (discard abundances)
sigfile = 'gtdb-rs207-k31/af7ac805.k=31.scaled=1000.DNA.dup=0.63.sig'
ss = list(sourmash.load_file_as_signatures(sigfile)).pop()
ss = ss.to_mutable()
ss.minhash = ss.minhash.flatten()

dbfile = '/group/ctbrowngrp/sourmash-db/gtdb-rs207/gtdb-rs207.genomic-reps.dna.k31.zip'

# reload the index and run prefetch() twice, to see whether memory
# grows across iterations
for i in range(2):
    db = sourmash.load_file_as_index(dbfile)
    db = sourmash_args.apply_picklist_and_pattern(db, picklist, None)
    print("iteration:", i)
    print("matches: ", list(db.prefetch(ss, 50000)))

ctb commented 6 months ago

this is a good discussion/approach: using a small memory-capped container to verify that memory usage stays small 😆 https://github.com/sourmash-bio/sourmash/pull/1909#issuecomment-1092307414
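(As a rough in-process analogue of the container approach — my sketch, not what the linked PR does — one can cap the process's address space with resource.setrlimit on Unix, so allocations past the cap fail loudly instead of silently succeeding. The 500 MB limit is illustrative:)

# cap total address space; exceeding it raises MemoryError instead of
# silently growing. Unix-only; the 500 MB figure is illustrative.
import resource

limit = 500 * 1024 * 1024  # 500 MB, in bytes
resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

# ... now run the gather/prefetch workload under the cap ...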