sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

how much memory does `sourmash compare` need? #2299

Open ctb opened 1 year ago

ctb commented 1 year ago

@kescobo ran into some out-of-memory errors when trying to do sourmash compare on 100k genomes, and we became curious about memory usage 😆

kevin:

One thing you could do is, if you know ~how much memory is needed for a given thing per signature, you could check what's available and then throw a warning or something, but that seems like a maintenance nightmare to keep that consistent across all of the different tools

me:

it is also somewhat unclear - depends on size of signature etc. we could figure it out I'm sure but… 🙂

me:

sort of surprised it crashed. 100k x 100k … that’s… 10 GB…?

kevin:

floats are 64 bits in numpy, right? So wouldn't it be x 64 ?

me:

8 bytes per float, so 80 GB. but we might use float32…. nope, it's float64. Not sure we need that much precision tho!

Kevin:

Ah right, bits vs bytes... I have 128 GB of memory on this machine, so it should have fit - but are any of the sketches held in memory? I have no idea how Python garbage collection works.

me:

I… you know I’m not sure myself. It depends on some internals that I don’t have handy in my memory.
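
For the record, the arithmetic from the exchange above works out as follows. A quick sketch, assuming the dense numpy float64 matrix discussed:

import numpy as np

n = 100_000  # number of signatures
matrix_bytes = n * n * np.dtype(np.float64).itemsize  # 8 bytes per float64
print(f"{matrix_bytes / 1e9:.0f} GB")  # 80 GB for the dense matrix alone

# with 128 GB total, ~48 GB is left for everything else; spread over 100k
# sketches, that is only ~480 KB each before memory runs out
print(f"{(128e9 - matrix_bytes) / n / 1e3:.0f} KB per sketch")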

ctb commented 1 year ago

some things we could do -

@ccbaumler this is a good benchmarking issue! some things to measure, and some things to fix!

kescobo commented 1 year ago

I should also say that I'm not 100% sure it was OOM, but it ran for quite some time, then I left for lunch, and when I returned my whole remote session had crashed (I was using wezterm connect).

OOM seems like the most likely explanation to me. But if sketches are held in memory, assuming ~80 GB for the matrix itself, and assuming there were some other processes running requiring some memory, 40 GB remaining / 100k sketches means sketches would only have to be on the order of 400 KB each in memory to cause a problem. Is this plausible?

Another idea would be to give the option of reading sketches from disk each time to save memory at the expense of speed, though if you don't have mmapped sketches and have to scan from the beginning each time, this could get real slow.
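
A minimal sketch of that idea (not sourmash's actual code path), assuming sourmash's load_file_as_signatures loader and the signatures' similarity method, with a memory-mapped output so that only two sketches plus one matrix page are ever in RAM. Re-reading every file O(n^2) times is exactly the slowdown mentioned above:

import numpy as np
import sourmash

def compare_from_disk(sig_paths, out_path="similarity.npy"):
    n = len(sig_paths)
    # memmapped result: the OS pages the n x n float64 matrix to disk,
    # so it never has to fit in RAM all at once
    mat = np.lib.format.open_memmap(out_path, mode="w+",
                                    dtype=np.float64, shape=(n, n))
    for i, path_i in enumerate(sig_paths):
        sig_i = next(iter(sourmash.load_file_as_signatures(path_i)))
        for j in range(i, n):
            # reload from disk every time: tiny memory, O(n^2) file reads
            sig_j = next(iter(sourmash.load_file_as_signatures(sig_paths[j])))
            mat[i, j] = mat[j, i] = sig_i.similarity(sig_j)
    mat.flush()
    return out_path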

ctb commented 1 year ago

this might also be a place where we just say ... "hey, have you considered using kSpider or gatherly?" cc @mr-eyes

vinisalazar commented 12 months ago

Hi,

I am trying to run sourmash compare on 1k metagenome signatures (ranging from 5 to 150 MB in size) and I am constantly getting OOM errors. I was going to open a new issue, but came across this one and thought I should comment.

Here are some configurations (RAM and number of processes) I've tried:

Any ideas on what might be happening? Do I just need to throw more memory at it?

Here are some example outputs of my job stats:

State: FAILED (exit code 1)
Cores: 1
CPU Utilized: 00:03:31
CPU Efficiency: 28.75% of 00:12:14 core-walltime
Job Wall-clock time: 00:12:14
Memory Utilized: 78.06 GB
Memory Efficiency: 81.32% of 96.00 GB

State: FAILED (exit code 1)
Nodes: 1
Cores per node: 64
CPU Utilized: 01:58:51
CPU Efficiency: 0.27% of 31-01:40:16 core-walltime
Job Wall-clock time: 11:39:04
Memory Utilized: 253.37 GB
Memory Efficiency: 98.97% of 256.00 GB

Thank you for any assistance you can provide, Vini

ctb commented 12 months ago

ahh, metagenome signatures 😱 . It's not completely surprising to me b/c I would guess that most of the memory usage is being taken up by just loading the sketches into memory. Still, ...suboptimal.

Some details that would be helpful in terms of providing guidance:

- what's the biggest and smallest sketch by file size? (output of ls -lS | head -2 and ls -lS | tail -2)
- for the biggest and smallest sketch file, what does sourmash sig summarize report for them?
- what's the command line you're using?

thanks!

One specific recommendation: raise the scaled value when doing compare, e.g. sourmash compare --scaled 10000 .... The results will not be significantly different for metagenomes of this size!
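
To preview what that buys you before re-running compare, something along these lines should work. A sketch; the file name is hypothetical, and it assumes scaled DNA sketches and sourmash's MinHash.downsample:

import sourmash

# hypothetical file name; any scaled=1000 metagenome sketch works here
sig = next(iter(sourmash.load_file_as_signatures("metagenome.sig")))
print(len(sig.minhash))  # number of hashes held at scaled=1000
smaller = sig.minhash.downsample(scaled=10000)
print(len(smaller))      # roughly 10x fewer hashes to load and compare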

You can also try pyo3_branchwater's multisearch command, which should be much more memory efficient, but this is still early-stage so, you know, buyer beware 😆 . The output is also not as convenient as sourmash compare's for viz, since it's just a sparse set of places where an above-threshold comparison was found.

Last but not least, I'm not sure where @mr-eyes' kSpider is, but it was built for this purpose, so: https://github.com/sourmash-bio/sourmash/issues/2271 and https://dib-lab.github.io/kSpider/

OK, actually last: check out https://github.com/sourmash-bio/sourmash/issues/2735. I'm challenged by the notion of using sourmash compare for metagenomes. 🤷

mr-eyes commented 12 months ago

The kSpider@dev branch has the most recent updates (no docs for it yet), but I am happy to help build/run it until it's released. Also, I think branchwater/multisearch can tackle this.

The way sourmash compare works will not be convenient except for a small number of small signatures.

mr-eyes commented 12 months ago

And here's a bonus script for converting branchwater/multisearch results to a Newick file. https://github.com/sourmash-bio/pyo3_branchwater/issues/111

mr-eyes commented 12 months ago

You can also modify the script in https://github.com/sourmash-bio/pyo3_branchwater/issues/111 to generate output similar to sourmash compare's. You will simply need to use pandas' to_csv, as sketched below.
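
Roughly, that pandas step could look like this. A sketch only; the column names are assumptions, so check the header of your multisearch CSV first:

import pandas as pd

df = pd.read_csv("multisearch.csv")  # sparse rows: one per above-threshold pair
dense = df.pivot(index="query_name", columns="match_name", values="jaccard")
dense = dense.fillna(0.0)  # pairs below the threshold are simply absent
dense.to_csv("compare_style_matrix.csv")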

vinisalazar commented 12 months ago

Hi @ctb and @mr-eyes, thank you for the detailed answers.

what's the biggest and smallest sketch by file size? (output of ls -lS | head -2 and ls -lS | tail -2)

# Biggest file
-rw-r--r-- 1 viniws punim1293 148413087 Sep 12 21:05 SRS2329668_T1.sig

# Smallest file
-rw-r--r-- 1 viniws punim1293   1443785 Sep 12 21:04 SRS954962_T1.sig

for the biggest and smallest sketch file, what does sourmash sig summarize report for them?

(bio) spartan ➜ sourmash sourmash sig summarize SRS2329668_T1.sig

== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'SRS2329668_T1.sig'
path filetype: MultiIndex
location: SRS2329668_T1.sig
is database? no
has manifest? yes
num signatures: 1
** examining manifest...
total hashes: 8530366
summary of sketches:
   1 sketches with DNA, k=31, scaled=1000             8530366 total hashes

(bio) spartan ➜ sourmash sourmash sig summarize SRS954962_T1.sig

== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'SRS954962_T1.sig'
path filetype: MultiIndex
location: SRS954962_T1.sig
is database? no
has manifest? yes
num signatures: 1
** examining manifest...
total hashes: 82957
summary of sketches:
   1 sketches with DNA, k=31, scaled=1000             82957 total hashes
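
(For scale: at 8 bytes per 64-bit hash, those totals put a floor on what each sketch costs in memory, ignoring any container overhead. A quick check:)

big_hashes, small_hashes = 8_530_366, 82_957
print(big_hashes * 8 / 1e6)    # ~68 MB of raw hashes for the biggest sketch
print(small_hashes * 8 / 1e6)  # ~0.7 MB for the smallest
# 1031 sketches at up to tens of MB each, plus per-object overhead and the
# dense similarity matrix, makes the memory numbers in this thread plausible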

what's the command line you're using?

time sourmash compare -p 24 --distance-matrix --csv sourmash_compare.csv -o sourmash_compare.out *.sig

== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 1031 signatures total.

Created memmapped siglist
Initialized memmapped similarities matrix
Created similarity func
Calculated chunk size for multiprocessing
Initialized multiprocessing pool.imap
Killed

real    21m31.634s
user    19m53.965s
sys     1m37.449s

I will check out the other resources that you've listed and report back.

Best, Vini

vinisalazar commented 11 months ago

In the end it took me 800 GB for 1123 metagenomes:

Cores per node: 16
CPU Utilized: 14:03:43
CPU Efficiency: 20.23% of 2-21:29:52 core-walltime
Job Wall-clock time: 04:20:37
Memory Utilized: 792.52 GB

ctb commented 11 months ago

wow! thank you for letting us know!

ctb commented 7 months ago

@kescobo @vinisalazar the branchwater plugin for sourmash now has a pairwise command (https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/181) that is both multithreaded and (judging by other benchmarks) likely to use 10-100x less memory than sourmash compare.

It is also now pretty straightforward to install from conda-forge, which is nice :).
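
For reference, the plugin's commands run under sourmash scripts, so the invocation should look something like the following (hedged; check sourmash scripts pairwise --help for the current flags):

sourmash scripts pairwise *.sig -o pairwise.csv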

I'm pretty sure @mr-eyes has a script to convert its output into a matrix format but I am unable to find it at the moment. Mo? (I don't think it's this one)

Anyway, just wanted to drop by and say this 😆 . We haven't fixed sourmash compare yet, but ...eventually...

mr-eyes commented 7 months ago

@ctb you are correct, I have a script. I have created a PR for it here https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/198

The PR will be ready after writing tests, though.

mr-eyes commented 7 months ago

For the 14,663,820 pairwise comparisons done by branchwater pairwise on 5,416 sourmash signatures, the HDF5 conversion is done in 38 seconds and 7.5 GB of RAM, saved to a 225 MB file. I will report the profiling of the exporting code later and update the issue; it takes a long time but has a low memory footprint (it does not load the whole dense matrix into memory).

vinisalazar commented 7 months ago

That's awesome, thank you @ctb / sourmash team.

mr-eyes commented 7 months ago

it takes a long time but has a low memory footprint (it does not load the whole dense matrix into memory).

After an enhancement, it now takes 10 seconds to export the dense TSV matrix.

ctb commented 2 months ago

FYI - #3134 contains a bunch of information about how to use the pairwise command from the branchwater plugin, including newly available conversion commands, as well as a check that the results are identical to those output by sourmash compare.