Open ctb opened 1 year ago
some things we could do -
loaded
in src/sourmash/commands.py:compare
).
@ccbaumler this is a good benchmarking issue! some things to measure, and some things to fix!
I should also say that I'm not 100% sure it was OOM, but it ran for quite some time, then I left for lunch, and when I returned my whole remote session had crashed (I was using wezterm connect).
OOM seems like the most likely explanation to me. But if sketches are held in memory, assuming ~80GB for the matrix itself, and assuming there were some other processes running requiring some memory, 40GB remaining / 100k sketches means sketches would only have to be on the order of 400KB each in memory to cause a problem. Is this plausible?
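To make the arithmetic above concrete, here is a rough back-of-the-envelope sketch. It assumes a dense float64 matrix (8 bytes per cell) and a hypothetical 120 GB of usable RAM, i.e. the ~80GB matrix plus the 40GB remainder mentioned above; neither figure was measured.

```python
# Back-of-the-envelope memory budget for an all-by-all comparison.
# Assumptions (not measured): dense float64 matrix, 120 GB of usable RAM.
BYTES_PER_FLOAT64 = 8


def dense_matrix_bytes(n_sketches: int) -> int:
    """Size of an n x n float64 similarity matrix, in bytes."""
    return n_sketches * n_sketches * BYTES_PER_FLOAT64


def per_sketch_budget_bytes(total_ram_bytes: float, n_sketches: int) -> float:
    """RAM left per sketch once the matrix itself has been allocated."""
    remaining = total_ram_bytes - dense_matrix_bytes(n_sketches)
    return remaining / n_sketches


n = 100_000
matrix_gb = dense_matrix_bytes(n) / 1e9
budget_kb = per_sketch_budget_bytes(120e9, n) / 1e3
print(f"matrix: {matrix_gb:.0f} GB, budget per sketch: {budget_kb:.0f} KB")
```

Under these assumptions, any sketch that averages more than that per-sketch budget in memory would push the job past the limit.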
Another idea would be to give the option of reading sketches from disk each time to save memory at the expense of speed, though if you don't have mmapped sketches and have to scan from the beginning each time, this could get real slow.
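A minimal sketch of the read-from-disk idea, assuming each sketch is stored as a plain JSON list of hash values. That toy format, and the helper names `load_hashes` and `pairwise_from_disk`, are hypothetical and are not sourmash's signature format or loader. Only two sketches are in memory at any time, at the cost of re-reading files for every row.

```python
import json
import os
import tempfile


def load_hashes(path):
    """Load one toy sketch (a JSON list of ints) as a set."""
    with open(path) as fh:
        return set(json.load(fh))


def jaccard(a, b):
    """Jaccard similarity of two hash sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def pairwise_from_disk(paths):
    """Yield (i, j, jaccard), holding at most two sketches in memory."""
    for i, path_i in enumerate(paths):
        row = load_hashes(path_i)  # loaded once per row
        for j in range(i + 1, len(paths)):
            # re-read from disk every time: slow, but memory stays flat
            yield i, j, jaccard(row, load_hashes(paths[j]))


def demo():
    """Write three toy sketches to a temp dir and compare them."""
    toy = [[1, 2, 3, 4], [3, 4, 5, 6], [7, 8]]
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for idx, hashes in enumerate(toy):
            path = os.path.join(tmp, f"sketch{idx}.json")
            with open(path, "w") as fh:
                json.dump(hashes, fh)
            paths.append(path)
        return list(pairwise_from_disk(paths))


results = demo()
for i, j, sim in results:
    print(i, j, round(sim, 3))
```

The number of file loads grows quadratically with the number of sketches, which is the "real slow" part of the trade-off.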
this might also be a place where we just say ... "hey, have you considered using kspider or gatherly?" cc @mr-eyes
Hi,
I am trying to run sourmash compare on 1k metagenome signatures (ranging between 5-150 MB in size) and I am constantly getting OOM errors. I was going to open a new issue, but came across this one and thought I should comment.
Here are some configurations (RAM and number of processes) I've tried:
-p 1
-p 4
-p 64
Any ideas on what might be happening? Do I just need to throw more memory at it?
Here are some example outputs of my job stats:
State: FAILED (exit code 1)
Cores: 1
CPU Utilized: 00:03:31
CPU Efficiency: 28.75% of 00:12:14 core-walltime
Job Wall-clock time: 00:12:14
Memory Utilized: 78.06 GB
Memory Efficiency: 81.32% of 96.00 GB
State: FAILED (exit code 1)
Nodes: 1
Cores per node: 64
CPU Utilized: 01:58:51
CPU Efficiency: 0.27% of 31-01:40:16 core-walltime
Job Wall-clock time: 11:39:04
Memory Utilized: 253.37 GB
Memory Efficiency: 98.97% of 256.00 GB
Thank you for any assistance you can provide, Vini
ahh, metagenome signatures 😱 . It's not completely surprising to me b/c I would guess that most of the memory usage is being taken up by just loading the sketches into memory. Still, ...suboptimal.
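Some details that would be helpful in terms of providing guidance:
what's the biggest and smallest sketch by file size? (output of ls -lS | head -2 and ls -lS | tail -2)
for the biggest and smallest sketch file, what does sourmash sig summarize report for them?
what's the command line you're using?
thanks!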
One specific recommendation: raise the scaled value when doing compare, e.g. sourmash compare --scaled 10000 ... . The results will not be significantly different for metagenomes of this size!
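For intuition on why a larger scaled value helps, here is a simplified model of a scaled sketch (not sourmash's own code). It assumes the sketch keeps only those 64-bit hashes that fall below roughly 2^64 / scaled, so a 10x larger scaled value keeps about a tenth of the hashes. The random "hashes" stand in for real k-mer hashes.

```python
import random

MAX_HASH_SPACE = 2**64


def max_hash_for_scaled(scaled):
    """Cutoff below which hashes are kept, in this simplified model."""
    return MAX_HASH_SPACE // scaled


def downsample(hashes, scaled):
    """Keep only the hashes that fall under the cutoff for `scaled`."""
    cutoff = max_hash_for_scaled(scaled)
    return {h for h in hashes if h < cutoff}


# Stand-in for the hashes of one million distinct k-mers.
rng = random.Random(42)
kmer_hashes = [rng.randrange(MAX_HASH_SPACE) for _ in range(1_000_000)]

sketch_1k = downsample(kmer_hashes, 1000)
sketch_10k = downsample(sketch_1k, 10000)  # downsample an existing sketch

print(len(sketch_1k), len(sketch_10k))
```

Because the cutoff for scaled=10000 is lower than the one for scaled=1000, downsampling an existing sketch gives the same hash set as sketching at the higher scaled value directly.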
You can also try pyo3_branchwater's multisearch command which should be much more memory efficient but this is still early-stage so, you know, buyer beware 😆 . The output is also not as convenient as sourmash compare's for viz, since it's just a sparse set of places where an above-threshold comparison was found.
Last but not least, I'm not sure where @mr-eyes kSpider is, but it was built for this purpose, so: https://github.com/sourmash-bio/sourmash/issues/2271 and https://dib-lab.github.io/kSpider/
OK, actually last: check out https://github.com/sourmash-bio/sourmash/issues/2735. I'm challenged by the notion of using sourmash compare for metagenomes. 🤷
kSpider@dev branch has the most recent updates (yet no docs for it), but I am happy to help build/run it until released. Also, I think branchwater/multisearch can tackle this.
The way sourmash compare works will not be convenient except for a small number of small signatures.
And here's a bonus script for converting branchwater/multisearch results to a Newick file. https://github.com/sourmash-bio/pyo3_branchwater/issues/111
You can also modify the script in https://github.com/sourmash-bio/pyo3_branchwater/issues/111 to generate similar output to sourmash compare. You will simply need to do pandas to_csv.
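As a rough illustration of that conversion (not the linked script), the sketch below turns sparse pairwise rows into a dense, symmetric matrix and writes it as CSV with a header row of names. It uses only the standard library rather than pandas. The column names `query_name`, `match_name` and `jaccard` are assumptions about the sparse output, and filling unreported pairs with 0.0 and the diagonal with 1.0 is also an assumption.

```python
import csv
import io


def sparse_to_dense(rows, name_a="query_name", name_b="match_name",
                    value="jaccard"):
    """Build a symmetric dense matrix from sparse pairwise rows."""
    names = sorted({r[name_a] for r in rows} | {r[name_b] for r in rows})
    index = {name: i for i, name in enumerate(names)}
    # Pairs that were never reported (below threshold) stay at 0.0.
    matrix = [[0.0] * len(names) for _ in names]
    for i in range(len(names)):
        matrix[i][i] = 1.0  # assume self-similarity of 1.0
    for r in rows:
        i, j = index[r[name_a]], index[r[name_b]]
        matrix[i][j] = matrix[j][i] = float(r[value])
    return names, matrix


def write_dense_csv(names, matrix, out):
    """Write a header row of names followed by the matrix rows."""
    writer = csv.writer(out)
    writer.writerow(names)
    writer.writerows(matrix)


# Toy sparse input with the assumed column names.
sparse_csv = "query_name,match_name,jaccard\nA,B,0.5\nB,C,0.25\n"
rows = list(csv.DictReader(io.StringIO(sparse_csv)))
names, matrix = sparse_to_dense(rows)

buf = io.StringIO()
write_dense_csv(names, matrix, buf)
dense_csv = buf.getvalue()
print(dense_csv)
```

With pandas, the same idea would be a pivot of the sparse table followed by to_csv, as suggested above.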
Hi @ctb and @mr-eyes, thank you for the detailed answers.
what's the biggest and smallest sketch by file size? (output of ls -lS | head -2 and ls -lS | tail -2)
# Biggest file
-rw-r--r-- 1 viniws punim1293 148413087 Sep 12 21:05 SRS2329668_T1.sig
# Smallest file
-rw-r--r-- 1 viniws punim1293 1443785 Sep 12 21:04 SRS954962_T1.sig
for the biggest and smallest sketch file, what does sourmash sig summarize report for them?
(bio) spartan ➜ sourmash sourmash sig summarize SRS2329668_T1.sig
== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
** loading from 'SRS2329668_T1.sig'
path filetype: MultiIndex
location: SRS2329668_T1.sig
is database? no
has manifest? yes
num signatures: 1
** examining manifest...
total hashes: 8530366
summary of sketches:
1 sketches with DNA, k=31, scaled=1000 8530366 total hashes
(bio) spartan ➜ sourmash sourmash sig summarize SRS954962_T1.sig
== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
** loading from 'SRS954962_T1.sig'
path filetype: MultiIndex
location: SRS954962_T1.sig
is database? no
has manifest? yes
num signatures: 1
** examining manifest...
total hashes: 82957
summary of sketches:
1 sketches with DNA, k=31, scaled=1000 82957 total hashes
what's the command line you're using?
time sourmash compare -p 24 --distance-matrix --csv sourmash_compare.csv -o sourmash_compare.out *.sig
== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loaded 1031 signatures total.
Created memmapped siglist
Initialized memmapped similarities matrix
Created similarity func
Calculated chunk size for multiprocessing
Initialized multiprocessing pool.imap
Killed
real 21m31.634s
user 19m53.965s
sys 1m37.449s
I will check out the other resources that you've listed and report back.
Best, Vini
In the end it took me 800 GB for 1123 metagenomes:
Cores per node: 16
CPU Utilized: 14:03:43
CPU Efficiency: 20.23% of 2-21:29:52 core-walltime
Job Wall-clock time: 04:20:37
Memory Utilized: 792.52 GB
wow! thank you for letting us know!
@kescobo @vinisalazar the branchwater plugin for sourmash now has a pairwise command (https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/181) that is both multithreaded and (judging by other benchmarks) likely to use 10-100x less memory than sourmash compare.
It is also now pretty straightforward to install from conda-forge, which is nice :).
I'm pretty sure @mr-eyes has a script to convert its output into a matrix format but I am unable to find it at the moment. Mo? (I don't think it's this one)
Anyway, just wanted to drop by and say this 😆 . We haven't fixed sourmash compare yet, but ...eventually...
@ctb you are correct, I have a script. I have created a PR for it here https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/198
The PR will be ready after writing tests, though.
For 14,663,820 pairwise comparisons done by branchwater pairwise of 5416 sourmash signatures, the HDF5 conversion is done in 38 seconds and 7.5GB RAM, saved to a 225M file. I will report the profiling of the exporting code later and update the issue, but it takes a long time and has a low memory footprint (it does not load the whole dense matrix in memory).
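As a quick sanity check on those numbers: 14,663,820 is exactly the number of unordered pairs among 5416 signatures. The dense size computed below assumes float64 cells, which is an assumption; the dtype and HDF5 layout actually used are not stated here.

```python
import math

n_signatures = 5416

# Number of unordered pairs, i.e. n choose 2.
n_pairs = math.comb(n_signatures, 2)

# Size of a full n x n matrix, assuming 8-byte float64 cells.
dense_bytes = n_signatures**2 * 8

print(f"pairs: {n_pairs:,}")
print(f"dense float64 matrix: {dense_bytes / 2**20:.1f} MiB")
```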
That's awesome, thank you @ctb / sourmash team.
but it takes a long time and has a low memory footprint (it does not load the whole dense matrix in memory).
After an enhancement, it now takes 10 seconds to export the TSV dense matrix.
FYI - #3134 contains a bunch of information about how to use the pairwise command from the branchwater plugin, including newly available conversion commands, as well as a check that the results are identical to those output by sourmash compare.
@kescobo ran into some out-of-memory errors when trying to do sourmash compare on 100k genomes, and we became curious about memory usage 😆