`sourmash compare` runs out of memory on large comparisons

yuzie0314 commented 7 months ago

Hi @ctb the sourmash author,

Currently we are working on using your tool to find the representative MAGs within a customed data set assembled from several deeply sequenced stool shotgun samples. Those MAGs actually are classified as the same family level using gtdbtk reference genomes. We have up to 14,500 genomes in this data set, and we want to compute an ani pair-wise matrix using the following command.

sourmash compare -p 8 -k 31 --ani -o ani_matrix.numpy --csv ani_matrix.csv cluster_mash/*.sig

However, after around 1 hr processing, we got a weird error called BrokenPipeError, so we started to think if there is any limitation when using sourmash compare to generate an ani matrix. I think this kind of error is dereived from out off memory, correct me if I am wrong.

P.S. We are using 16 cores and 32 Gb ram, aws EC2 Linux. we also saw a message called Killedison for index 886 done in 9.36945 seconds, which might be another reason why this error happend. Current version is v4.8.2 sourmash.

ctb commented 7 months ago

hi @yuzie0314 - yes, the Killed means sourmash used too much memory.

there is a long issue https://github.com/sourmash-bio/sourmash/issues/2299 about this. we are still in a bit of a confused state in terms of recommendations, but the gist of that issue is:

you should be able to use pairwise and cluster from the branchwater plugin to do your clustering very quickly and in very low memory.

@bluegenes any words of wisdom here?

yuzie0314 commented 6 months ago

Hi @ctb, I learned that commands pairwise and cluster might be useful, but I have some doubts about the results from them. The main products from sourmash compare is a pairwise ani matrix in csv and a numpy pickle file + label text file, so could we use pairwise to generate this? Another question is that I am a little confused about cluster command, what is the main cutoff/ threshold/ metrix to cluster genomes/signatures? is the clustering result based on ANI similarity or not merely rely on ANI? could you help us to clarify this?

This is a really good news to us, since we want to improve the speed of our customed pipelines. Yuzie

yuzie0314 commented 6 months ago

Hi @ctb the author,

I used the following command and generated a csv result.

sourmash scripts pairwise --ani \
        --cores 16 \
        --output ani_matrix.csv \
        --scaled 1000000 \
        --ksize 21 \
        sig.path \
        --write-all

The result is different from sourmash compare method:

Thanks for your help, Yuzie

ctb commented 6 months ago

matrix vs CSV output

hi @yuzie0314, yep, the output of pairwise is a different format - this is because the numpy matrix format is not a sparse matrix, and for large comparisons/large collections it will have a lot of zeroes in it. In this case the CSV format is better, because it only stores pairs where there is a match.

Please see https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/198 for a script that converts the CSV file into a numpy matrix. I haven't tried it yet myself, I'm afraid, but if you run into problems please feel free to post here and we'll see what we can do!

how does `cluster` work?

here is the basic description from https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/234 -

clusteruses rustworkx-core (which internally uses petgraph) to build a graph, adding edges between nodes when the similarity exceeds the user-defined threshold. It can work on any of the similarity columns output by pairwise or multisearch, and will add all nodes to the graph to preserve singleton 'clusters' in the output.

from the docs -

cluster takes a --similarity_column argument to specify which of the similarity columns, with the following choices: containment, max_containment, jaccard, average_containment_ani, maximum_containment_ani. All values should be input as fractions (e.g. 0.9 for 90%)

per @mr-eyes in https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/252,

Community Detection: The current clustering algorithm is weakly_connected_component, @bluegenes tried it before with kSpider, and -as far as I remember- it did a great job in the ANI-based clustering of the GTDB-207. Here, I propose adopting community detection methods, which have been proven very useful in DBRetina, but I haven't tried them on DNA data.

ordination

we have a script here that can build an ordination view (MDS) from a matrix, and, with some small edits, should work on the output of pairwise - lmk if this is something you would be interested in.

further questions :)

what's your end goal? then we can see if we can help you get there!

cluster visualization?
and/or cluster extraction?

and/or something else?

ctb commented 6 months ago

hi @yuzie0314 I got inspired by your question (and also by some of my own research needs ;)) and built a plugin that I think will help you - see https://github.com/sourmash-bio/sourmash_plugin_betterplot/.

Specifically:

the command sourmash scripts pairwise_to_comparison will convert the CSV output of pairwise to standard sourmash compare format. So you can do fast comparisons with pairwise and then turn them into regular matrices.
you might also be interested in the --cut-line and cluster output functionality :)
I mean I'm also kind of happy with the mds commands...

If you have suggestions or requests for further functionality, please let me know! It's easy and fun to add new stuff to this plugin!

yuzie0314 commented 6 months ago

wow, a little complicated answers, I would need some time to test on my environment, but thanks for your contribution again, really helpful @ctb.

I think the solutions you provided is worth us to explore. Will update you once we have any doubts and good news.

yuzie0314 commented 6 months ago

hi @yuzie0314 I got inspired by your question (and also by some of my own research needs ;)) and built a plugin that I think will help you - see https://github.com/sourmash-bio/sourmash_plugin_betterplot/.

Specifically:

the command sourmash scripts pairwise_to_comparison will convert the CSV output of pairwise to standard sourmash compare format. So you can do fast comparisons with pairwise and then turn them into regular matrices.

you might also be interested in the --cut-line and cluster output functionality :)

I mean I'm also kind of happy with the mds commands...

If you have suggestions or requests for further functionality, please let me know! It's easy and fun to add new stuff to this plugin!

Hi @ctb, I had a quick test on your command sourmash compare and sourmash scripts pairwise --ani combined with sourmash scripts pairwise_to_compare using four metagenomes.

The results from sourmash compare command alone: 螢幕擷取畫面 2024-05-20 142557

The results from sourmash scripts pairwise --ani combined with sourmash scripts pairwise_to_compare: 螢幕擷取畫面 2024-05-20 142455

As you can notice that compare would outputs the pair-wise ani comparison results even the ani score is low, but for the new combination commands we would only obtain partial information, which would be a little hard for us to interpret the results from the sourmash scripts pairwise. We also tried to tweak the -t 0 and expected that whether sourmash would greedily show the results between different comparison, but the results were all the same as we didn't assigned any threshold.

The commands we used:
1. sourmash compare -p 8 --ani -o ani_matrix.numpy --csv ani_matrix.csv cluster_mash/*

2. sourmash scripts pairwise --ani \
        --cores 16 \
        --output ani_pairwise.csv \
        --scaled 1000000 \
        --ksize 21 \
        sig.path \
        --write-all \
        -t 0

3. sourmash scripts pairwise_to_compare ani_pairwise.csv -o ani_matrix.numpy

ctb commented 6 months ago

That's a great test! You should get identical results (although I will confess I have not tried it myself). I will try it out on my own set of data, but - I'm curious - why set --scaled 1000000 with pairwise? That would be my main guess as to why you're getting different results.

yuzie0314 commented 6 months ago

ohh ya, I only wated to test whether adding --scaled 1000000 would change the results or not. BTW, I also tested the command that dropped off this flag, the results are still different the original one. The row and column are in the same order so we can clearly observed that the ani values in pairwise command are much lower than compare command.

Is there anything I missing? just let me know I would like to test in my environment.

The commands we used:
1. sourmash scripts pairwise --ani \
        --cores 16 \
        --output ani_pairwise.csv \
        --ksize 21 \
        sig.path \
        --write-all \
        -t 0

2. sourmash scripts pairwise_to_compare ani_pairwise.csv -o ani_matrix.numpy

ctb commented 6 months ago

I'll have to take a look. Have you compared the Jaccard index or containment matrices, rather than the ANI? I'm wondering if there's a difference in the ANI calculations - then Jaccard/containment would be the same, but ANI would be different. (Which would be a bug, just to be clear!)

yuzie0314 commented 6 months ago

Sorry for the late, we were only focusing on the ani comparison results, and didn't check other methods. Is containment metric similar to ani? because from your document I learned that you have a containment ani column in gather results.

ctb commented 5 months ago

I am a little bit worried that the ANI numbers in sourmash_plugin_branchwater are incorrect - we are seeing differences in sourmash gather output per https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/331#issuecomment-2168882104. Sorry, just put 2+2 together and realized it may be an issue here! Will investigate and report back.

ctb commented 5 months ago

hi! I did some quick validation on a subset of Thermotoga genomes I had lying around.

`sourmash compare --ani --containment`

% sourmash compare /tmp/thermo.sig.zip -k 31 --ani --containment --labels-to /tmp/thermo.labels.csv

== This is sourmash version 4.8.10.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 3 signatures total.

0-CP000916.1 Ther...    [1.    0.915 0.925]
1-CP000702.1 Ther...    [0.914 1.    0.981]
2-CP000969.1 Ther...    [0.925 0.982 1.   ]

`sourmash scripts multisearch /tmp/thermo.sig.zip /tmp/thermo.sig.zip -k 31 -o /tmp/thermo.multisearch.csv --ani`

so the numbers look pretty similar.

I'm using multisearch here because pairwise doesn't return self-comparisons.

But, when I use pairwise, the numbers hold up:

aaaand... I think I've figured it out 😭 . The betterplot plugin command pairwise_to_matrix only pulls out the Jaccard column, not the ANI column.

Sigh. OK, I added the -u/--use-column to pairwise_to_matrix in the latest version of betterplot (which you can install with pip install -U sourmash_plugin_betterplot.

Now, when I run:

sourmash scripts pairwise thermo.sig.zip -k 31 -o thermo.pairwise.csv --ani --write-all
sourmash scripts pairwise_to_matrix thermo.pairwise.csv -o thermo.pairwise.mat -u average_containment_ani

and then look at the matrix:

>>> import numpy
>>> numpy.load('thermo.pairwise.mat')
array([[1.        , 0.91432939, 0.98119423],
       [0.91432939, 1.        , 0.9250674 ],
       [0.98119423, 0.9250674 , 1.        ]])

it all looks good.

Apologies for this misunderstanding!

ctb commented 5 months ago

ok, I sat down and did a much more thorough evaluation over in https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/366, for the average_containment_ani column.

This comparison generated sourmash compare --ani --avg-containment results and compared them to the average_containment column output by sourmash scripts pairwise on the same data.

They are identical 🎉

I'll close this issue when I add a mention of the pairwise command to the sourmash documentation. Happy to take more questions here or in a new issue!

yuzie0314 commented 3 months ago

Hi @ctb sorry for the late, I tested on your command several times, which works fine in sourmash scripts pairwise command, but I found that when I used the same command namely sourmash scripts pairwise_to_matrix ani_pairwise_u.csv -o ani_matrix_u.numpy -u average_containment_ani the results are still the jacarrd? and I think I've update the package you asked me to update pip install -U sourmash_plugin_betterplot.

my command :

sourmash scripts pairwise --cores 16 --ksize 31 --output ani_pairwise_u.csv --ani --write-all sig.path
sourmash scripts pairwise_to_matrix ani_pairwise_u.csv -o ani_matrix_u.numpy -u **average_containment_ani**

螢幕擷取畫面 2024-09-05 161239 The results of the original command sourmash compare -p 16 -k 31 --ani -o ani_matrix.numpy --csv ani_matrix.csv cluster_mash/*:

In addition, I also checked their labels are in the same order....so what I could image the reasons are:

The results from sourmash scripts pairwis and sourmash compare are not always the same
sourmash scripts pairwise_to_matrix still using the jacarrd values in sourmash scripts pairwis.

but chances are that I think the main reason might be the former, because I also wrote a very ugly py code to transfer the results from pairwise to np.matrix....got the same results as sourmash scripts pairwise_to_matrix. My comparisons are quite large, it's around 15,000 near completed genomes. If you want me to provide a subset of our data, please let me know.

P.S. The version in my singularity image: sourmash: 0.8.11 branchwater: 0.9.7 sourmash-plugin-betterplot: 0.4.4

sourmash-bio / sourmash