Open yuzie0314 opened 7 months ago
hi @yuzie0314 - yes, the Killed
means sourmash used too much memory.
there is a long issue https://github.com/sourmash-bio/sourmash/issues/2299 about this. we are still in a bit of a confused state in terms of recommendations, but the gist of that issue is:
you should be able to use pairwise
and cluster
from the branchwater plugin to do your clustering very quickly and in very low memory.
@bluegenes any words of wisdom here?
Hi @ctb,
I learned that commands pairwise
and cluster
might be useful, but I have some doubts about the results from them. The main products from sourmash compare
is a pairwise ani matrix in csv and a numpy pickle file + label text file, so could we use pairwise
to generate this? Another question is that I am a little confused about cluster
command, what is the main cutoff/ threshold/ metrix to cluster genomes/signatures? is the clustering result based on ANI similarity or not merely rely on ANI? could you help us to clarify this?
This is a really good news to us, since we want to improve the speed of our customed pipelines. Yuzie
Hi @ctb the author,
I used the following command and generated a csv result.
sourmash scripts pairwise --ani \
--cores 16 \
--output ani_matrix.csv \
--scaled 1000000 \
--ksize 21 \
sig.path \
--write-all
The result is different from sourmash compare
method:
Thanks for your help, Yuzie
hi @yuzie0314, yep, the output of pairwise
is a different format - this is because the numpy matrix format is not a sparse matrix, and for large comparisons/large collections it will have a lot of zeroes in it. In this case the CSV format is better, because it only stores pairs where there is a match.
Please see https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/198 for a script that converts the CSV file into a numpy matrix. I haven't tried it yet myself, I'm afraid, but if you run into problems please feel free to post here and we'll see what we can do!
cluster
work?here is the basic description from https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/234 -
cluster
usesrustworkx-core
(which internally usespetgraph
) to build a graph, adding edges between nodes when the similarity exceeds the user-defined threshold. It can work on any of the similarity columns output bypairwise
ormultisearch
, and will add all nodes to the graph to preserve singleton 'clusters' in the output.
from the docs -
cluster takes a --similarity_column argument to specify which of the similarity columns, with the following choices: containment, max_containment, jaccard, average_containment_ani, maximum_containment_ani. All values should be input as fractions (e.g. 0.9 for 90%)
per @mr-eyes in https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/252,
- Community Detection: The current clustering algorithm is
weakly_connected_component,
@bluegenes tried it before with kSpider, and -as far as I remember- it did a great job in the ANI-based clustering of the GTDB-207. Here, I propose adopting community detection methods, which have been proven very useful in DBRetina, but I haven't tried them on DNA data.
we have a script here that can build an ordination view (MDS) from a matrix, and, with some small edits, should work on the output of pairwise
- lmk if this is something you would be interested in.
what's your end goal? then we can see if we can help you get there!
and/or something else?
hi @yuzie0314 I got inspired by your question (and also by some of my own research needs ;)) and built a plugin that I think will help you - see https://github.com/sourmash-bio/sourmash_plugin_betterplot/.
Specifically:
sourmash scripts pairwise_to_comparison
will convert the CSV output of pairwise
to standard sourmash compare
format. So you can do fast comparisons with pairwise
and then turn them into regular matrices.--cut-line
and cluster output functionality :)mds
commands...If you have suggestions or requests for further functionality, please let me know! It's easy and fun to add new stuff to this plugin!
wow, a little complicated answers, I would need some time to test on my environment, but thanks for your contribution again, really helpful @ctb.
I think the solutions you provided is worth us to explore. Will update you once we have any doubts and good news.
hi @yuzie0314 I got inspired by your question (and also by some of my own research needs ;)) and built a plugin that I think will help you - see https://github.com/sourmash-bio/sourmash_plugin_betterplot/.
Specifically:
- the command
sourmash scripts pairwise_to_comparison
will convert the CSV output ofpairwise
to standardsourmash compare
format. So you can do fast comparisons withpairwise
and then turn them into regular matrices.- you might also be interested in the
--cut-line
and cluster output functionality :)- I mean I'm also kind of happy with the
mds
commands...If you have suggestions or requests for further functionality, please let me know! It's easy and fun to add new stuff to this plugin!
Hi @ctb,
I had a quick test on your command sourmash compare
and sourmash scripts pairwise --ani
combined with sourmash scripts pairwise_to_compare
using four metagenomes.
The results from sourmash compare
command alone:
The results from sourmash scripts pairwise --ani
combined with sourmash scripts pairwise_to_compare
:
As you can notice that compare
would outputs the pair-wise ani comparison results even the ani score is low, but for the new combination commands we would only obtain partial information, which would be a little hard for us to interpret the results from the sourmash scripts pairwise
. We also tried to tweak the -t 0
and expected that whether sourmash would greedily show the results between different comparison, but the results were all the same as we didn't assigned any threshold.
The commands we used:
1. sourmash compare -p 8 --ani -o ani_matrix.numpy --csv ani_matrix.csv cluster_mash/*
2. sourmash scripts pairwise --ani \
--cores 16 \
--output ani_pairwise.csv \
--scaled 1000000 \
--ksize 21 \
sig.path \
--write-all \
-t 0
3. sourmash scripts pairwise_to_compare ani_pairwise.csv -o ani_matrix.numpy
That's a great test! You should get identical results (although I will confess I have not tried it myself). I will try it out on my own set of data, but - I'm curious - why set --scaled 1000000
with pairwise
? That would be my main guess as to why you're getting different results.
ohh ya,
I only wated to test whether adding --scaled 1000000
would change the results or not.
BTW, I also tested the command that dropped off this flag, the results are still different the original one.
The row and column are in the same order so we can clearly observed that the ani values in pairwise
command are much lower than compare
command.
Is there anything I missing? just let me know I would like to test in my environment.
The commands we used:
1. sourmash scripts pairwise --ani \
--cores 16 \
--output ani_pairwise.csv \
--ksize 21 \
sig.path \
--write-all \
-t 0
2. sourmash scripts pairwise_to_compare ani_pairwise.csv -o ani_matrix.numpy
I'll have to take a look. Have you compared the Jaccard index or containment matrices, rather than the ANI? I'm wondering if there's a difference in the ANI calculations - then Jaccard/containment would be the same, but ANI would be different. (Which would be a bug, just to be clear!)
Sorry for the late, we were only focusing on the ani comparison results, and didn't check other methods.
Is containment
metric similar to ani? because from your document I learned that you have a containment ani column in gather
results.
I am a little bit worried that the ANI numbers in sourmash_plugin_branchwater are incorrect - we are seeing differences in sourmash gather
output per https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/331#issuecomment-2168882104. Sorry, just put 2+2 together and realized it may be an issue here! Will investigate and report back.
hi! I did some quick validation on a subset of Thermotoga
genomes I had lying around.
sourmash compare --ani --containment
% sourmash compare /tmp/thermo.sig.zip -k 31 --ani --containment --labels-to /tmp/thermo.labels.csv
== This is sourmash version 4.8.10.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loaded 3 signatures total.
0-CP000916.1 Ther... [1. 0.915 0.925]
1-CP000702.1 Ther... [0.914 1. 0.981]
2-CP000969.1 Ther... [0.925 0.982 1. ]
sourmash scripts multisearch /tmp/thermo.sig.zip /tmp/thermo.sig.zip -k 31 -o /tmp/thermo.multisearch.csv --ani
so the numbers look pretty similar.
I'm using multisearch here because pairwise doesn't return self-comparisons.
But, when I use pairwise, the numbers hold up:
aaaand... I think I've figured it out 😭 . The betterplot plugin command pairwise_to_matrix
only pulls out the Jaccard column, not the ANI column.
Sigh. OK, I added the -u/--use-column
to pairwise_to_matrix
in the latest version of betterplot (which you can install with pip install -U sourmash_plugin_betterplot
.
Now, when I run:
sourmash scripts pairwise thermo.sig.zip -k 31 -o thermo.pairwise.csv --ani --write-all
sourmash scripts pairwise_to_matrix thermo.pairwise.csv -o thermo.pairwise.mat -u average_containment_ani
and then look at the matrix:
>>> import numpy
>>> numpy.load('thermo.pairwise.mat')
array([[1. , 0.91432939, 0.98119423],
[0.91432939, 1. , 0.9250674 ],
[0.98119423, 0.9250674 , 1. ]])
it all looks good.
Apologies for this misunderstanding!
ok, I sat down and did a much more thorough evaluation over in https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/366, for the average_containment_ani
column.
This comparison generated sourmash compare --ani --avg-containment
results and compared them to the average_containment
column output by sourmash scripts pairwise
on the same data.
They are identical 🎉
I'll close this issue when I add a mention of the pairwise
command to the sourmash documentation. Happy to take more questions here or in a new issue!
Hi @ctb sorry for the late,
I tested on your command several times, which works fine in sourmash scripts pairwise
command, but I found that when I used the same command namely sourmash scripts pairwise_to_matrix ani_pairwise_u.csv -o ani_matrix_u.numpy -u average_containment_ani
the results are still the jacarrd? and I think I've update the package you asked me to update pip install -U sourmash_plugin_betterplot
.
my command :
sourmash scripts pairwise --cores 16 --ksize 31 --output ani_pairwise_u.csv --ani --write-all sig.path
sourmash scripts pairwise_to_matrix ani_pairwise_u.csv -o ani_matrix_u.numpy -u **average_containment_ani**
The results of the original command sourmash compare -p 16 -k 31 --ani -o ani_matrix.numpy --csv ani_matrix.csv cluster_mash/*
:
In addition, I also checked their labels are in the same order....so what I could image the reasons are:
sourmash scripts pairwis
and sourmash compare
are not always the samesourmash scripts pairwise_to_matrix
still using the jacarrd values in sourmash scripts pairwis
.but chances are that I think the main reason might be the former, because I also wrote a very ugly py code to transfer the results from pairwise
to np.matrix....got the same results as sourmash scripts pairwise_to_matrix
.
My comparisons are quite large, it's around 15,000 near completed genomes.
If you want me to provide a subset of our data, please let me know.
P.S. The version in my singularity image: sourmash: 0.8.11 branchwater: 0.9.7 sourmash-plugin-betterplot: 0.4.4
Hi @ctb the sourmash author,
Currently we are working on using your tool to find the representative MAGs within a customed data set assembled from several deeply sequenced stool shotgun samples. Those MAGs actually are classified as the same family level using gtdbtk reference genomes. We have up to 14,500 genomes in this data set, and we want to compute an ani pair-wise matrix using the following command.
sourmash compare -p 8 -k 31 --ani -o ani_matrix.numpy --csv ani_matrix.csv cluster_mash/*.sig
However, after around 1 hr processing, we got a weird error called BrokenPipeError, so we started to think if there is any limitation when using sourmash compare to generate an ani matrix. I think this kind of error is dereived from out off memory, correct me if I am wrong.
P.S. We are using 16 cores and 32 Gb ram, aws EC2 Linux. we also saw a message called
Killedison for index 886 done in 9.36945 seconds
, which might be another reason why this error happend. Current version is v4.8.2 sourmash.