Open WiseHolo opened 4 years ago
Yes, this is indeed counterintuitive behavior. The reason is that cascaded clustering has a big effect on the runtime. Linclust performs the following steps:
(1) kmermatcher <- assign sequences to cluster centroids
(2) rescorediagonal <- using some fast ungapped alignment between centroids and members
(3) clust <- clusters the sequences that already passed the alignment criteria and remove them from the remaining set
(4) rescorediagonal <- remove hits that have a low chance to fulfil the alignment criteria
(5) align <- align the remaining hits with Gotoh-Smith-Waterman
The more sequences Linclust filters at the early stage (3), the faster the algorithm runs, since the slowest part is the Gotoh-Smith-Waterman algorithm.
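To make the effect of the cascade concrete, here is a toy cost model of steps (2) to (5). It is only a sketch: the function name, the hit counts, and the relative costs (gapped alignment assumed 100x more expensive than ungapped rescoring) are made-up assumptions, not MMseqs2 measurements.

```python
def toy_linclust_cost(n_hits, n_clustered_early, n_rejected,
                      ungapped_cost=1.0, gapped_cost=100.0):
    """Toy cost of one pass through steps (2)-(5), in arbitrary units.

    All numbers are illustrative assumptions, not MMseqs2 measurements.
    """
    if n_clustered_early + n_rejected > n_hits:
        raise ValueError("cannot remove more hits than exist")
    after_clust = n_hits - n_clustered_early   # hits left after step (3)
    to_align = after_clust - n_rejected        # hits left after step (4)
    return (n_hits * ungapped_cost             # step (2): rescore every hit
            + after_clust * ungapped_cost      # step (4): rescore what is left
            + to_align * gapped_cost)          # step (5): gapped alignment


# Same number of k-mer hits, but different amounts settled at step (3).
many_early = toy_linclust_cost(1000, n_clustered_early=900, n_rejected=50)
few_early = toy_linclust_cost(1000, n_clustered_early=500, n_rejected=50)
print(many_early, few_early)
```

In this model the gapped alignment term dominates as soon as fewer hits are settled at step (3), which is the point made above.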
Hi, I have a query related to this thread about Linclust runtime for 99% seq id. I have been trying to use Linclust to cluster the entire blast nt database containing 60.6 million nucl seqs with 1TB RAM and 40 threads using the following command:
mmseqs easy-linclust nt.gz db/nt tmpdir --min-seq-id 0.99 -c 0.8 --cov-mode 1 --cluster-mode 2 --threads 40
As you explained above, the job is at the second step of rescorediagonal for almost 3 days now with only 4 points of progress printed ([====). What is the total runtime that I could expect and is there any way I could speed it up?
Thanks, Manu
Could you give me some more information about what you are trying to cluster? Rescorediagonal should be very fast, and 60.6 million is not a large number of sequences. Are the sequences very long?
Here are the exact statistics of the sequences that I am trying to cluster to help understand what's going on.
And here is the log file of mmseqs easy-linclust after 3000 CPU hours.
What parameters of mmseqs easy-linclust would be best to cluster such a distribution of sequences at 99% identity?
Thanks for your time.
My assumption is that the long sequences dominate the runtime. Linclust was built for short sequences (< 100 kb) and is slow when you try to align genomes against each other.
One trick that might speed up the process would be to use bidirectional coverage, --cov-mode 0. This coverage mode rejects all sequences that cannot fulfil the coverage criteria, which hopefully avoids most of the long-running alignments.
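The arithmetic behind this can be sketched in a few lines. This is not MMseqs2's actual code: the function names and the sequence lengths are hypothetical, and the check assumes an ungapped best case where the alignment spans at most the shorter sequence.

```python
def can_cover_both(query_len, target_len, min_cov):
    """Best-case check for bidirectional coverage (as in --cov-mode 0).

    Assumes an ungapped best case: the alignment spans at most the
    shorter sequence, so the longer one is covered by at most
    shorter / longer.
    """
    shorter, longer = sorted((query_len, target_len))
    return shorter / longer >= min_cov


def can_cover_target(query_len, target_len, min_cov):
    """Best-case check for target-only coverage (as in --cov-mode 1)."""
    return min(query_len, target_len) / target_len >= min_cov


# Hypothetical lengths: a 5 Mb genome as query against a 1 kb target.
genome, fragment = 5_000_000, 1_000
print(can_cover_target(genome, fragment, 0.8))  # pair stays a candidate
print(can_cover_both(genome, fragment, 0.8))    # pair is ruled out by length
```

With -c 0.8, a pair whose length ratio is below 0.8 can never satisfy coverage on both sides, so it can be discarded from the lengths alone, before any alignment is computed. Under target-only coverage the same genome-versus-fragment pair remains a candidate.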
Thank you for the quick response.
The running time of Linclust for thresholds between 50% and 90% is normally about the same. But why does Linclust's running time gradually increase for thresholds from 90% to 100%? The runtime at 100% is almost three times that at 90%. Can someone tell me why this happens?