Question in Linclust Running time

soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite

https://mmseqs.com

GNU General Public License v3.0

Question in Linclust Running time #351

Open WiseHolo opened 4 years ago

WiseHolo commented 4 years ago

Linclust's running time is roughly the same for identity thresholds between 50% and 90%. Why does it gradually increase in the 90% to 100% range? The runtime at 100% is almost three times that at 90%. Can someone tell me why this happens?

martin-steinegger commented 4 years ago

Yes, this is indeed counterintuitive behavior. The reason is that cascaded clustering has a big effect on the runtime. Linclust performs the following steps:

(1) kmermatcher <- assigns sequences to cluster centroids
(2) rescorediagonal <- runs a fast ungapped alignment between centroids and members
(3) clust <- clusters the sequences that already passed the alignment criteria and removes them from the remaining set
(4) rescorediagonal <- removes hits that have a low chance of fulfilling the alignment criteria
(5) align <- aligns the remaining hits with Gotoh-Smith-Waterman

The more sequences Linclust filters out at the early stage (3), the faster the algorithm runs, since the slowest part is the Gotoh-Smith-Waterman alignment.
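To make the effect of early filtering concrete, here is a toy cost model in Python. It is only a sketch: the function name and the two per-hit cost constants are hypothetical values chosen for illustration, not measurements of MMseqs2. It assumes every hit pays a cheap ungapped rescoring cost and only the hits not clustered early pay the expensive gapped alignment cost.

```python
def estimated_runtime(n_hits, early_fraction, cheap_cost=1.0, align_cost=100.0):
    """Toy runtime estimate in arbitrary units (all constants are assumptions).

    n_hits         -- number of candidate hits produced by k-mer matching
    early_fraction -- share of hits already resolved by the early clustering step
    cheap_cost     -- assumed cost of one ungapped rescoring
    align_cost     -- assumed cost of one gapped Gotoh-Smith-Waterman alignment
    """
    if not 0.0 <= early_fraction <= 1.0:
        raise ValueError("early_fraction must be between 0 and 1")
    # Hits that survive the early filter still need a gapped alignment.
    remaining = n_hits * (1.0 - early_fraction)
    return n_hits * cheap_cost + remaining * align_cost


if __name__ == "__main__":
    for fraction in (1.0, 0.5, 0.0):
        print(fraction, estimated_runtime(1000, fraction))
```

In this model the total cost is driven almost entirely by how many hits reach the gapped alignment, which is the point made above.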

manu-script commented 4 years ago

Hi, I have a query related to this thread about Linclust runtime at 99% sequence identity. I have been trying to use Linclust to cluster the entire BLAST nt database, which contains 60.6 million nucleotide sequences, with 1 TB of RAM and 40 threads using the following command:

mmseqs easy-linclust nt.gz db/nt tmpdir --min-seq-id 0.99 -c 0.8 --cov-mode 1 --cluster-mode 2 --threads 40

Going by the steps you explained above, the job has been at the second step of rescorediagonal for almost 3 days now, with only 4 points of progress printed: [====. What total runtime could I expect, and is there any way I could speed it up?

Thanks, Manu

martin-steinegger commented 4 years ago

Could you give me some more information about what you are trying to cluster? Rescorediagonal should be very fast, and 60.6 million is not a large number of sequences. Are the sequences very long?

manu-script commented 4 years ago

Here are the exact statistics of the sequences that I am trying to cluster to help understand what's going on.

Number of Nucleotide Sequences: 60,621,169
Sum of the Lengths of all Sequences: 326,476,863,573 bp
Length of the Shortest Sequence: 6 bp
Length of the Longest Sequence: 99,791,824 bp
Average Length of Sequences: 5,385 bp
Median Length of Sequences: 1,154 bp
25% of Sequences are below: 579 bp
75% of Sequences are below: 2,304 bp
N50 of Sequences: 2,879,031 bp

And here is the log file of mmseqs easy-linclust after 3000 CPU hours.

log.txt

What parameters of mmseqs easy-linclust would be best to cluster such a distribution of sequences at 99% identity?

Thanks for your time.

martin-steinegger commented 4 years ago

My assumption is that the long sequences dominate the runtime. Linclust was built for short sequences (< 100 kb) and is slow when you try to align genomes against each other.
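A rough way to see why a few long sequences can dominate: a full dynamic-programming alignment fills a matrix of query length times target length cells. The Python sketch below compares the median and longest sequence lengths reported earlier in this thread. It assumes an unrestricted matrix, so treat the numbers as an upper bound; the actual implementation may limit the search space, and `dp_cells` is a hypothetical helper name.

```python
def dp_cells(query_len, target_len):
    """Number of cells in a full (unbanded) dynamic-programming matrix."""
    return query_len * target_len


# Lengths taken from the statistics posted in this thread.
MEDIAN_LEN = 1_154
LONGEST_LEN = 99_791_824

if __name__ == "__main__":
    median_pair = dp_cells(MEDIAN_LEN, MEDIAN_LEN)
    longest_pair = dp_cells(LONGEST_LEN, LONGEST_LEN)
    print("cells for a median-length pair:", median_pair)
    print("cells for a longest-length pair:", longest_pair)
    print("ratio:", longest_pair / median_pair)
```

Because the cost grows with the product of the two lengths, one alignment between two very long sequences can outweigh billions of alignments between typical ones.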

One trick that might speed up the process is to use bidirectional coverage, --cov-mode 0. This coverage mode rejects all sequences that cannot fulfill the coverage criterion, which hopefully avoids most of the long-running alignments.
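The idea behind this suggestion can be sketched with sequence lengths alone. The aligned region can never be longer than the shorter sequence, so the best achievable coverage of the longer sequence is at most shorter length divided by longer length. The Python below is an illustration under that assumption, not MMseqs2's internal code; `max_coverage` and `can_pass` are hypothetical names, and the mode numbers follow the --cov-mode convention (0 = query and target, 1 = target only, 2 = query only).

```python
def max_coverage(query_len, target_len, cov_mode):
    """Best coverage a pair could reach, judged from lengths alone."""
    # The aligned region cannot exceed the shorter of the two sequences.
    aligned_max = min(query_len, target_len)
    query_cov = aligned_max / query_len
    target_cov = aligned_max / target_len
    if cov_mode == 0:  # bidirectional: both sequences must be covered
        return min(query_cov, target_cov)
    if cov_mode == 1:  # only the target must be covered
        return target_cov
    if cov_mode == 2:  # only the query must be covered
        return query_cov
    raise ValueError("this sketch only handles cov_mode 0, 1 and 2")


def can_pass(query_len, target_len, min_cov, cov_mode):
    """True if the pair could still satisfy the coverage threshold (-c)."""
    return max_coverage(query_len, target_len, cov_mode) >= min_cov


if __name__ == "__main__":
    # A 5 Mbp sequence against a 2 kbp sequence with -c 0.8.
    print(can_pass(5_000_000, 2_000, 0.8, cov_mode=1))  # target-only coverage
    print(can_pass(5_000_000, 2_000, 0.8, cov_mode=0))  # bidirectional coverage
```

With target-only coverage, a short target can always be fully covered by a genome-sized sequence, so the pair cannot be ruled out by length. With bidirectional coverage, the same pair can be discarded before any alignment is attempted.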

manu-script commented 4 years ago

Thank you for the quick response.