Is it reasonable to use LINCLUST with --Kmer-Per-Seq 2000?

actledge commented 5 months ago

Since cluster module needs too much memory. (I have 2 million nucleotide seqs, about 30G, and 1T memory, and segment fault occurred).

I try to use LINCLUST instead. But I also want a better performance of clustering. I try to increase the --kmer-per-seq, the number of clusters decreased until about --kmer-per-seq 2000 (My shortest sequences are 2000bp). I think this may indicate that the clustering performance has improved.

I compared results in a 3G test dataset, between "Linclust --kmer-per-seq 2000" and "Cluster"， the number of cluster produced by former is relatively closed to the latter.

But I still wonder if it make sense to set --kmer-per-seq 2000, since the default is only 20.

milot-mirdita commented 5 months ago

There is not that much downside to increasing the kmer-per-seq to 2000. It will slow down linclust somewhat, but for only 2M entries it shouldn't really matter.

You might want to use --kmer-per-seq-scale though. This parameter adds additional k-mers linearly based on the sequence length. By default, this is set to 0.2 for nucleotide sequences.

actledge commented 5 months ago

Does this mean that for a sequence of 2kbp, it would take 400 k-mers, and for a sequence of 20kbp, it would take 4000 k-mers if I set --kmer-per-seq-scale 0.2? My sequences have an average length of about 20kbp, so theoretically, --kmer-per-seq-scale 0.2 might yield results similar to or better than --kmer-per-seq 2000? However, when I tested --kmer-per-seq-scale with values of 0.2, 1, and 20, I found that the number of clusters did not significantly differ from running Linclust with default parameter. The number of clusters was roughly twice as many as when using --kmer-per-seq 2000.

As an example, I examined the largest cluster (containing 19,000 sequences) obtained using --kmer-per-seq 2000 . I inspected the length distribution of sequences within this cluster. The cumulative total sequences counts of the top four sequence lengths was 14,000, as follows [seq count, seq length]: 1274 4413.0, 2985 4411.0, 3397 5350.0, 7109 4409.0. I think most of these sequences should be highly similar, indicating that this clustering is relatively accurate. I took some individual examples from this cluster to examine the results of --kmer-per-seq-scale and found that some nearly identical sequences (same length, 99% identity) were not clustered together when using --kmer-per-seq-scale. Meanwhile the largest cluster of --kmer-per-seq-scale 1 contains only 6341 seqs. As a side note, I set the threshold to 95% identity and 85% coverage.

Overall, from my test results, I'm uncertain if there are issues with --kmer-per-seq-scale. Setting it doesn't seem to increase clustering sensitivity as expected. However, perhaps this is because I don't fully understand its principle, so I'm hoping to consult you on this matter.

There is not that much downside to increasing the kmer-per-seq to 2000. It will slow down linclust somewhat, but for only 2M entries it shouldn't really matter.

You might want to use --kmer-per-seq-scale though. This parameter adds additional k-mers linearly based on the sequence length. By default, this is set to 0.2 for nucleotide sequences.

milot-mirdita commented 5 months ago

This is surprising. For a 2k long sequence it should generate 420 k-mers (20 base + 0.2 * 2000).

Could you post the MMseqs2 terminal output of the different runs here? Maybe that would help us diagnose why --kmer-per-seq-scale underperforms.

actledge commented 5 months ago

These are outputs of each run: linclust_default.txt linclust_kmer2000_clustmode1.txt linclust_kmerscale0.2_clustmode1.txt linclust_kmerscale1_clustmode1.txt linclust_kmerscale20_clustmode1.txt

This is surprising. For a 2k long sequence it should generate 420 k-mers (20 base + 0.2 * 2000).

Could you post the MMseqs2 terminal output of the different runs here? Maybe that would help us diagnose why --kmer-per-seq-scale underperforms.

soedinglab / MMseqs2

Is it reasonable to use LINCLUST with --Kmer-Per-Seq 2000? #831