mmseqs cluster cannot correctly handle sequences from different strands

alexzrren commented 3 months ago

Expected Behavior

dataset.zip

I have a group of sequences that BLASTN aligns over almost their full length at >95% identity, so they should all fall into a single cluster when clustering with the --min-seq-id 0.8 parameter.

Current Behavior

My self-assembled sequences (IDs starting with SRS, known to be on the reverse strand) are not clustered together with the public sequences.

Steps to Reproduce (for bugs)

1. Cluster using `easy-linclust` or `easy-cluster` (not clustered correctly)

mmseqs easy-linclust --min-seq-id 0.8 --cov-mode 1 -c 0.8 orig_seqs.fasta 80ANI_linclust tmp_linclust
mmseqs easy-cluster --min-seq-id 0.8 --cov-mode 1 -c 0.8 orig_seqs.fasta 80ANI_cluster tmp_cluster

2. Check the result

The clustering result is shown below. As described under Current Behavior, the forward-strand and reverse-strand sequences are not placed in one cluster, although they are known to be very similar (high identity).

SRS4197851-k119_5014    SRS4197851-k119_5014
SRS4197851-k119_5014    SRS4197851-k119_5014
SRS4197851-k119_5014    SRS4197849-k141_10057
SRS4197851-k119_5014    SRS4197855-k141_655
SRS4197851-k119_5014    SRS4197862-k141_2526
SRS4197851-k119_5014    SRS4197863-k119_8432
SRS4197851-k119_5014    SRS4197864-k141_4618
SRS4197851-k119_5014    SRS4197865-k141_388
SRS4197851-k119_5014    SRS4197870-k141_8439
SRS4197851-k119_5014    SRS4197873-k141_1626
SRS4197851-k119_5014    SRS4197876-k119_3788
SRS4197851-k119_5014    SRS4197879-k119_11748
SRS4197851-k119_5014    SRS4197880-k119_434
SRS4197851-k119_5014    SRS4197978-k119_13091
SRS4197851-k119_5014    SRS4197981-k141_1571
SRS4197851-k119_5014    SRS4197986-k141_7862
SRS4197851-k119_5014    SRS11964275-k141_133602
MW853947.1      MW853947.1
MW853947.1      MW853947.1
MW853947.1      KU745627.1
MW853947.1      MK378157.1
MW853947.1      MK378172.1
MW853947.1      MK378179.1
MW853947.1      MK378188.1
MW853947.1      MK378206.1
MW853947.1      MK378218.1
MW853947.1      MK378220.1
MW853947.1      MK378224.1
MW853947.1      MN326163.1
MW853947.1      MN326174.1
MW853947.1      ON698674.1
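
A quick way to summarize a result like the one above is to count members per representative. This is a minimal sketch, assuming the usual two-column cluster TSV (representative, then member); `cluster_sizes` is a hypothetical helper name, not part of MMseqs2.

```shell
# Count members per cluster in a two-column TSV
# (column 1 = representative, column 2 = member).
# Reads the files given as arguments, or stdin if none are given.
cluster_sizes() {
    awk '{ n[$1]++ } END { for (r in n) printf "%s\t%d\n", r, n[r] }' "$@" | sort
}

# Example (assumes easy-cluster wrote <prefix>_cluster.tsv):
#   cluster_sizes 80ANI_cluster_cluster.tsv
```

Two output lines here mean two clusters; a single line would mean everything was grouped together.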

3. Manually convert the sequences to their reverse complement

Use the seqkit seq function to convert my self-assembled sequences to their reverse complement, leaving the public sequences unchanged.

cat <(seqkit grep -r -p 'SRS-*' orig_seqs.fasta | seqkit seq -r -p ) <(seqkit grep -v -r -p 'SRS-*' orig_seqs.fasta) > reversed_seqs.fasta
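
To illustrate what the reverse-complement step does to each sequence, here is a minimal sketch using only `rev` and `tr`. `revcomp` is a hypothetical helper; it assumes a single-line sequence containing only A, C, G, T (either case) and does not handle IUPAC ambiguity codes.

```shell
# Reverse-complement a plain nucleotide string (A/C/G/T only).
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

revcomp ATGCCG   # prints CGGCAT
```

Applying it twice returns the original sequence, which is a simple sanity check for any strand conversion.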

4. Cluster the manually processed sequences (easy-cluster shown as an example)

mmseqs easy-cluster --min-seq-id 0.8 --cov-mode 1 -c 0.8 reversed_seqs.fasta 80ANI_cluster_rev tmp_cluster_rev

Then I checked the cluster TSV: all of these sequences are now grouped into one cluster.

MW853947.1      MW853947.1
MW853947.1      SRS4197851-k119_5014
MW853947.1      SRS4197849-k141_10057
MW853947.1      SRS4197855-k141_655
MW853947.1      SRS4197862-k141_2526
MW853947.1      SRS4197863-k119_8432
MW853947.1      SRS4197864-k141_4618
MW853947.1      SRS4197865-k141_388
MW853947.1      SRS4197870-k141_8439
MW853947.1      SRS4197873-k141_1626
MW853947.1      SRS4197876-k119_3788
MW853947.1      SRS4197879-k119_11748
MW853947.1      SRS4197880-k119_434
MW853947.1      SRS4197978-k119_13091
MW853947.1      SRS4197981-k141_1571
MW853947.1      SRS4197986-k141_7862
MW853947.1      SRS11964275-k141_133602
MW853947.1      KU745627.1
MW853947.1      MK378157.1
MW853947.1      MK378172.1
MW853947.1      MK378179.1
MW853947.1      MK378188.1
MW853947.1      MK378206.1
MW853947.1      MK378218.1
MW853947.1      MK378220.1
MW853947.1      MK378224.1
MW853947.1      MN326163.1
MW853947.1      MN326174.1
MW853947.1      ON698674.1
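
To check a specific pair without reading the whole table, one can look up the representative of each ID. This is a minimal sketch under the same two-column TSV assumption; `same_cluster` is a hypothetical helper name.

```shell
# Report whether two sequence IDs share a representative.
# usage: same_cluster ID1 ID2 < clusters.tsv
same_cluster() {
    awk -v a="$1" -v b="$2" '
        $2 == a { ra = $1 }   # representative of the first ID
        $2 == b { rb = $1 }   # representative of the second ID
        END { print ((ra != "" && ra == rb) ? "same" : "different") }
    '
}

# Example (assumes easy-cluster wrote <prefix>_cluster.tsv):
#   same_cluster MW853947.1 SRS4197851-k119_5014 < 80ANI_cluster_rev_cluster.tsv
```

An ID that does not appear in the table is reported as "different".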

MMseqs Output (for bugs)

Please refer to Steps to Reproduce.

Context

NA

Your Environment

Which MMseqs version was used: 15.6f452 (conda installed)
Server specifications: Intel Processor with SSE4.2 AVX2 support w/ 64GB RAM
Operating system and version: CentOS Linux release 7.8.2003 (Core)

milot-mirdita commented 3 months ago

We are aware of this issue and are developing a fix.

You can work around this issue in the nucleotide search/clustering by disabling spaced k-mers with --spaced-kmer-mode 0.

alexzrren commented 3 months ago

This parameter does not fix the issue. I reran the clustering on the same dataset, and the result remains the same.

milot-mirdita commented 3 months ago

We ran the following command:

mmseqs easy-cluster orig_seqs.fasta 80ANI_cluster_nospace tmp --spaced-kmer-mode 0 --min-seq-id 0.8 --cov-mode 1 -c 0.8

And it looks fine. Could you please post the whole log of the new run?
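
One possible way to capture the whole log of a run, sketched with a hypothetical `run_logged` wrapper (the log file name is an assumption): it sends both stdout and stderr to the terminal and to a file.

```shell
# Run a command, showing its output and saving stdout+stderr to a log file.
# usage: run_logged LOGFILE COMMAND [ARGS...]
run_logged() {
    log=$1
    shift
    "$@" 2>&1 | tee "$log"
}

# Example:
#   run_logged cluster_nospace.log mmseqs easy-cluster orig_seqs.fasta 80ANI_cluster_nospace tmp --spaced-kmer-mode 0 --min-seq-id 0.8 --cov-mode 1 -c 0.8
```

The resulting log file can then be attached to the issue as is.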

soedinglab / MMseqs2