Less alignments in alignment tsv file than accessions in cluster tsv file

dnieuw commented 4 years ago

Expected Behavior

Using these lines I tried to create a tsv file containing all alignment details for all clustered sequences, including the singletons:

mmseqs cluster sequenceDB resultDb tmp
mmseqs createtsv sequenceDB sequenceDB resultDb resultDb.tsv
mmseqs align sequenceDB sequenceDB resultDb alignDB -a --add-self-matches
mmseqs convertalis sequenceDB sequenceDB alignDB alignDB.tsv

Current Behavior

Looking at the resultDb.tsv file and the alignDB.tsv files I see differences between the number of sequences in the former compared to the latter. The resultDb.tsv contains all parent-child combinations for all sequences that I used to cluster, which is what I expect. However, the alignDB.tsv file contains far less alignment details. At first I thought that the singletons were left out, but that doesn't seem to be the case.

Am I missing something? Or is this a bug?

Steps to Reproduce (for bugs)

mmseqs createdb /mmseqs/examples/DB.fasta DB
mmseqs cluster DB resultDB tmp
mmseqs createtsv DB DB resultDB resultDB.tsv
mmseqs align DB DB resultDB alignDB -a --add-self-matches
mmseqs convertalis DB DB alignDB alignDB.tsv
wc -l alignDB.tsv
> 19924
wc -l resultDB.tsv
> 20000

Environment

MMseqs2 Version: "c5ebe52978a9e944d73bad70cfb538c91309dd8e" Statcs install AVX2 for linux Server supports AVX2, 64 cpu's 500G memory Operating System: CentOS Linux 7 (Core) CPE OS Name: cpe:/o:centos:centos:7 Kernel: Linux 3.10.0-1062.4.1.el7.x86_64 Architecture: x86-64

milot-mirdita commented 4 years ago

If you specify a higher E-value threshold (e.g. align -e 10000) it will contain all 20000 results.

It's still a bit weird why some hits pass through the clustering that don't pass the align E-value cutoff. These seems to be only very short hits. We'll take a closer look at that.

Btw, if you want a set of stickers (see https://twitter.com/thesteinegger/status/1201076220957315074), send me your address to milot at mirdita de.

dnieuw commented 4 years ago

Thank you for your swift response!

Your suggestion indeed fixes the issue for the given example dataset. However, it does not fix the issue for my actual dataset.

I am attempting to cluster all genbank nucleotide sequences from the Picornaviridae viral family. The difference in that dataset is more pronounced, (120881 sequences clustered, 77276 aligned). The total number of sequences is indeed 120881 so all sequences end up in a cluster (or singleton cluster).

I noticed that the default alignment parameters are different for "mmseqs cluster" and "mmseqs align", so I have added those parameters to my commands:

mmseqs createdb Picornaviridae.fasta DB/Picornaviridae
mmseqs cluster DB/Picornaviridae Pico_clust tmp -a --add-self-matches -c 0.8 --alignment-mode 3
mmseqs createtsv DB/Picornaviridae DB/Picornaviridae Pico_clust Pico_clust.tsv
mmseqs align DB/Picornaviridae DB/Picornaviridae Pico_clust Pico_aln -a --add-self-matches -c 0.8 --alignment-mode 3 -e 10000
mmseqs convertalis DB/Picornaviridae DB/Picornaviridae Pico_aln Pico_aln.tsv

This did not fix the issue.

I have extracted all sequences that did not end up in the alignment tsv to check your hypothesis about the size of the sequences. This is the size distribution:

file           format  type  num_seqs     sum_len  min_len  avg_len  max_len
missing.fasta  FASTA   DNA     43,605  70,923,688       42  1,626.5    9,840

So the small size of the fragments does not seem to be the problem. Also, the longest sequence in the original database is 11,057, so there isn't a large discrepancy between the longest sequence in the DB and the largest sequence in the missing.fasta.

As you say, it's a bit weird, I'm quite puzzled by the result...

martin-steinegger commented 4 years ago

@dnieuw actually you encountered two separate issues.

(1) MMseqs2 performs a cascaded clustering (in default in 3 steps). This means that the centroid of a cluster can change with every iteration. So if an centroid changes then it is possible that some members that were already close to the e-value criteria do not fulfill the clustering criteria anymore. However, we do have an mode called --cluster-reassign. This mode removes sequences from the cascaded cluster result that do not fulfill the cluster criteria and reassigns them (if possible) to a different cluster. But this is only supported for protein sequences. This brings us already to issue (2).

(2) It is currently not possible to realign DNA sequences from a clustering. The problem is, we need an anchoring point to perform a banded nucleotide alignment. The clustering format does not contain an anchor. The reason why you actually can realign some of the sequences is because In default we assume diagonal 0.

The current behavior is not good. I will add an error if cluster results are used in the alignment. Unfortunately I do not have any solution for this problem at the moment. I will keep you posted once I come up with something.

dnieuw commented 4 years ago

Thanks @martin-steinegger for the clarification!

Regarding issue 1, would a solution be to not perform cascaded clustering using --single-step-clustering? I'm actually looking to perform hierarchical neighbour-joining clustering of the sequences using steps of decreasing similarity thresholds, so perhaps cascaded clustering is not necessary. Also, does this mean that the result from mmseqs createtsv in my current setup is incorrect for nucleotide sequences?

Regarding issue 2, what would be a (hack) solution to get the alignments? Creating a database from the cluster representatives and mapping the cluster members to that database manually using mmseqs align?

I followed the example you give in the FAQ, so please indicate that this does not lead to good results in case of nucleotide sequences.

martin-steinegger commented 4 years ago

@dnieuw not performing cascaded clustering will lead to a higher hard disk requirement since you need to keep the all against all results on disk. Also for the protein clustering it would slow it down significantly because you need to search with a high sensitivity all against all, while the cascaded increases sensitivity every iteration. The nucleotide clustering should not be affected by the sensitivity only the hard disk issue.

For issue 2, I currently do not have any direct solution. It might be possible to use the normal smith waterman, but it might be very slow. How long are you sequences?

dnieuw commented 4 years ago

The sequences are 10kb max

milot-mirdita commented 4 years ago

I pushed some code to make MMseqs2 use Smith-Waterman instead of the banded nucleotide alignment if it does not have diagonals (i.e. aligning clustering or alignment results instead of prefiltering results). Could you try the latest commit?

This should work pretty well for your use-case, will however probably explode for longer sequences.

milot-mirdita commented 4 years ago

Sorry, nevermind. This is more complicated than I thought :/

dnieuw commented 4 years ago

I'm already grateful for your attempt!

Let me try to simplify the problem. What I am actually looking for is not all perfectly correct alignment characteristics for a cluster, but rather an estimation of the similarity percentage between the cluster representative and the cluster member, and a start and stop position of the cluster members relative to the cluster representative. Just so I have an idea of how strong the similarity "edge" is and where the "query" fits on the "reference".

Would those simplifications make the problem tractable?

milot-mirdita commented 4 years ago

One more idea: The alignments that the clustering is based on still reside in the tmp folder. You can find them in tmp/latest/aln_step1..N. You can call convertalis on those dbs and look up the clustered pairs in them.

Each representative-member pair should exists uniquely (however I am not 100% sure about that). So you should be able to call convertalis on each steps, merge the output files into one and then join the two files with an awk script like this:

awk 'NR == FNR { f[$1$2]="1"; next; } $1$2 in f {print $0 }' clu.tsv concat_aln_res.m8 > clu_aln.m8

(This assumes that repr-member pairs are unique among all alignment steps, might have to do something more complicated, if that's not the case).

dnieuw commented 4 years ago

That is a great idea!

I was already wondering whether I could use the internally created alignments that the clustering is based on.

I've tried your approach on the 98% id clustering tmp directory. I'm not sure if it is normal, but my tmp/latest/ directory contains only a single "step" aln file: aln_step0.1...N. Using convertalis I was able to convert the file to a aln.tsv containing the pairs, which is great.

I'm not sure how to interpret this intermediate file though since I haven't tried understanding the guts of your pipeline ;). Here is an attempt describing it:

The aln.tsv file contains 96,181 pairs, out of 121,045 cluster pairs in the cluster.tsv output file.

96,111 pairs in the aln.tsv file are unique 35 are double, I checked a couple of entries and the alignment results are identical for these duplicates

The cluster.tsv contains 77,031 rep-member pairs, and 44,014 rep-rep singletons The aln.tsv contains 41,864 rep-member pairs, and 54,317 rep-rep singletons

I do not really understand the why, and how of the last merging step you describe with the cluster file since the number of rep-member pairs do not match up.

Thanks for thinking along!

soedinglab / MMseqs2