soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

What does `pairaln` do? #628

Open ZiyaoLi opened 2 years ago

ZiyaoLi commented 2 years ago

Expected Behavior

More docs about the pairaln module, which seems to be an important module for the MSA pairing of AlphaFold Multimer.

Current Behavior

I cannot find any description of what this module does.

Does it pair any given sequences? Or does it simply extract the species descriptions in UniRef30 (as needed for the MSA pairing in ColabFold)? If the latter, the name is a little confusing.

gieses commented 1 year ago

I was looking for the same information (coming from the ColabFold paper). I haven't tried running it yet, but I believe that, given your query DB (the sequences of a complex), the target DB (e.g. UniRef), and the alignments (sequences vs. UniRef), pairaln will create pairings that satisfy the conditions described in the paper.

The relevant paragraph from the paper:

MSA pairing for complex prediction. A paired MSA helps AlphaFold2 to predict complexes more accurately only if orthologous genes are paired with each other. We followed a similar strategy as Bryant et al. [22] to pair sequences according to their taxonomic identifier. For the pairing we search each distinct sequence of a complex against the UniRef100 using the same procedure as described in section 2.2.1. We return only hits that cover all complex proteins within one species and pair only the best hit (smallest E-value) with an alignment that covers the query to at least 50%. The pairing is implemented in the new MMseqs2 module pairaln.
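
To make the paragraph concrete, here is a minimal Python sketch of the pairing rule as I read it. It is not the actual pairaln implementation: the `Hit` record, the `pair_hits` function, and the example hits are all made up for illustration, and I am assuming the 50% coverage filter is applied before the best hit per species is chosen.

```python
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    """One alignment hit for a query chain (hypothetical record, not an MMseqs2 type)."""
    target_id: str
    taxid: int
    evalue: float
    query_cov: float  # fraction of the query covered by the alignment


def pair_hits(hits_per_chain: List[List[Hit]], min_query_cov: float = 0.5) -> Dict[int, List[Hit]]:
    """Return {taxid: [best hit for chain 0, best hit for chain 1, ...]}.

    A species is kept only if every chain has at least one hit in it
    that covers the query to at least min_query_cov.
    """
    best_per_chain: List[Dict[int, Hit]] = []
    for hits in hits_per_chain:
        best: Dict[int, Hit] = {}
        for hit in hits:
            # Assumption: coverage is checked before picking the best hit.
            if hit.query_cov < min_query_cov:
                continue
            current = best.get(hit.taxid)
            # Keep the hit with the smallest E-value per species.
            if current is None or hit.evalue < current.evalue:
                best[hit.taxid] = hit
        best_per_chain.append(best)

    if not best_per_chain:
        return {}

    # Species that are covered by every chain of the complex.
    shared = set(best_per_chain[0])
    for best in best_per_chain[1:]:
        shared &= set(best)

    return {taxid: [best[taxid] for best in best_per_chain] for taxid in sorted(shared)}


# Made-up example hits for a two-chain complex.
chain_a = [
    Hit("A1", 9606, 1e-50, 0.90),
    Hit("A2", 9606, 1e-10, 0.95),
    Hit("A3", 10090, 1e-30, 0.40),  # dropped: covers less than 50% of the query
    Hit("A4", 562, 1e-20, 0.80),
]
chain_b = [
    Hit("B1", 9606, 1e-40, 0.70),
    Hit("B2", 10090, 1e-35, 0.90),
    Hit("B3", 562, 1e-5, 0.60),
]
paired = pair_hits([chain_a, chain_b])
```

In this toy input, species 10090 is dropped because chain A has no hit with sufficient coverage there, so only species 562 and 9606 yield paired rows.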

ZiyaoLi commented 1 year ago

Thank you @gieses for the reply!

Just for context: when I opened this issue (months ago), I was trying to replace the MSA pairing workflow in AlphaFold. I got stuck when I had to deal with the taxonomy labels, which was particularly tricky because I wanted to use UniRef50 instead of UniRef30.

I finally chose to integrate a new pipeline that takes the monomer MSAs from MMseqs2 and pairs them with AlphaFold-Multimer's Python code. To link the searched MSAs with taxonomy labels, I extracted a map between taxonomy labels and UniRef IDs from UniRef50 myself.
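
For anyone who wants to do something similar, a rough sketch of such a mapping step is below. It is not my exact script: it assumes the standard UniRef FASTA header layout, where each header carries a `TaxID=` field, and the function name `taxid_map_from_headers` and the example headers are made up for illustration.

```python
import re
from typing import Dict, Iterable

# Assumed header layout:
# >UniRef50_<ID> <cluster name> n=<members> Tax=<taxon> TaxID=<number> RepID=<rep>
HEADER_RE = re.compile(r"^>(UniRef\d+_\S+).*?\bTaxID=(\d+)")


def taxid_map_from_headers(lines: Iterable[str]) -> Dict[str, int]:
    """Map UniRef cluster IDs to NCBI taxonomy IDs using FASTA header lines."""
    mapping: Dict[str, int] = {}
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, nothing to parse
        match = HEADER_RE.match(line)
        if match:  # headers without a numeric TaxID are skipped
            mapping[match.group(1)] = int(match.group(2))
    return mapping


# Made-up records, not real UniRef50 entries.
example_fasta = [
    ">UniRef50_P12345 Example protein n=3 Tax=Homo sapiens TaxID=9606 RepID=EXAMPLE_HUMAN",
    "MKTAYIAKQR",
    ">UniRef50_Q00001 Another protein n=1 Tax=Mus musculus TaxID=10090 RepID=EXAMPLE_MOUSE",
    "MSLLTEVETP",
    ">UniRef50_X99999 Header without taxonomy n=1",
    "MAAAA",
]
uniref_to_taxid = taxid_map_from_headers(example_fasta)
```

For the real file you would pass an open file handle instead of the list, e.g. iterate over a decompressed UniRef50 FASTA line by line, so the whole database never has to sit in memory.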