jkjium / contactGroups


ProtSub for MAFFT and setting penalties #1

Closed 000generic closed 3 years ago

000generic commented 3 years ago

Hi!

We are working with hundreds of highly divergent proteins in the TRP superfamily (~10-30% sequence identity per family; 9 families total). In researching how to further optimize our alignments in MAFFT - or switch to other tools - I came across your 2021 ProtSub paper. I love the approach and the ProtSub matrix seems useful! I'd like to put it to use, if I can.

MAFFT now allows user-defined matrices, but only in the format described here:

https://mafft.cbrc.jp/alignment/software/textcomparison.html#userdefinedmatrix

Do you happen to have a version of the ProtSub matrix that would work in MAFFT - or is there a tool for converting from EMBOSS format to MAFFT format that you could recommend?

I was wondering - in your paper you make comparisons to BLOSUM62 but not to BLOSUM45 or BLOSUM30. Is BLOSUM62 generally preferred out of the BLOSUM series for twilight-zone proteins? I've long been confused about what to use, as people often use BLOSUM62 for alignments/MSA generation despite sequence identities below 30%. Or is it more that tools like MAFFT are optimized for BLOSUM62 by default, and working out gap opening and extension penalties for BLOSUM45 and BLOSUM30 has not been done comprehensively (at least in the literature)? It's hard to guess - and a lot of work to optimize.

Along these lines, in using the ProtSub matrix in software like MAFFT - do you have any recommendations for setting penalties? Is there a dependency between the matrix and the penalties? Or is there multiple sequence alignment software you recommend over MAFFT for ProtSub? We have reasonably good machines with 48 or 64 CPUs and 500 or 1000 GB RAM to run things locally.

jkjium commented 3 years ago

Hi Eric, thank you for your interest in ProtSub. I've uploaded a folder "sm2mafft" to the contactGroups repository. It includes a simple Python script that converts an EMBOSS matrix to a MAFFT matrix, and a readme file that describes how to use the script. The script only requires numpy. I hope it can help you.
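For readers who want a feel for what the first half of such a conversion involves, here is a minimal sketch of reading an EMBOSS/NCBI-style matrix file with numpy. It is not the sm2mafft script itself: the function names `parse_emboss_matrix` and `iter_pairs` and the toy 3-letter matrix are hypothetical, and the scores are made up. Writing the pairs out in MAFFT's user-defined format is deliberately left out; the exact line layout should be taken from the MAFFT page linked above.

```python
import numpy as np


def parse_emboss_matrix(text):
    """Parse an EMBOSS/NCBI-style matrix into (symbols, scores).

    Assumed layout: '#' comment lines, one header row of column symbols,
    then one row per symbol starting with that symbol.
    """
    symbols = None
    labels, rows = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if symbols is None:
            symbols = fields  # first data line is the column header
            continue
        if len(fields) != len(symbols) + 1:
            raise ValueError(f"unexpected number of columns in row: {line!r}")
        labels.append(fields[0])
        rows.append([float(x) for x in fields[1:]])
    if symbols is None:
        raise ValueError("no header row found")
    if labels != symbols:
        raise ValueError("row labels do not match column labels")
    return symbols, np.array(rows)


def iter_pairs(symbols, scores):
    """Yield (row_symbol, col_symbol, score) for every cell of the matrix."""
    for i, a in enumerate(symbols):
        for j, b in enumerate(symbols):
            yield a, b, float(scores[i, j])


# Toy input; the values are illustrative only, not ProtSub scores.
TOY = """\
# toy matrix
   A  R  N
A  4 -1 -2
R -1  5  0
N -2  0  6
"""

symbols, scores = parse_emboss_matrix(TOY)
pairs = list(iter_pairs(symbols, scores))
```

A symmetry check such as `np.allclose(scores, scores.T)` is a cheap sanity test before converting a matrix for use in another tool.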

The choice of matrices for comparing with ProtSub was based on another comparative study. We used the two top-ranked matrices (BLOSUM62 and VTML200). I agree with you that it is hard to know which matrix works the best. In my opinion, there is no absolute "best" matrix. Different matrices may perform well under different circumstances. Finding good criteria to evaluate the result is the most important.

The output alignment is heavily affected by the gap penalties. In my experience, a relatively weak penalty works better for sequences with low identities. In our study, the best gap penalties for ProtSub were 6 for opening and 2 for extension.
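To make the arithmetic concrete, here is a small sketch of how an affine gap penalty grows with gap length using those two numbers. The function name `affine_gap_penalty` is hypothetical, and the thread does not say which convention the 6/2 values follow, so both common ones are shown: the opening penalty either covers the first gapped position or is charged on top of a per-position extension. These values are also on the scale of the matrix they were tuned with; MAFFT's `--op` and `--ep` options use their own scale, so the numbers may not transfer directly.

```python
def affine_gap_penalty(length, gap_open=6.0, gap_extend=2.0,
                       open_covers_first=True):
    """Total penalty for a single gap of `length` positions.

    open_covers_first=True:  gap_open + gap_extend * (length - 1)
    open_covers_first=False: gap_open + gap_extend * length
    Which convention applies depends on the alignment tool (assumption here).
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0.0  # no gap, no penalty
    if open_covers_first:
        return gap_open + gap_extend * (length - 1)
    return gap_open + gap_extend * length


# Penalties for gaps of 1..5 positions under each convention.
first_covered = [affine_gap_penalty(k) for k in range(1, 6)]
charged_on_top = [affine_gap_penalty(k, open_covers_first=False)
                  for k in range(1, 6)]
```

Under either convention, one long gap costs less than several short gaps of the same total length, which is why a weaker opening penalty lets divergent sequences accommodate more separate indels.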

As far as I know, BLAST is also optimized for the BLOSUM series of matrices. Other than MAFFT, I have used MUSCLE to generate MSAs. It supports customized matrices in EMBOSS format.

Feel free to contact me here or send me emails :)

000generic commented 3 years ago

Thank you for the script! We've put it to use - and are using ProtSub - still evaluating in MAFFT for now - but initial IQ-TREE trees look good in branch support. Thanks again :)