rcedgar / muscle

Multiple sequence and structure alignment with top benchmark scores scalable to thousands of sequences. Generates replicate alignments, enabling assessment of downstream analyses such as trees and predicted structures.
https://drive5.com/muscle
GNU General Public License v3.0

dispersion value meaning #36

Closed Adarsh931 closed 1 year ago

Adarsh931 commented 2 years ago

The dispersion calculation gives me two values, and I don't know which one is the value mentioned on the website (where <0.05 means the alignment is likely fine): @disperse file=ensemble_mafft.efa D_LP=0.005485 D_Cols=1

What is D_LP and D_Cols?

rcedgar commented 2 years ago

Yeah, sorry this is quite obscure -- should be better in the output and in the documentation. D_LP is dispersion, from memory I think D_Cols is average column confidence. So the MSAs in this ensemble have very low dispersion and therefore probably have very few errors -- assuming they are actually from muscle5 and not mafft :-)
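For anyone scripting around this output, here is a minimal sketch of reading the `@disperse` line quoted above. It assumes the line is a space-separated list of `key=value` tokens, as in the example in this thread; `parse_disperse_line` is a hypothetical helper, not part of muscle, and the 0.05 cutoff is only the rule of thumb quoted in the question.

```python
def parse_disperse_line(line):
    """Parse a '@disperse key=value ...' line into a dict (assumed layout)."""
    tokens = line.split()
    if not tokens or tokens[0] != "@disperse":
        raise ValueError("not a @disperse line")
    fields = {}
    for token in tokens[1:]:
        key, _, value = token.partition("=")
        fields[key] = value
    # Convert the two numeric fields; 'file' stays a string.
    for key in ("D_LP", "D_Cols"):
        if key in fields:
            fields[key] = float(fields[key])
    return fields


result = parse_disperse_line(
    "@disperse file=ensemble_mafft.efa D_LP=0.005485 D_Cols=1"
)
# D_LP is the dispersion; 0.05 is the threshold quoted by the asker,
# not a value confirmed anywhere in this thread.
is_low = result["D_LP"] < 0.05
print(result["file"], result["D_LP"], result["D_Cols"], is_low)
```

As the rest of the thread explains, a low D_LP is only informative when the ensemble was generated in a way that varies the alignments without reducing accuracy.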

Adarsh931 commented 2 years ago

Thanks for the answer. Why does it matter how the alignment was done? I mean, the concept should remain the same regardless of whether muscle or MAFFT was used. Sorry, I am a bit confused.

rcedgar commented 2 years ago

For dispersion of an ensemble to correlate with error in the alignment, you need a large set of MSAs which are known to have state-of-the-art accuracy on structural benchmarks and which vary as much as possible according to model parameters (gap penalties and substitution matrix) and guide tree, where the variations do not compromise average benchmark accuracy. AFAIK muscle5 is the only algorithm that can do this.

rcedgar commented 2 years ago

Correction -- "MSAs which are known to have state-of-the-art accuracy on structural benchmarks" is wrong, of course in practice we don't have structural benchmarks to compare with. What I mean is, the algorithm used to generate each MSA has high accuracy. The trick is to get many alternative alignments of the same sequences such that they are all equally plausible. If they vary, then this is necessarily due to errors and the number of errors in a typical alignment from the ensemble can therefore be estimated.
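To make the idea of "variation between equally plausible alignments" concrete, here is a toy sketch that measures how many aligned residue pairs two alignments of the same sequences disagree on. This is an illustration only: it is not claimed to be how muscle5 computes D_LP, and `aligned_pairs` and `pair_disagreement` are hypothetical names. Each alignment is assumed to be a dict mapping sequence label to its gapped row, with `-` as the gap character.

```python
from itertools import combinations


def aligned_pairs(msa):
    """Return the set of residue pairs placed in the same column.

    Each residue is identified as (label, ungapped position).
    """
    labels = sorted(msa)
    next_pos = {label: 0 for label in labels}
    pairs = set()
    ncols = len(next(iter(msa.values())))
    for col in range(ncols):
        present = []
        for label in labels:
            if msa[label][col] != "-":
                present.append((label, next_pos[label]))
                next_pos[label] += 1
        # Every two residues sharing this column form one aligned pair.
        pairs.update(combinations(present, 2))
    return pairs


def pair_disagreement(msa_a, msa_b):
    """Fraction of aligned pairs found in only one of the two alignments."""
    pa, pb = aligned_pairs(msa_a), aligned_pairs(msa_b)
    union = pa | pb
    return 1 - len(pa & pb) / len(union) if union else 0.0


msa1 = {"a": "AC-GT", "b": "ACTGT"}
msa2 = {"a": "ACG-T", "b": "ACTGT"}
print(pair_disagreement(msa1, msa1))  # identical alignments
print(pair_disagreement(msa1, msa2))  # one residue of 'a' placed differently
```

Averaging a quantity like this over many pairs of alignments gives a rough sense of how much an ensemble varies, which is the intuition behind using dispersion as an error estimate.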

Adarsh931 commented 2 years ago

This makes sense. So does it mean that if I generate multiple alignments using MAFFT (say, either by running it repeatedly or by using different parameters, like changing the number of iterations in MAFFT), I can use the dispersion method in muscle to estimate errors in the MAFFT alignments, and if the dispersion is not too high, then I can just use one of the MSAs produced by MAFFT? Am I thinking about this correctly?

rcedgar commented 2 years ago

Wrong, sorry. First, muscle5 is more accurate than MAFFT on average. Second, you don't know how to vary MAFFT parameters in the right way -- you need to maximize the parameter changes to vary the alignment without degrading average accuracy on benchmarks. This is very hard to figure out. Even if you do figure it out, you are only varying a small number of parameters such as gap penalties and the number of iterations. This is not sufficient to get enough variation in the ensemble. If you see less variation in MAFFT, this may be because you are varying far fewer parameters. In muscle5, roughly 200 parameters are varied, including all substitution matrix values and gap penalties, plus the guide tree. It would be a major research project to figure out how to vary a comparable number of parameters in MAFFT, and even if you succeeded, MAFFT with its defaults is less accurate than muscle5, so the a priori assumption is that the default muscle5 alignment is better than anything in your MAFFT ensemble.

greg-harhay commented 2 years ago

Hi Robert, I am requesting clarification about the dispersion metrics. I have aligned the spike gene from 192 coronavirus genomes with muscle5 using the -align and -diversified options to create an ensemble of 100 alignments. Using the -disperse option to measure the dispersion in the ensemble yields D_LP=5.066e-06 D_Cols=0.0002444. I tried to dig through the code, but I'm not a C coder and need a little help interpreting these results. Since these are apparently measures of dispersion, I presume low values are desirable for both numbers, preferably 0, but maybe small numbers are OK. Any suggested thresholds? Any advice about how I could go about breathing some biology into these numbers? Thanks.

rcedgar commented 2 years ago

As noted at the start of this issue, I need to do a better job with the output and documentation here. To answer your question, you breathe biology into this exercise by using the MSA for something. Alignments are a means to an end, what is the end here? Let's say you want to measure the squrgle coefficient of the spike ACE binding domain. Then you do this: calculate the squrgle coefficient S from every MSA and this gives you the mean and standard deviation of S. This tells you the uncertainty in S due to alignment errors.
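The recipe above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: it assumes the ensemble is an EFA file in which each alignment starts with a `<name` line followed by ordinary FASTA records, and `gap_free_fraction` is a hypothetical stand-in for whatever statistic S you actually care about. `read_efa` and `ensemble_mean_sd` are likewise illustrative names, not part of muscle.

```python
import io
import statistics


def read_efa(handle):
    """Yield (msa_name, {label: aligned_row}) for each alignment.

    Assumes each alignment begins with a '<name' line followed by FASTA.
    """
    name, msa, label = None, {}, None
    for raw in handle:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("<"):
            if name is not None:
                yield name, msa
            name, msa, label = line[1:], {}, None
        elif line.startswith(">"):
            label = line[1:]
            msa[label] = ""
        elif label is not None:
            msa[label] += line
    if name is not None:
        yield name, msa


def gap_free_fraction(msa):
    """Placeholder statistic S: fraction of columns with no gaps."""
    columns = list(zip(*msa.values()))
    return sum("-" not in col for col in columns) / len(columns)


def ensemble_mean_sd(handle, statistic):
    """Compute the statistic on every MSA; return (mean, sample std dev)."""
    values = [statistic(msa) for _, msa in read_efa(handle)]
    return statistics.mean(values), statistics.stdev(values)


# Toy three-alignment ensemble standing in for a real .efa file.
toy = io.StringIO(
    "<msa1\n>a\nAC-GT\n>b\nACTGT\n"
    "<msa2\n>a\nACG-T\n>b\nACTGT\n"
    "<msa3\n>a\nA--CGT\n>b\nACTG-T\n"
)
mean_s, sd_s = ensemble_mean_sd(toy, gap_free_fraction)
print(f"S = {mean_s:.3f} +/- {sd_s:.3f}")
```

With a real ensemble you would pass an open file instead of the toy `StringIO` and swap in your own statistic; the standard deviation is then the uncertainty in S attributable to alignment variation.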

greg-harhay commented 2 years ago

I'm not familiar with the "squrgle coefficient S". I couldn't find a definition online. Could you provide a definition or links to a definition? Thanks.

rcedgar commented 2 years ago

:-)) It means whatever you want it to mean -- it was a nonsense word serving as a placeholder for whatever it is you want to infer from an alignment.

greg-harhay commented 2 years ago

Thanks for the clarification.

Adarsh931 commented 1 year ago

Thank you so much for the detailed explanation. It makes sense now.
