rcedgar / muscle

Multiple sequence and structure alignment with top benchmark scores scalable to thousands of sequences. Generates replicate alignments, enabling assessment of downstream analyses such as trees and predicted structures.
https://drive5.com/muscle
GNU General Public License v3.0
186 stars 21 forks source link

alignment error when short sequence #71

Closed orangeSi closed 5 months ago

orangeSi commented 6 months ago

Muscle v3 work well as below:

$echo -e ">q\nAC\n>t\nTTACGG" >test.fa && muscle -in test.fa 

MUSCLE v3.8.1551 by Robert C. Edgar

http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.

test 2 seqs, lengths min 2, max 6, avg 4
00:00:00     15 MB(2%)  Iter   1  100.00%  K-mer dist pass 1
00:00:00     15 MB(2%)  Iter   1  100.00%  K-mer dist pass 2
00:00:00     16 MB(2%)  Iter   1  100.00%  Align node       
00:00:00     16 MB(2%)  Iter   1  100.00%  Root alignment
>q
--AC--
>t
TTACGG

but Muscle v5 get error alignment as below:

$echo -e ">q\nAC\n>t\nTTACGG" >test.fa && muscle5.1.linux_intel64 -align test.fa -output test.fa.muscle_v5 && cat test.fa.muscle_v5

muscle 5.1.linux64 [12f0e2]  65.2Gb RAM, 8 cores
Built Jan 13 2022 23:17:13
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com

Input: 2 seqs, avg length 4, max 6

00:00 4.0Mb  CPU has 8 cores, running 8 threads
00:00 63Mb    100.0% Calc posteriors
00:00 63Mb    100.0% UPGMA5         
>q
----AC
>t
TTACGG

my linux system info is :

$cat /etc/os-release 
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.4 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
orangeSi commented 5 months ago

@rcedgar

rcedgar commented 5 months ago

I would not regard this as an error. All alignment algorithms generate results that seem anomalous compared to what you would do as a human. Your example here is explained by gap penalties. In MUSCLE as in many other algorithms a gap is penalized by a so-called affine score B + Le where B=gap open penalty, L=gap length and e=gap extension (i.e. length) penalty. B is larger than e, e.g. in BLASTP B=11, e=1. In your example, the "correct" alignment where the AC pair is aligned has lower score because you need to open two gaps, with total penalty 24 using the BLASTP values, this is a higher penalty than 3 from the mismatches AC=0 + CG=-3 (BLOSUM62 values). In MUSCLE the HMM similarly prefers one gap with the mismatches over opening two gaps to get the matches. With longer sequences, this kind of thing happens much less often because the accumulated high-scoring matches overcome the tendency to prefer one terminal gap. The result in v3 is "correct" because it handles terminal gaps differently. I'll bet it is possible to find a counter-example where v5 produces a better result than v3 when there is a long internal gap.

orangeSi commented 5 months ago

ok, thanks~ reason is the gap penalties. cound you explain the handles terminal gaps differently between muscle v3 and v5? because I find v5 result alaways have amino acid at the two terminal of protein sequence(start or end), like ----AC.

rcedgar commented 5 months ago

The algorithms are very different. Muscle v3 uses pretty conventional Needleman-Wunsch with terminal gaps penalized lower. Muscle v5 uses a pair HMM similar to PROBCONS. The pair HMM is not symmetrical between start and end states, hence the preference for opening the gap at the start. In some other programs, you see similar asymmetry effects in N-W because there can be ties in the traceback matrix when >1 alignment has the highest possible score, ties are usually broken using a systematic (and unsymmetrical) rule.

orangeSi commented 5 months ago

ok, thanks~