soedinglab / hh-suite

Remote protein homology detection suite.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3019-7
GNU General Public License v3.0
515 stars 128 forks source link

Command line to align two MSAs #256

Open rcedgar opened 3 years ago

rcedgar commented 3 years ago

I'm trying to make a global profile-profile alignment of two MSAs in FASTA format using hhalign (hhsuite-3.3.0-SSE2-Linux), i.e. preserving the columns within each MSA and aligning pairs of columns to each other. Input is conventional aligned FASTA with "-" for gap (all sequences same length, all letters upper case, not a2m or a variant). I can't figure out a command line to do this. I'm trying to reproduce a workflow from a 2018 paper where they give an hhalign command line which does not work, possibly because it's an older version of hhalign (they do not state their version). This is a similar command line that gives no errors or warnings:

hhalign -i aln1.fa -t aln2.fa -glob -mact 0 -id 100 -seq 999999 -M 10 -v 0 \   
  -o outname.ali.out \   
  -Ofas outname.ali.fa \   
  -atab outname.ali.tab

However, the alignments in outname.ali.fa are all short local alignments, not global. I found the relevant documentation rather obscure, and trial and error did not succeed -- help will be much appreciated.

milot-mirdita commented 3 years ago

Could you drop the -v 0 to see what hhalign might be complaining about? It would also be very helpful if you could upload the two alignment files, so we can try to reproduce the issue.

rcedgar commented 3 years ago

No errors or warnings AFAICS when -v 0 is removed. I posted files with a runme.bash here:

https://drive5.com/tmp/for_milot.tar.gz

martin-steinegger commented 3 years ago

@rcedgar it looks like your MSAs do not have enough information to produce a good alignment and therefore the mact does not produce a global MSA. This is really not optimal; we will look into it how to improve it.

However, for your case we can increase the alignment quality by just increasing the diversity of the MSA. I searched both MSAs against Uniclust30 (uniclust.mmseqs.com) and the BFD (bfd.mmseqs.com) database to increase the Neff from 2 to 10. By using this diverse MSA it was possible to improve the alignment and e-value of the pair-wise alignment. -glob still causes issues however reducing the -mact to 0.05 resulted in a nearly global alignment (Q: 24-574 T: 2-509). I hope this helps.

Commands used:

hhblits -i one.afa -d uniref30_2020_6 -oa3m one.uniclust.a3m -n 3 -cpu 64  -e 1e-3 -n 3 -p 20 -Z 10000 -z 1 -b 1 -B 10000
hhblits -i two.afa -d uniref30_2020_6 -oa3m two.uniclust.a3m -n 3 -cpu 64  -e 1e-3 -n 3 -p 20 -Z 10000 -z 1 -b 1 -B 10000

hhblits -i one.uniclust.a3m -d bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt -oa3m one.bfd.a3m -n 3 -cpu 64
hhblits -i two.uniclust.a3m -d bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt -oa3m two.bfd.a3m -n 3 -cpu 64

hhalign -i one.bfd.a3m -t two.bfd.a3m -o one.two.hhr -mact 0.05

HHR output:

Query         A3M#
Match_columns 589
No_of_seqs    181 out of 1087
Neff          8.35449
Searched_HMMs 1
Date          Fri Apr  9 08:44:27 2021
Command       hh-suite/build/src/hhalign -i one.bfd.a3m -t two.bfd.a3m -o one.two.hhr -mact 0.05 

 No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
  1 A3M#                            92.8   4E-07   4E-07   62.3   0.0  446   24-574     2-509 (529)

No 1
>A3M#
Probab=92.79  E-value=4e-07  Score=62.33  Aligned_cols=446  Identities=11%  Similarity=-0.002  Sum_probs=136.5  Template_Neff=12.182

Q YP_009337012.1   24 QWFVDIN---HPEEFGVLDPHIRNNT---WPPPAVIDAFGPRWHLLPLTQCFDIPELSDPSLIYS-------DKSYSMNL   90 (589)
Q Consensus        24 ~wP~~~~---~~~~~~~l~~~i~~~~---~p~~~~~~~~~~dW~~v~~~k~~~~~~~~d~~~~l~-------DKais~~~   90 (589)
                      +-|....   ....+....+++....   .|.........-+...-.+.+.+. ....++..+=.       |+..|+..
T Consensus         2 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~-~~~~~~~~~t~~e~~~~~~~~~s~G~   80 (529)
T YP_009126871.1    2 QLERVVSMSMTDTTAFGQQRVFKEKVDTKAPEPPKEVRRVMRLVFTWLVKRIL-AKGGKVRLCTKEEFINKIESHAAIGA   80 (529)
Confidence            0000000   0000111111111110   000000000000000000000000 00000000000       11111111

Q YP_009337012.1   91 SEIVNHLNSDPYRRLESRKVLDALIKLPP--TNFKELLREIDEHGLEKDHLAIGLSAKEREL------KIIGRFFALMTW  162 (589)
Q Consensus        91 ~~~~~~~~~~~~~~~~~rrll~~~L~~~~--~~~~~~l~~i~~~~l~~~~~ii~l~~KErEl------K~~~R~F~~~t~  162 (589)
                      .     +.  .+.....+   ..++..+.  -.+.+.++.+.+|.  ....++....|.+-.      ..++|+|..+..
T Consensus        81 ~-----~~--~~~~~~~~---~~~~~~~~~~~~~~~~~~~~~~~~--~~~~~~~~~~K~E~~~~~~~~~~~~R~i~~~~~  148 (529)
T YP_009126871.1   81 W-----SK--EMESWTSA---REAVNDPMFWNLVSRERELHKKGK--CEMCVYNLMGKREKKPGEYGVAKGSRTIWYMWL  148 (529)
Confidence            0     00  00000111   11111111  12222333332221  112334444443221      237999999887

Q YP_009337012.1  163 KLREYFVFTEYLIKTH-FVP--LFSGLTMSDDHSTLVKKLIECTSGQGIDGCDSIGISNHIDYRKWNNMQRDKATKPTFT  239 (589)
Q Consensus       163 ~~R~~~~~~E~~ia~~-i~~--~~p~~tMt~s~~~l~k~l~~~~~~~~~~~~~~~~~~~~~D~skWn~~~R~e~~~~~~~  239 (589)
                      ..+......-..+.+. +..  ..+.....++..+..+.+..+...        ...+++.|+++|..++..+.....+.
T Consensus       149 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~g~~~~~~~~~~~~~~~~--------~~~~~~~D~s~~D~~~~~~~~~~~~~  220 (529)
T YP_009126871.1  149 GSRFLEFEAFGFLNEEHWASRKLSGGGVEGVPLAYLGYLLSEMADK--------PGVLYADDTAGWDTRITEADLEDERT  220 (529)
Confidence            7777543333333322 111  123222333333344444443321        13467999999999999998877666

Q YP_009337012.1  240 VMDKFFGYKRVISRTHEFFENSLIYYKARPDWLTV-SQGEVTNRSPEHLGCWQGQAGGLEGLRQKGWSILSLLMI-EEQA  317 (589)
Q Consensus       240 ~ld~lfG~~~~f~~th~~f~~~~i~~~~~~~~~~~-~~~~~~~~~~~~~~~w~g~~GG~EGl~Qk~WTl~t~~~i-~~~~  317 (589)
                      ++.++  .+........++...  +.......|.. ..|...    ...+.-.|..+--+....-+-|++..+++ ..++
T Consensus       221 ~~~~~--~~~~~~~~~~~~~~~--~~~~~~~~~~~~~~G~~~----~~~~~~~~~~~SG~~~T~~~Nti~n~~~~~~~~~  292 (529)
T YP_009126871.1  221 LLDYM--TPEHRLLAQPLFDLT--YMNKVALCPRPYKTGGVV----IDVISRRDQRGSGQVVTYALNTLTNIKVQLIRMA  292 (529)
Confidence            66665  222222222211110  00000000000 001000    00111112222222333444455555442 2333

Q YP_009337012.1  318 DFSNVN---------------------IKVLAQGDNQVICCNYTPRYSRTEEELVNNLRDIVNNNDRLISSVRIGIEKLG  376 (589)
Q Consensus       318 ~~~~~~---------------------~~l~gqGDNqvi~v~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~l~~~~~~~G  376 (589)
                      ...+..                     +.++..|||.++.+.-+        .           ..    .+.+.++++|
T Consensus       293 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~GDD~~~~~~~~--------~-----------~~----~~~~~~~~~G  349 (529)
T YP_009126871.1  293 ESEGVLTSELQDNGLRGWLEMHGEDRLTRLLVSGDDCVVNAMDE--------R-----------FS----NALTWLNLMA  349 (529)
Confidence            333332                     37889999999887521        0           11    1234667899

Q YP_009337012.1  377 LVINE------DETMQSADFTNYGKVVIFR-----G----HICCLEEKRYSRVTCSANDQLLTIGSTLSAVSTNCLTVAH  441 (589)
Q Consensus       377 l~lK~------eET~~S~~~~~Y~K~~y~~-----G----~~l~~~lK~~~R~~~~~n~~~psl~~~issi~s~~~sa~~  441 (589)
                      +++|.      .+.+.+-.-..|.++.++.     |    .+....-+.++++.-..+.. .+....             
T Consensus       350 ~~~~~~~~~~~~~~~~~~~~~~Fl~~~~~~~~~~~g~~~~~~~~~~~~~~~~~~~~~~~~-~~~~~~-------------  415 (529)
T YP_009126871.1  350 KVRKDVGQWEPSRGRDDWEEVEFCSNHFHRLTMKDGRELIVPCRDQTELVARACVNQGGS-ADPRAT-------------  415 (529)
Confidence            99994      5677777788888888874     5    22223333333332221111 000000             

Q YP_009337012.1  442 FSKSPIRAMIQYNWLANFVRRVLELYHPGLCSTLKALVSKHTVNPLESVFYKIAYMSLDPSLGGVGGMSYARFHVRAFPD  521 (589)
Q Consensus       442 ~~~~~~~~~~~~~~~~~~~~~~~~~~~p~~~~~~~~~~~~~~~~~~~~~~~l~~ll~~P~~lGG~~~~~~~~~~~Rg~~D  521 (589)
                            ..++.+...+.+..   ...+|..+.-....+..       ..     --..|..-+.+++..-..+..     
T Consensus       416 ------~~~~~~~~~~~~~~---~~~~p~~~~~~~~~~~~-------~~-----~~~~~~~~~~~~~~~~~~~~~-----  469 (529)
T YP_009126871.1  416 ------GCLAKSYAQMWQLL---YFHRRDLRMMSLAIMSA-------VP-----VDWVPTGRTTWSVHAGKEWMT-----  469 (529)
Confidence                  00000000000000   00111110000000000       00     000011001111000000100     

Q YP_009337012.1  522 AVTESLSFWKVIADHTTSEQLRAFAIEMGHPRLLRYEDSHFIKLIEDPSALNI  574 (589)
Q Consensus       522 plt~~l~~l~~~~~~~~~~~l~~~~~~~~~~~~~~~~~~~~~~Ll~DP~sln~  574 (589)
                       -...+..+..++.... +.+..           ...-.+|..|---|.....
T Consensus       470 -~~~~~~~~~~~~~~~~-~~~~~-----------~~~~~~~~~~~~~~~~~~~  509 (529)
T YP_009126871.1  470 -DEDMLEVWNRIWIRDN-PWMDR-----------KDEIDQWSNIPYLPRKVDK  509 (529)
Confidence             0111122222211100 00000           0000011111100000000
rcedgar commented 3 years ago

Great feedback, thanks! Yes, the sequences are highly diverged; I'm attempting to align many pairs of MSAs where the sequences are diverged to something like 5 substitutions per site. However, I have reasons to believe that most these sequences are globally homologous, and the 2018 paper reports a combined global alignment constructed with the help of hhalign without adding more diversity, at least per the incomplete description in their Methods, which I'm trying to reproduce. In that paper, the authors reduced redundancy to 90% identity; they already had more sequences but chose to reduce the number for reasons they do not explain. There are conserved catalytic motifs, and I have verified that these are correctly aligned in most of the sequences. Other methods I've tried fail to align the motifs. So to the best of my knowledge at this point, hhalign is the best/only method for constructing a combined global alignment from the many family-level MSAs. It would be great if you are interested in collaborating on this alignment problem -- by all means contact me directly by email robert@drive5,com.

martin-steinegger commented 3 years ago

@rcedgar from the alignment above it looks like your assumption about the global alignment could be right. We could probably improve the alignment when using more viral sequences to enrich the MSAs. I am interested in helping! I sent you an e-mail (themartinsteingger@gmail.com ).