Open rcedgar opened 3 years ago
Could you drop the -v 0
to see what hhalign
might be complaining about? It would also be very helpful if you could upload the two alignment files, so we can try to reproduce the issue.
No errors or warnings AFAICS when -v 0
is removed. I posted files with a runme.bash
here:
@rcedgar it looks like your MSAs do not have enough information to produce a good alignment and therefore the mact does not produce a global MSA. This is really not optimal; we will look into it how to improve it.
However, for your case we can increase the alignment quality by just increasing the diversity of the MSA. I searched both MSAs against Uniclust30 (uniclust.mmseqs.com) and the BFD (bfd.mmseqs.com) database to increase the Neff from 2 to 10. By using this diverse MSA it was possible to improve the alignment and e-value of the pair-wise alignment. -glob
still causes issues however reducing the -mact
to 0.05 resulted in a nearly global alignment (Q: 24-574 T: 2-509). I hope this helps.
Commands used:
hhblits -i one.afa -d uniref30_2020_6 -oa3m one.uniclust.a3m -n 3 -cpu 64 -e 1e-3 -n 3 -p 20 -Z 10000 -z 1 -b 1 -B 10000
hhblits -i two.afa -d uniref30_2020_6 -oa3m two.uniclust.a3m -n 3 -cpu 64 -e 1e-3 -n 3 -p 20 -Z 10000 -z 1 -b 1 -B 10000
hhblits -i one.uniclust.a3m -d bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt -oa3m one.bfd.a3m -n 3 -cpu 64
hhblits -i two.uniclust.a3m -d bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt -oa3m two.bfd.a3m -n 3 -cpu 64
hhalign -i one.bfd.a3m -t two.bfd.a3m -o one.two.hhr -mact 0.05
HHR output:
Query A3M#
Match_columns 589
No_of_seqs 181 out of 1087
Neff 8.35449
Searched_HMMs 1
Date Fri Apr 9 08:44:27 2021
Command hh-suite/build/src/hhalign -i one.bfd.a3m -t two.bfd.a3m -o one.two.hhr -mact 0.05
No Hit Prob E-value P-value Score SS Cols Query HMM Template HMM
1 A3M# 92.8 4E-07 4E-07 62.3 0.0 446 24-574 2-509 (529)
No 1
>A3M#
Probab=92.79 E-value=4e-07 Score=62.33 Aligned_cols=446 Identities=11% Similarity=-0.002 Sum_probs=136.5 Template_Neff=12.182
Q YP_009337012.1 24 QWFVDIN---HPEEFGVLDPHIRNNT---WPPPAVIDAFGPRWHLLPLTQCFDIPELSDPSLIYS-------DKSYSMNL 90 (589)
Q Consensus 24 ~wP~~~~---~~~~~~~l~~~i~~~~---~p~~~~~~~~~~dW~~v~~~k~~~~~~~~d~~~~l~-------DKais~~~ 90 (589)
+-|.... ....+....+++.... .|.........-+...-.+.+.+. ....++..+=. |+..|+..
T Consensus 2 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~-~~~~~~~~~t~~e~~~~~~~~~s~G~ 80 (529)
T YP_009126871.1 2 QLERVVSMSMTDTTAFGQQRVFKEKVDTKAPEPPKEVRRVMRLVFTWLVKRIL-AKGGKVRLCTKEEFINKIESHAAIGA 80 (529)
Confidence 0000000 0000111111111110 000000000000000000000000 00000000000 11111111
Q YP_009337012.1 91 SEIVNHLNSDPYRRLESRKVLDALIKLPP--TNFKELLREIDEHGLEKDHLAIGLSAKEREL------KIIGRFFALMTW 162 (589)
Q Consensus 91 ~~~~~~~~~~~~~~~~~rrll~~~L~~~~--~~~~~~l~~i~~~~l~~~~~ii~l~~KErEl------K~~~R~F~~~t~ 162 (589)
. +. .+.....+ ..++..+. -.+.+.++.+.+|. ....++....|.+-. ..++|+|..+..
T Consensus 81 ~-----~~--~~~~~~~~---~~~~~~~~~~~~~~~~~~~~~~~~--~~~~~~~~~~K~E~~~~~~~~~~~~R~i~~~~~ 148 (529)
T YP_009126871.1 81 W-----SK--EMESWTSA---REAVNDPMFWNLVSRERELHKKGK--CEMCVYNLMGKREKKPGEYGVAKGSRTIWYMWL 148 (529)
Confidence 0 00 00000111 11111111 12222333332221 112334444443221 237999999887
Q YP_009337012.1 163 KLREYFVFTEYLIKTH-FVP--LFSGLTMSDDHSTLVKKLIECTSGQGIDGCDSIGISNHIDYRKWNNMQRDKATKPTFT 239 (589)
Q Consensus 163 ~~R~~~~~~E~~ia~~-i~~--~~p~~tMt~s~~~l~k~l~~~~~~~~~~~~~~~~~~~~~D~skWn~~~R~e~~~~~~~ 239 (589)
..+......-..+.+. +.. ..+.....++..+..+.+..+... ...+++.|+++|..++..+.....+.
T Consensus 149 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~g~~~~~~~~~~~~~~~~--------~~~~~~~D~s~~D~~~~~~~~~~~~~ 220 (529)
T YP_009126871.1 149 GSRFLEFEAFGFLNEEHWASRKLSGGGVEGVPLAYLGYLLSEMADK--------PGVLYADDTAGWDTRITEADLEDERT 220 (529)
Confidence 7777543333333322 111 123222333333344444443321 13467999999999999998877666
Q YP_009337012.1 240 VMDKFFGYKRVISRTHEFFENSLIYYKARPDWLTV-SQGEVTNRSPEHLGCWQGQAGGLEGLRQKGWSILSLLMI-EEQA 317 (589)
Q Consensus 240 ~ld~lfG~~~~f~~th~~f~~~~i~~~~~~~~~~~-~~~~~~~~~~~~~~~w~g~~GG~EGl~Qk~WTl~t~~~i-~~~~ 317 (589)
++.++ .+........++... +.......|.. ..|... ...+.-.|..+--+....-+-|++..+++ ..++
T Consensus 221 ~~~~~--~~~~~~~~~~~~~~~--~~~~~~~~~~~~~~G~~~----~~~~~~~~~~~SG~~~T~~~Nti~n~~~~~~~~~ 292 (529)
T YP_009126871.1 221 LLDYM--TPEHRLLAQPLFDLT--YMNKVALCPRPYKTGGVV----IDVISRRDQRGSGQVVTYALNTLTNIKVQLIRMA 292 (529)
Confidence 66665 222222222211110 00000000000 001000 00111112222222333444455555442 2333
Q YP_009337012.1 318 DFSNVN---------------------IKVLAQGDNQVICCNYTPRYSRTEEELVNNLRDIVNNNDRLISSVRIGIEKLG 376 (589)
Q Consensus 318 ~~~~~~---------------------~~l~gqGDNqvi~v~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~l~~~~~~~G 376 (589)
...+.. +.++..|||.++.+.-+ . .. .+.+.++++|
T Consensus 293 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~GDD~~~~~~~~--------~-----------~~----~~~~~~~~~G 349 (529)
T YP_009126871.1 293 ESEGVLTSELQDNGLRGWLEMHGEDRLTRLLVSGDDCVVNAMDE--------R-----------FS----NALTWLNLMA 349 (529)
Confidence 333332 37889999999887521 0 11 1234667899
Q YP_009337012.1 377 LVINE------DETMQSADFTNYGKVVIFR-----G----HICCLEEKRYSRVTCSANDQLLTIGSTLSAVSTNCLTVAH 441 (589)
Q Consensus 377 l~lK~------eET~~S~~~~~Y~K~~y~~-----G----~~l~~~lK~~~R~~~~~n~~~psl~~~issi~s~~~sa~~ 441 (589)
+++|. .+.+.+-.-..|.++.++. | .+....-+.++++.-..+.. .+....
T Consensus 350 ~~~~~~~~~~~~~~~~~~~~~~Fl~~~~~~~~~~~g~~~~~~~~~~~~~~~~~~~~~~~~-~~~~~~------------- 415 (529)
T YP_009126871.1 350 KVRKDVGQWEPSRGRDDWEEVEFCSNHFHRLTMKDGRELIVPCRDQTELVARACVNQGGS-ADPRAT------------- 415 (529)
Confidence 99994 5677777788888888874 5 22223333333332221111 000000
Q YP_009337012.1 442 FSKSPIRAMIQYNWLANFVRRVLELYHPGLCSTLKALVSKHTVNPLESVFYKIAYMSLDPSLGGVGGMSYARFHVRAFPD 521 (589)
Q Consensus 442 ~~~~~~~~~~~~~~~~~~~~~~~~~~~p~~~~~~~~~~~~~~~~~~~~~~~l~~ll~~P~~lGG~~~~~~~~~~~Rg~~D 521 (589)
..++.+...+.+.. ...+|..+.-....+.. .. --..|..-+.+++..-..+..
T Consensus 416 ------~~~~~~~~~~~~~~---~~~~p~~~~~~~~~~~~-------~~-----~~~~~~~~~~~~~~~~~~~~~----- 469 (529)
T YP_009126871.1 416 ------GCLAKSYAQMWQLL---YFHRRDLRMMSLAIMSA-------VP-----VDWVPTGRTTWSVHAGKEWMT----- 469 (529)
Confidence 00000000000000 00111110000000000 00 000011001111000000100
Q YP_009337012.1 522 AVTESLSFWKVIADHTTSEQLRAFAIEMGHPRLLRYEDSHFIKLIEDPSALNI 574 (589)
Q Consensus 522 plt~~l~~l~~~~~~~~~~~l~~~~~~~~~~~~~~~~~~~~~~Ll~DP~sln~ 574 (589)
-...+..+..++.... +.+.. ...-.+|..|---|.....
T Consensus 470 -~~~~~~~~~~~~~~~~-~~~~~-----------~~~~~~~~~~~~~~~~~~~ 509 (529)
T YP_009126871.1 470 -DEDMLEVWNRIWIRDN-PWMDR-----------KDEIDQWSNIPYLPRKVDK 509 (529)
Confidence 0111122222211100 00000 0000011111100000000
Great feedback, thanks! Yes, the sequences are highly diverged; I'm attempting to align many pairs of MSAs where the sequences are diverged to something like 5 substitutions per site. However, I have reasons to believe that most these sequences are globally homologous, and the 2018 paper reports a combined global alignment constructed with the help of hhalign without adding more diversity, at least per the incomplete description in their Methods, which I'm trying to reproduce. In that paper, the authors reduced redundancy to 90% identity; they already had more sequences but chose to reduce the number for reasons they do not explain. There are conserved catalytic motifs, and I have verified that these are correctly aligned in most of the sequences. Other methods I've tried fail to align the motifs. So to the best of my knowledge at this point, hhalign is the best/only method for constructing a combined global alignment from the many family-level MSAs. It would be great if you are interested in collaborating on this alignment problem -- by all means contact me directly by email robert@drive5,com.
@rcedgar from the alignment above it looks like your assumption about the global alignment could be right. We could probably improve the alignment when using more viral sequences to enrich the MSAs. I am interested in helping! I sent you an e-mail (themartinsteingger@gmail.com ).
I'm trying to make a global profile-profile alignment of two MSAs in FASTA format using hhalign (hhsuite-3.3.0-SSE2-Linux), i.e. preserving the columns within each MSA and aligning pairs of columns to each other. Input is conventional aligned FASTA with "-" for gap (all sequences same length, all letters upper case, not a2m or a variant). I can't figure out a command line to do this. I'm trying to reproduce a workflow from a 2018 paper where they give an hhalign command line which does not work, possibly because it's an older version of hhalign (they do not state their version). This is a similar command line that gives no errors or warnings:
However, the alignments in outname.ali.fa are all short local alignments, not global. I found the relevant documentation rather obscure, and trial and error did not succeed -- help will be much appreciated.