soedinglab / hh-suite

Remote protein homology detection suite.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3019-7
GNU General Public License v3.0

Both HHBlits and HHSearch give misaligned indels for homologous sequences #363

Open hrp1000 opened 11 months ago

hrp1000 commented 11 months ago

I put one chain from a PDB entry into my library, then run either HHBlits or HHSearch with a homologous chain from the same entry as the query. The two chains differ by indels, and the indels do not align between query and target.

Expected Behavior - indels should align

Current Behavior - indels do not align, and the sequence identity is lower than it "obviously" would be if the indels aligned. NCBI BLAST gives 97.37% sequence identity (the indels are in the right place); HHBlits reports 88%.

Steps to Reproduce (for bugs)

Put the sequence of chain C from 5vol into the library, then run chain A from 5vol as the query against it. Chain C has a leading PW at the N-terminus and an indel, QGAVPAD, at residues 184-190. Chain A has a G at the C-terminus. In all other respects the two chains have 100% sequence identity.

command to run:

/bmm/soft/linux64/src/hh-suite-bin/bin/hhblits -n 1 -i /bmm/www/servers/phyre2/test/hmm/testc7xrt//c5volA.hhblits.hhm -d /bmm/www/servers/phyre2/test/hmm/full -o /bmm/www/servers/phyre2/test/hmm/testc7xrt//c5volA.hhblits.hhr -b 100 -norealign -z 500 -alt 1 -aliw 60

HH-suite Output (for bugs)

See the attached file; the interesting part is below. Note that the indel in c5volC (target) lies around residues 168-174, but the corresponding gap in the query (c5volA) is placed around residues 196-202.

Q ss_dssp             CCSGGGEEEEEETHHHHHHHHHHHHTTTTCSEEEEESCCSSCCCCTTSHHHHHHHHHHHT
Q ss_pred             ccchhheeecccchhHHHHHHHHhhcccccceeeeeccccCccCccccccccccccCCCC
Q c5volA          121 IGDRQHRAIAGLSMGGGGATNYGQRHSDMFCAVYAMSALMSIPEDPNSKIAILTRSVIEN  180 (260)
Q Consensus       121 ~~g~s~g~a~~~~~~~~~~~  180 (260)
                      ..+..++.+.|.|.|+..+...+...+..+..++..++......................
T Consensus       123 ~~G~S~Gga~~~~~~~~~~~  182 (268)
T c5volC_         123 IGDRQHRAIAGLSMGGGGATNYGQRHSDMFCAVYAMSALMSIPEQGAVPADDPNSKIAIL  182 (268)
T ss_dssp             CCSGGGEEEEEETHHHHHHHHHHHHCTTTCSEEEEESCCSSCCSSC---CCCTTSHHHHH
T ss_pred             CCCCcccEEEEEccchHHHHHHHHhChHHhHHHhhccccccccccccccccccccCccch

Q ss_dssp             CHHHHHHTCCHHHHH-------HHTTSEEEEECCTTCTTHHHHHHHHHHHHHTTCCCEEE
Q ss_pred             chHHHHhhcchhhhh-------ccccccccccccccCccchHHHHHHHHHHHCCCcEEEE
Q c5volA          181 SCVKYVMEADEDRKA-------DLRSVAWFVDCGDDDFLLDRNIEFYQAMRNAGVPCQFR  233 (260)
Q Consensus       181 ~~~-------~~~~~L~g~~  233 (260)
                      ...............       ....+++++.+++.|....++++++++|++.|+++++.
T Consensus       183 ~~~~~~~gD~~l~g~~  242 (268)
T c5volC_         183 TRSVIENSCVKYVMEADEDRKADLRSVAWFVDCGDDDFLLDRNIEFYQAMRNAGVPCQFR  242 (268)
T ss_dssp             HHHHHHTCHHHHHHTCCHHHHHHHTTSEEEEECCTTCTTHHHHHHHHHHHHHTTCCCEEE
T ss_pred             hHHHHhcCHHHHHHhcChhhhhhccCceEEEEecCchHhHHHHHHHHHHHHHCCCCcEEE
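The gap placement in the two blocks quoted above can be checked with a short script. This is only a sketch: the sequence rows are copied from those two blocks (query residues 121-233, not the full alignment), and the helper names `gap_runs`, `first_mismatch` and `residue_at` are made up for illustration and are not part of HH-suite.

```python
# Aligned rows copied from the two HHblits blocks quoted above
# (query residues 121-233 only, not the full alignment).
QUERY_START, TARGET_START = 121, 123
query = (
    "IGDRQHRAIAGLSMGGGGATNYGQRHSDMFCAVYAMSALMSIPEDPNSKIAILTRSVIEN"
    "SCVKYVMEADEDRKA-------DLRSVAWFVDCGDDDFLLDRNIEFYQAMRNAGVPCQFR"
)
target = (
    "IGDRQHRAIAGLSMGGGGATNYGQRHSDMFCAVYAMSALMSIPEQGAVPADDPNSKIAIL"
    "TRSVIENSCVKYVMEADEDRKADLRSVAWFVDCGDDDFLLDRNIEFYQAMRNAGVPCQFR"
)


def gap_runs(row):
    """Return (start, end) column ranges, end exclusive, of '-' runs in a row."""
    runs, start = [], None
    for i, ch in enumerate(row):
        if ch == "-" and start is None:
            start = i
        elif ch != "-" and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(row)))
    return runs


def first_mismatch(q, t):
    """Return the first column where the two rows differ, or None."""
    for i, (a, b) in enumerate(zip(q, t)):
        if a != b:
            return i
    return None


def residue_at(row, first_residue, col):
    """Number of the last residue at or before column col (gaps are skipped)."""
    return first_residue + sum(ch != "-" for ch in row[: col + 1]) - 1


col = first_mismatch(query, target)
print("sequences first diverge at column", col,
      "| query residue", residue_at(query, QUERY_START, col),
      "| target residue", residue_at(target, TARGET_START, col))

for start, end in gap_runs(query):
    print("query gap at columns", start, "to", end - 1,
          "| after query residue", residue_at(query, QUERY_START, start - 1),
          "| opposite target residues",
          residue_at(target, TARGET_START, start), "to",
          residue_at(target, TARGET_START, end - 1))
```

On the rows quoted above, the sequences first diverge at column 44 (query residue 165, target residue 167), while the only gap in the query sits at columns 75 to 81, about 30 columns downstream of where the extra residues in the target begin.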

Context

If a straightforward comparison between two homologous chains gives an apparently erroneous alignment, how can I trust the results for more complicated alignments with lower sequence identity?

Your Environment

c5volA_.hhblits.txt