soedinglab / hh-suite

Remote protein homology detection suite.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3019-7
GNU General Public License v3.0
515 stars, 128 forks

change entry name in custom database #321

Fede112 closed this issue 1 year ago

Fede112 commented 1 year ago

Expected Behavior

I successfully built a custom database from a set of protein families, each represented by an MSA file (F0001.fas, F0002.fas, etc.). The problem appears when searching against this custom database: I want the hits reported by hhsearch to be named after the original MSA files, but they are not. The goal is to reproduce the kind of output obtained when searching against the prebuilt pfam35 database, where each hit is labeled with its family identifier:

 No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
  1 PF20147.2 ; Crinkler ; Crinkle  96.0 9.2E-05 4.7E-09   38.7   0.0   45    1-45      3-56  (110)
  2 PF19718.2 ; USP47_C ; Ubiquiti  37.3     6.9 0.00035   23.4   0.0   22   11-32     18-39  (244)
  3 PF10407.12 ; Cytokin_check_N ;  29.7      11 0.00056   18.1   0.0   23   13-35      5-27  (72)
  4 PF08817.13 ; YukD ; WXG100 pro  24.6      16  0.0008   15.0   0.0   16   14-29     15-30  (75)
  5 PF08783.14 ; DWNN ; DWNN domai  17.8      27  0.0014   16.4   0.0   12   21-32     22-33  (74)

This is not a bug, but it would be nice to be able to change the target name when needed.

Current Behavior

Searching my custom database with hhsearch produces hits named after the first protein in each MSA (or possibly the consensus sequence?) rather than after the original family name of the respective MSA file.

 No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
  1 A0A1V9YY71|4-100                99.2 3.8E-15 8.2E-20   91.8   0.0   90    1-96     21-126 (180)
  2 Q3ZBQ4|30-76                    94.0  0.0066 1.4E-07   33.9   0.0   22    9-30     45-66  (109)
  3 A0A1R3K2R3|49-250               90.9   0.053 1.1E-06   35.0   0.0   22    9-30     44-65  (234)
  4 D2V8P5|263-661                  89.3   0.099 2.1E-06   39.1   0.0   28   22-53      1-33  (744)
  5 UPI00081137D8|150-376           64.8     2.9 6.2E-05   29.0   0.0   40    7-46     24-63  (366)

Steps to Reproduce (for bugs)

1) Build a custom database as detailed in the user guide (section on starting from MSAs)

2) hhsearch -cpu 4 -i query.hhm -d ./databases/custom_db -o query.hhr

Your Environment

Fede112 commented 1 year ago

Found the solution. The issue was in how hhconsensus annotates the consensus sequence. Adding a comment line (starting with #) with the family name at the top of each MSA made the hits show up under that name, which is what I wanted. For example:

# FAM1231
>Prot1| xx-yy
KDGVSGTSDLKLLGAARARLR
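
For a database with many families, the workaround above could be scripted and run on the MSA files before building the database. The following is a minimal Python sketch, not part of HH-suite. It assumes the MSAs are plain-text files in a single directory and that the file stem (F0001 for F0001.fas) is the desired family name. The helpers `add_name_line` and `annotate_directory` are hypothetical names introduced here for illustration.

```python
from pathlib import Path
from typing import List, Optional


def add_name_line(msa_path: Path, name: Optional[str] = None) -> bool:
    """Prepend a '# <name>' line to an MSA file unless it already starts with '#'.

    Returns True if the file was modified, False if it was left untouched.
    """
    # Assumption: the file stem is the family name (F0001.fas -> F0001).
    family = name if name is not None else msa_path.stem
    text = msa_path.read_text()
    if text.startswith("#"):
        return False  # a name line is already present
    # Same layout as the example in this thread: "# FAM1231" on the first line.
    msa_path.write_text(f"# {family}\n{text}")
    return True


def annotate_directory(msa_dir: Path, pattern: str = "*.fas") -> List[str]:
    """Add a name line to every matching MSA in msa_dir.

    Returns the file names that were changed, in sorted order.
    """
    changed = []
    for path in sorted(msa_dir.glob(pattern)):
        if add_name_line(path):
            changed.append(path.name)
    return changed
```

Calling `annotate_directory(Path("msas"))`, where `msas` is a placeholder directory name, would rewrite each `.fas` file in place. Files that already begin with `#` are skipped, so running it twice is harmless.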

Still, it would be nice to be able to choose this from the command line.