soedinglab / hh-suite

Remote protein homology detection suite.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3019-7
GNU General Public License v3.0
515 stars 128 forks source link

Zero SS scores when using both pdb and pfam #54

Open jamespjh opened 7 years ago

jamespjh commented 7 years ago
hh-suite commit ID  b1aa73afae4811c7
Operating system: linux, compiler: gnu 4.9.2

Hi,

When using both pdb and pfam:

hhsearch -d /home/ucgajhe/levine/databases/pdb70/pdb70 -ssm 4 -cpu 12 -o /home/ucgajhe/Scratch/Levine/results/test_YPR199C/YPR199C.0.ssw11.hhr -i /home/ucgajhe/Scratch/Levine/results/test_YPR199C/YPR199C.0.ss.a3m -v 2 -p 0 -cov 50 -ssw 0.11 -Z 5000 -d /home/ucgajhe/levine/databases/pfamA_30/pfam

we observe zero secondary structure scores for both PDB matches and PFAM matches:

No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
  1 1gd2_E Transcription factor PA  98.7 3.4E-10 6.3E-15   83.8   8.4   68   11-81      1-68  (70)
  2 1sse_B AP-1 like transcription  98.7 2.3E-11 4.2E-16   97.0   0.0   60  235-294    26-85  (86)
  3 1gu4_A CAAT/enhancer binding p  98.1 2.5E-07 4.5E-12   69.3   9.7   61   14-77     11-71  (78)
  4 PF08601.7 ; PAP1 ; Transcripti  97.9   3E-08 5.5E-13   97.1   0.0   56  236-291   298-355 (356)
  5 1hjb_A Ccaat/enhancer binding   97.7   1E-07 1.9E-12   73.4   0.0   61   15-78     12-72  (87)
  6 1ci6_A Transcription factor AT  97.5 2.9E-07 5.3E-12   65.6   0.0   57   18-77      2-58  (63)
  7 PF00170.18 ; bZIP_1 ; bZIP tra  97.5 3.5E-07 6.5E-12   64.3   0.0   58   18-78      5-62  (64)
  8 2dgc_A Protein (GCN4); basic d  97.5 4.4E-07   8E-12   64.4   0.0   57   15-74      6-62  (63)
  9 2wt7_A Proto-oncogene protein   97.4 6.9E-07 1.3E-11   63.0   0.0   57   18-77      2-58  (63)
 10 1jnm_A Proto-oncogene C-JUN; B  97.3 1.2E-06 2.2E-11   61.0   0.0   57   19-78      2-58  (62)
 11 1dh3_A Transcription factor CR  97.2 1.7E-06 3.1E-11   59.9   0.0   52   19-73      2-53  (55)
 12 PF07716.12 ; bZIP_2 ; Basic re  97.1 2.7E-06 4.9E-11   58.3   0.0   51   17-70      4-54  (55)
 13 1t2k_D Cyclic-AMP-dependent tr  97.1 2.8E-06 5.2E-11   59.0   0.0   55   19-76      2-56  (61)
 14 3a5t_A Transcription factor MA  97.1   4E-06 7.2E-11   68.2   0.0   62   18-82     37-98  (107)
 15 2wt7_B Transcription factor MA  97.0 6.5E-06 1.2E-10   64.3   0.0   60   18-80     27-86  (90)
 16 PF03131.14 ; bZIP_Maf ; bZIP M  96.9 9.6E-06 1.8E-10   62.1   0.0   58   18-78     30-87  (90)
 17 5apu_A General control protein  96.5 0.00019 3.5E-09   59.1   3.2   48   19-73     46-93  (95)
 18 2oxj_A Hybrid alpha/beta pepti  94.6   0.029 5.3E-07   38.0   4.4   32   39-73      1-32  (34)
 19 2r2v_A GCN4 leucine zipper; co  94.4   0.036 6.6E-07   38.0   4.2   32   39-73      1-32  (34)
 20 PF16689.2 ; APC_N_CC ; Coiled-  94.2  0.0059 1.1E-07   44.5   0.0   43   39-84      1-43  (52)
 21 4c46_A General control protein  94.0  0.0079 1.4E-07   47.9   0.0   51   18-72     26-76  (76)
 22 1kd8_B GABH BLL, GCN4 acid bas  93.7   0.011   2E-07   41.0   0.0   35   39-76      1-35  (36)
 23 3w92_A Thioester coiled coil p  93.3   0.015 2.8E-07   39.4   0.0   31   40-73      1-31  (32)
 24 1deb_A APC protein, adenomatou  93.3   0.015 2.8E-07   43.2   0.0   43   38-83      2-44  (54)
 25 2wq1_A General control protein  92.6   0.025 4.5E-07   38.6   0.0   31   40-73      1-31  (33)
 26 3c3g_A Alpha/beta peptide with  92.3    0.03 5.4E-07   38.2   0.0   31   40-73      1-31  (33)

but when running with PDB only, we get nonzero scores for all matches.

I note that the PDB database download includes SS data, but PFAM does not:


No 10
>2dgc_A Protein (GCN4); basic domain, leucine zipper, DNA binding, eukaryotic regulatory protein, transcription/DNA complex; HET: DNA; 2.20A {Saccharomyces cerevisiae} SCOP: h.1.3.1 PDB: 1dgc_A* 1ld4_E 1ysa_C* 3p8m_D
Probab=97.47  E-value=4.4e-07  Score=64.37  Aligned_cols=57  Identities=26%  Similarity=0.323  Sum_probs=45.1  Template_Neff=9.500

Q ss_pred             CCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
Q YPR199C          15 LTPPKNKRAAQLRASQNAFRKRKLERLEELEKKEAQLTVTNDQIHILKKENELLHFMLRS   74 (294)
Q Consensus        15 ~~~~k~KRKaQNRaAQkAFRERKE~rlkeLE~kl~ele~~~~~~~~L~~EnE~Lr~~n~e   74 (294)
                      ....+.+|+.+||.||+.+|+||..++.+||.++..|+..+   ..|..+++.|+..+..
T Consensus         6 ~~~~~~~kr~rnr~~~~~~R~rk~~~~~~le~~v~~l~~~~---~~l~~~~~~l~~~~~~   62 (63)
T 2dgc_A            6 SSDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKN---YHLENEVARLKKLVGE   62 (63)
T ss_dssp             -----CHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH---HHHHHHHHHHHHC---
T ss_pred             cccHHHHHHHHhHHHHHHHHHHHHHHHHHHHHHHHHHHHHH---HHHHHHHHHHHHHHHh
Confidence            34566778899999999999999999999999999999888   7888888888876654

No 11
>PF00170.19 ; bZIP_1 ; bZIP transcription factor
Probab=97.45  E-value=4.9e-07  Score=63.79  Aligned_cols=58  Identities=28%  Similarity=0.290  Sum_probs=51.1  Template_Neff=9.800

Q ss_pred             HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
Q YPR199C          18 PKNKRAAQLRASQNAFRKRKLERLEELEKKEAQLTVTNDQIHILKKENELLHFMLRSLLTE   78 (294)
Q Consensus        18 ~k~KRKaQNRaAQkAFRERKE~rlkeLE~kl~ele~~~~~~~~L~~EnE~Lr~~n~el~~e   78 (294)
                      ++.+|+.+||.||+.||+||..++.+||.++..|+...   ..|..+++.|+..+..|..+
T Consensus         5 k~~rr~~~nr~~~~~~R~rk~~~~~~Le~~~~~L~~~~---~~l~~~~~~l~~e~~~L~~~   62 (64)
T GBF1_ARATH/220    5 KRQKRKQSNRESARRSRLRKQAECEQLQQRVESLSNEN---QSLRDELQRLSSECDKLKSE   62 (64)
Confidence            56788999999999999999999999999999999888   77888888888887776554

We note that secondary structure nonzero matches are found in the web-search tool, but that the downloadable version of hh-pfam does not have any SS info in it.

Most confusing of all, though, is why PDB matches become zero SS score when PFAM is present.

I think this might have something to do with the code in https://github.com/soedinglab/hh-suite/blob/master/src/hhviterbirunner.cpp

   int ss_hmm_mode = HMM::computeScoreSSMode(q_simd->GetHMM(0), t_hmm_simd->GetHMM(0));
    for(size_t i = 1; i < maxres; i++){
        ss_hmm_mode = std::min(ss_hmm_mode,
                               HMM::computeScoreSSMode(q_simd->GetHMM(0), t_hmm_simd->GetHMM(i)));
    }

and this:

https://github.com/soedinglab/hh-suite/blob/master/src/hhhmm.cpp

int HMM::computeScoreSSMode( HMM *q,  HMM *t){
    int returnMode = HMM::NO_SS_INFORMATION;
    if      (q->nss_pred>=0 && t->nss_dssp>=0) returnMode=HMM::PRED_DSSP;
    else if (q->nss_dssp>=0 && t->nss_pred>=0) returnMode=HMM::DSSP_PRED;
    else if (q->nss_pred>=0 && t->nss_pred>=0) returnMode=HMM::PRED_PRED;
    return returnMode;
}

which takes a minimum across the available data, so would result in zero SS for PDB when PFAM is present.

Any thoughts?

meiermark commented 7 years ago

Hello,

thank you for reporting this. This behavior is kind of intended. The internal alignment functions in hhblits are written with simd instructions. This way four templates can be processed simultaneously. However, if one of those templates in such a bulk does not have secondary structure annotation, then no seconary score is calculated in this bulk. The responsible programmer (@martin-steinegger) can probably describe this better.

For your problem we might add secondary structure predictions to our pfam database.

Cheers, Markus

jamespjh commented 7 years ago

Understood. Do you know why we do not encounter this problem when running with both DBs on the web-based tool?

jamespjh commented 7 years ago

It would be great for us if you could add SS to the pFAM: how long might this take?

jamespjh commented 7 years ago

Also, is there an option to turn off SIMD?

meiermark commented 7 years ago

the web-based tool still uses an old version of hhblits; they run two hhblits searches first against the first database, the second search against the second database; afterwards they merge the results in the old hhblits version we did not have the problem with simd instructions and secondary structure scoring

the annotation of ss to the pfam would take a couple of days (< 5 i assume)... i will update the database pipeline, so future releases will also have the secondary structure prediction

there is no option to turn off simd, you can limit the simd instructions to ssse3 (the option is described in the manual).

meiermark commented 7 years ago

forgot to mention: the limitation of the simd instructions has to be done during building with cmake

jamespjh commented 7 years ago

OK, I don't think it's worth us getting a 4-fold slowdown to fix this, so I'll not try the no-SIMD approach. We'll wait until the pfam data is upgraded with SS. Will you let us know when this is ready?

J

meiermark commented 7 years ago

The database is updated

jamespjh commented 7 years ago

Thanks very much

ilectra commented 7 years ago

I get some negative SS values with the new database, is this expected behavior?

meiermark commented 7 years ago

Can you show the alignment with the negative secondary structure scores?

ilectra commented 7 years ago

The file header:

Query         YPR199C Seqment 0
Match_columns 294
No_of_seqs    34 out of 1385
Neff          4.26402
Searched_HMMs 55100
Date          Wed May  3 13:18:19 2017
Command       hhsearch -remove_ss_cap -E 1000000000 -d /home/cceaiac/levine/databases/pdb70/pdb70 -ssm 4 -cpu 1 -o /home/cceaiac/Scratch/Levine/results/test_YPR199C/YPR199C.0.ssw11.hhr -i /home/cceaiac/Scratch/Levine/results/test_YPR199C/YPR199C.0.ss.a3m -v 2 -p 0 -cov 50 -ssw 0.11 -Z 5000 -d /home/cceaiac/levine/databases/pfamA_31/pfam 

One example:

 No Hit                              Prob E-value P-value  Score  SS   Cols Query HMM  Template HMM
   9 PF08601.9 ; PAP1 ; Transcripti  97.9   5E-10 9.1E-15  109.2 -12.6   58  235-292  297-356 (356)

The alignment:

No 9
>PF08601.9 ; PAP1 ; Transcription factor PAP1
Probab=97.86  E-value=5e-10  Score=109.22  Aligned_cols=58  Identities=22%  Similarity=0.399  Sum_probs=54.4  Template_Neff=4.100

Q ss_pred             CCCCeeecHHHHHHHHHhCCccc--CCCHHHHHHHHHHhCccCCCCeeccHHHHHHHHhh
Q YPR199C         235 FGGDVLLSAMDIWSFMKVHPKVN--TFDLEILGTELKKSATCSNFDILISLKHFIKVFSS  292 (294)
Q Consensus       235 ~~g~~lLt~~atWeyi~~~~~~~--~fDv~~v~~kLKg~~~C~g~Gp~~~~~~i~~~~~s  292 (294)
                      ..++.+||+.++|+||..|+.++  +|||+.|+++|+++++|+|+|+||.+.+|+.+|.+
T Consensus       297 ~~~~~lLTcvqaWd~IqshPkF~~gd~DLD~LCseLr~KAKCsGfGaVVee~dVd~iL~k  356 (356)
T Q0CHW7_ASPTN/2  297 EDKTQMLSCTKIWDRLQSMEKFRNGEIDVDNLCSELRTKARCSEGGVVVNQKDVDDIMGR  356 (356)
T ss_pred             cCCCceecHHHHHHHHHhChhhhCCCCCHHHHHHHHhhcCccCCCCCCCCHHHHHHHhcC
Confidence            35789999999999999999998  89999999999999999999999999999998863
meiermark commented 7 years ago

You are right. That is a bug. Can you give us your query?

ilectra commented 7 years ago

I'm attaching a zip file with outputs of every step of the search. The headers should have the information you need. YPR199C_output.zip

Let me know if you need more input!

meiermark commented 7 years ago

Could you please add your input query? At the moment I assume, that you use: http://www.uniprot.org/uniprot/Q676V5.fasta

ilectra commented 7 years ago

A! I think it's

YPR199C Seqment 0 MAKPRGRKGGRKPSLTPPKNKRAAQLRASQNAFRKRKLERLEELEKKEAQLTVTNDQIHILKKENELLHFMLRSLLTERNMPSDERNISKACCEEKPPTCNTLDGSVVLSSTYNSLEIQQCYVFFKQLLSVCVGKNCTVPSPLNSFDRSFYPIGCTNLSNDIPGYSFLNDAMSEIHTFGDFNGELDSTFLEFSGTEIKEPNNFITENTNAIETAAASMVIRQGFHPRQYYTVDAFGGDVLLSAMDIWSFMKVHPKVNTFDLEILGTELKKSATCSNFDILISLKHFIKVFSSKL*

meiermark commented 7 years ago

You should get a warning with your hhsearch call:

Is that true? Where did you get your version of hhblits/hhsearch? What is this parameter supposed to do?

I could reproduce this bug. The responsible programmer will look into this. Thank you for your patience.

martin-steinegger commented 7 years ago

@ilectra I should have fixed the problem with negative score. Please let me know whether the problem persists.

ilectra commented 7 years ago

@martin-steinegger , it did solve the negative SS scores, but there are still some zeros there (these are just the first 20 matches):

Query         YPR199C Seqment 0
Match_columns 294
No_of_seqs    34 out of 1376
Neff          4.26402
Searched_HMMs 52837
Date          Mon May 15 14:12:08 2017
Command       hhsearch -remove_ss_cap -E 1000000000 -d /home/cceaiac/levine/databases/pdb70/pdb70 -ssm 2 -cpu 1 -o /home/cceaiac/Scratch/Levine/results/test_YPR199C/YPR199C.0.ssw11.hhr -i /home/cceaiac/Scratch/Levine/results/test_YPR199C/YPR199C.0.ss.a3m -v 2 -p 0 -cov 50 -ssw 0.11 -Z 5000 -d /home/cceaiac/levine/databases/pfamA_31/pfam

 No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
  1 1sse_B AP-1 like transcription  98.8 2.2E-11 4.3E-16   97.0   5.7   60  235-294    26-85  (86)
  2 1gd2_E Transcription factor PA  98.7 7.5E-10 1.4E-14   81.8  10.3   68   11-81      1-68  (70)
  3 PF08601.9 ; PAP1 ; Transcripti  98.6 4.6E-10 8.8E-15  109.3   6.4   58  235-292   297-356 (356)
  4 1gu4_A CAAT/enhancer binding p  98.2 2.4E-07 4.6E-12   69.2  10.8   61   14-77     11-71  (78)
  5 1ci6_A Transcription factor AT  98.1 2.8E-07 5.3E-12   65.6   9.1   57   18-77      2-58  (63)
  6 PF00170.20 ; bZIP_1 ; bZIP tra  98.1 4.7E-07 8.9E-12   63.8   9.7   58   18-78      5-62  (64)
  7 2dgc_A Protein (GCN4); basic d  98.0 4.2E-07   8E-12   64.4   8.8   57   15-74      6-62  (63)
  8 2wt7_A Proto-oncogene protein   98.0   7E-07 1.3E-11   62.9   9.5   57   18-77      2-58  (63)
  9 1jnm_A Proto-oncogene C-JUN; B  97.9 1.2E-06 2.2E-11   61.0   9.1   57   19-78      2-58  (62)
 10 PF03131.16 ; bZIP_Maf ; bZIP M  97.9 1.6E-06   3E-11   66.8   9.7   59   18-79     30-88  (90)
 11 1t2k_D Cyclic-AMP-dependent tr  97.8 2.9E-06 5.5E-11   58.9   9.1   55   19-76      2-56  (61)
 12 PF07716.14 ; bZIP_2 ; Basic re  97.6 6.7E-06 1.3E-10   56.1   7.5   51   17-70      4-54  (55)
 13 1hjb_A Ccaat/enhancer binding   97.2 2.3E-06 4.4E-11   65.8   0.0   61   15-78     12-72  (87)
 14 1dh3_A Transcription factor CR  96.8 1.1E-05 2.1E-10   55.7   0.0   52   19-73      2-53  (55)
 15 5apu_A General control protein  96.8 0.00018 3.5E-09   59.1   7.0   48   19-73     46-93  (95)
 16 3a5t_A Transcription factor MA  96.8 1.5E-05 2.8E-10   64.7   0.0   62   18-82     37-98  (107)
 17 4c46_A General control protein  95.7  0.0076 1.4E-07   47.9   6.6   51   18-72     26-76  (76)
 18 2wt7_B Transcription factor MA  95.4  0.0011   2E-08   51.8   0.0   60   18-80     27-86  (90)
 19 1deb_A APC protein, adenomatou  95.3   0.018 3.5E-07   42.7   6.1   43   38-83      2-44  (54)
 20 1kd8_B GABH BLL, GCN4 acid bas  95.1   0.015 2.9E-07   40.2   4.5   35   39-76      1-35  (36)
ilectra commented 7 years ago

I should mention that those were not zero before the fix.

ilectra commented 7 years ago

And some of the SS scores are still negative, both when the search is run online, and in my local version - try

>C9orf72

MSTLCPPPSPAVAKTEIALSGKSPLLAATFAYWDNILGPRVRHIWAPKTE
QVLLSDGEITFLANHTLNGEILRNAESGAIDVKFFVLSEKGVIIVSLIFD
GNWNGDRSTYGLSIILPQTELSFYLPLHRVCVDRLTHIIRKGRIWMHKER
QENVQKIILEGTERMEDQGQSIIPMLTGEVIPVMELLSSMKSHSVPEEID
IADTVLNDDDIGDSCHEGFLLNAISSHLQTCGCSVVVGSSAEKVNKIVRT
LCLFLTPAERKCSRLCEAESSFKYESGLFVQGLLKDSTGSFVLPFRQVMY
APYPTTHIDVDVNTVKQMPPCHEHIYNQRRYMRSELTAFWRATSEEDMAQ
DTIIYTDESFTPDLNIFQDVLHRDTLVKAFLDQVFQLKPGLSLRSTFLAQ
FLLVLHRKALTLIKYIEDDTQKGKKPFKSLRNLKIDLDLTAEGDLNIIMA
LAEKIKPGLHSFIFGRPFYTSVQERDVLMTF

Just to make sure we're comparing the same thing, what's the exact software (git tag) and databses versions in the online tool?

ilectra commented 7 years ago

@croth1 , @martin-steinegger , any news on that?

martin-steinegger commented 7 years ago

SS scores can be negative. You could check the SS structure alignment of this negative scoring hits.

The 0 at the SS scoring can still occur when mixing SS types at the target db. (e.g. If some hmms don't have a SS structure or if some have just DSSP and other just Predictions)

I'm currently busy with writing my thesis. I might change the 0 score problem afterwards.