Open jamespjh opened 7 years ago
Hello,
Thank you for reporting this. This behavior is, in a sense, intended. The internal alignment functions in hhblits are written with SIMD instructions, so four templates can be processed simultaneously. However, if one of the templates in such a batch does not have secondary structure annotation, no secondary structure score is calculated for that batch. The responsible programmer (@martin-steinegger) can probably describe this better.
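As a rough illustration of the effect described above, here is a toy Python model. It is not the hh-suite code: the function name `batch_ss_scores` and the use of `None` for "no SS annotation" are assumptions made for this sketch, and only the group size of four comes from the description above.

```python
from typing import List, Optional

BATCH_SIZE = 4  # four templates are processed together, as described above


def batch_ss_scores(ss_scores: List[Optional[float]]) -> List[float]:
    """Toy model of per-batch SS scoring (hypothetical helper, not hh-suite code).

    ss_scores[i] is the SS score template i would get on its own,
    or None if the template has no secondary structure annotation.
    """
    result: List[float] = []
    for start in range(0, len(ss_scores), BATCH_SIZE):
        batch = ss_scores[start:start + BATCH_SIZE]
        if any(score is None for score in batch):
            # One unannotated template disables SS scoring for the whole batch.
            result.extend(0.0 for _ in batch)
        else:
            result.extend(score for score in batch if score is not None)
    return result
```

In this model a single unannotated Pfam template that shares a batch with three annotated PDB templates is enough to zero the SS score of all four.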
For your problem we might add secondary structure predictions to our pfam database.
Cheers, Markus
Understood. Do you know why we do not encounter this problem when running with both DBs on the web-based tool?
It would be great for us if you could add SS to Pfam. How long might this take?
Also, is there an option to turn off SIMD?
The web-based tool still uses an old version of hhblits. It runs two hhblits searches, the first against the first database and the second against the second database, and merges the results afterwards. In the old hhblits version we did not have this problem with SIMD instructions and secondary structure scoring.
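The two-search workflow described above can be sketched roughly as follows. This is only an illustration: `per_database_commands` and `merge_hits` are hypothetical helper names, the hit dictionaries are assumed to have been parsed from each result file already, and ranking by probability is an assumed merge criterion, since the thread does not say how the web tool merges its results. Only the `-i`, `-d` and `-o` options that appear in the commands in this thread are used.

```python
from typing import Dict, List


def per_database_commands(query: str, databases: List[str], out_prefix: str) -> List[List[str]]:
    """Build one hhsearch command line per database (hypothetical helper),
    instead of passing several -d options to a single search."""
    return [
        ["hhsearch", "-i", query, "-d", db, "-o", f"{out_prefix}.{index}.hhr"]
        for index, db in enumerate(databases)
    ]


def merge_hits(*hit_lists: List[Dict]) -> List[Dict]:
    """Combine already-parsed hit lists and rank them by probability.

    Ranking by the "prob" field is an assumption made for this sketch.
    """
    merged = [hit for hits in hit_lists for hit in hits]
    return sorted(merged, key=lambda hit: hit["prob"], reverse=True)
```

Each command list could be passed to `subprocess.run`, so that every database is scored in its own run before the results are combined.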
Annotating Pfam with SS would take a couple of days (fewer than 5, I assume). I will update the database pipeline so that future releases also include the secondary structure prediction.
There is no option to turn off SIMD, but you can limit the SIMD instructions to SSSE3 (the option is described in the manual).
Forgot to mention: the SIMD instructions have to be limited at build time with cmake.
OK, I don't think it's worth us getting a 4-fold slowdown to fix this, so I'll not try the no-SIMD approach. We'll wait until the pfam data is upgraded with SS. Will you let us know when this is ready?
J
The database is updated
Thanks very much
I get some negative SS values with the new database, is this expected behavior?
Can you show the alignment with the negative secondary structure scores?
The file header:
Query YPR199C Seqment 0
Match_columns 294
No_of_seqs 34 out of 1385
Neff 4.26402
Searched_HMMs 55100
Date Wed May 3 13:18:19 2017
Command hhsearch -remove_ss_cap -E 1000000000 -d /home/cceaiac/levine/databases/pdb70/pdb70 -ssm 4 -cpu 1 -o /home/cceaiac/Scratch/Levine/results/test_YPR199C/YPR199C.0.ssw11.hhr -i /home/cceaiac/Scratch/Levine/results/test_YPR199C/YPR199C.0.ss.a3m -v 2 -p 0 -cov 50 -ssw 0.11 -Z 5000 -d /home/cceaiac/levine/databases/pfamA_31/pfam
One example:
No Hit Prob E-value P-value Score SS Cols Query HMM Template HMM
9 PF08601.9 ; PAP1 ; Transcripti 97.9 5E-10 9.1E-15 109.2 -12.6 58 235-292 297-356 (356)
The alignment:
No 9
>PF08601.9 ; PAP1 ; Transcription factor PAP1
Probab=97.86 E-value=5e-10 Score=109.22 Aligned_cols=58 Identities=22% Similarity=0.399 Sum_probs=54.4 Template_Neff=4.100
Q ss_pred CCCCeeecHHHHHHHHHhCCccc--CCCHHHHHHHHHHhCccCCCCeeccHHHHHHHHhh
Q YPR199C 235 FGGDVLLSAMDIWSFMKVHPKVN--TFDLEILGTELKKSATCSNFDILISLKHFIKVFSS 292 (294)
Q Consensus 235 ~~g~~lLt~~atWeyi~~~~~~~--~fDv~~v~~kLKg~~~C~g~Gp~~~~~~i~~~~~s 292 (294)
..++.+||+.++|+||..|+.++ +|||+.|+++|+++++|+|+|+||.+.+|+.+|.+
T Consensus 297 ~~~~~lLTcvqaWd~IqshPkF~~gd~DLD~LCseLr~KAKCsGfGaVVee~dVd~iL~k 356 (356)
T Q0CHW7_ASPTN/2 297 EDKTQMLSCTKIWDRLQSMEKFRNGEIDVDNLCSELRTKARCSEGGVVVNQKDVDDIMGR 356 (356)
T ss_pred cCCCceecHHHHHHHHHhChhhhCCCCCHHHHHHHHhhcCccCCCCCCCCHHHHHHHhcC
Confidence 35789999999999999999998 89999999999999999999999999999998863
You are right. That is a bug. Can you give us your query?
I'm attaching a zip file with outputs of every step of the search. The headers should have the information you need. YPR199C_output.zip
Let me know if you need more input!
Could you please add your input query? At the moment I assume that you used: http://www.uniprot.org/uniprot/Q676V5.fasta
Ah! I think it's
YPR199C Seqment 0 MAKPRGRKGGRKPSLTPPKNKRAAQLRASQNAFRKRKLERLEELEKKEAQLTVTNDQIHILKKENELLHFMLRSLLTERNMPSDERNISKACCEEKPPTCNTLDGSVVLSSTYNSLEIQQCYVFFKQLLSVCVGKNCTVPSPLNSFDRSFYPIGCTNLSNDIPGYSFLNDAMSEIHTFGDFNGELDSTFLEFSGTEIKEPNNFITENTNAIETAAASMVIRQGFHPRQYYTVDAFGGDVLLSAMDIWSFMKVHPKVNTFDLEILGTELKKSATCSNFDILISLKHFIKVFSSKL*
You should get a warning with your hhsearch call:
Is that true? Where did you get your version of hhblits/hhsearch? What is this parameter supposed to do?
I could reproduce this bug. The responsible programmer will look into this. Thank you for your patience.
@ilectra I should have fixed the problem with negative score. Please let me know whether the problem persists.
@martin-steinegger, it did solve the negative SS scores, but there are still some zeros (these are just the first 20 matches):
Query YPR199C Seqment 0
Match_columns 294
No_of_seqs 34 out of 1376
Neff 4.26402
Searched_HMMs 52837
Date Mon May 15 14:12:08 2017
Command hhsearch -remove_ss_cap -E 1000000000 -d /home/cceaiac/levine/databases/pdb70/pdb70 -ssm 2 -cpu 1 -o /home/cceaiac/Scratch/Levine/results/test_YPR199C/YPR199C.0.ssw11.hhr -i /home/cceaiac/Scratch/Levine/results/test_YPR199C/YPR199C.0.ss.a3m -v 2 -p 0 -cov 50 -ssw 0.11 -Z 5000 -d /home/cceaiac/levine/databases/pfamA_31/pfam
No Hit Prob E-value P-value Score SS Cols Query HMM Template HMM
1 1sse_B AP-1 like transcription 98.8 2.2E-11 4.3E-16 97.0 5.7 60 235-294 26-85 (86)
2 1gd2_E Transcription factor PA 98.7 7.5E-10 1.4E-14 81.8 10.3 68 11-81 1-68 (70)
3 PF08601.9 ; PAP1 ; Transcripti 98.6 4.6E-10 8.8E-15 109.3 6.4 58 235-292 297-356 (356)
4 1gu4_A CAAT/enhancer binding p 98.2 2.4E-07 4.6E-12 69.2 10.8 61 14-77 11-71 (78)
5 1ci6_A Transcription factor AT 98.1 2.8E-07 5.3E-12 65.6 9.1 57 18-77 2-58 (63)
6 PF00170.20 ; bZIP_1 ; bZIP tra 98.1 4.7E-07 8.9E-12 63.8 9.7 58 18-78 5-62 (64)
7 2dgc_A Protein (GCN4); basic d 98.0 4.2E-07 8E-12 64.4 8.8 57 15-74 6-62 (63)
8 2wt7_A Proto-oncogene protein 98.0 7E-07 1.3E-11 62.9 9.5 57 18-77 2-58 (63)
9 1jnm_A Proto-oncogene C-JUN; B 97.9 1.2E-06 2.2E-11 61.0 9.1 57 19-78 2-58 (62)
10 PF03131.16 ; bZIP_Maf ; bZIP M 97.9 1.6E-06 3E-11 66.8 9.7 59 18-79 30-88 (90)
11 1t2k_D Cyclic-AMP-dependent tr 97.8 2.9E-06 5.5E-11 58.9 9.1 55 19-76 2-56 (61)
12 PF07716.14 ; bZIP_2 ; Basic re 97.6 6.7E-06 1.3E-10 56.1 7.5 51 17-70 4-54 (55)
13 1hjb_A Ccaat/enhancer binding 97.2 2.3E-06 4.4E-11 65.8 0.0 61 15-78 12-72 (87)
14 1dh3_A Transcription factor CR 96.8 1.1E-05 2.1E-10 55.7 0.0 52 19-73 2-53 (55)
15 5apu_A General control protein 96.8 0.00018 3.5E-09 59.1 7.0 48 19-73 46-93 (95)
16 3a5t_A Transcription factor MA 96.8 1.5E-05 2.8E-10 64.7 0.0 62 18-82 37-98 (107)
17 4c46_A General control protein 95.7 0.0076 1.4E-07 47.9 6.6 51 18-72 26-76 (76)
18 2wt7_B Transcription factor MA 95.4 0.0011 2E-08 51.8 0.0 60 18-80 27-86 (90)
19 1deb_A APC protein, adenomatou 95.3 0.018 3.5E-07 42.7 6.1 43 38-83 2-44 (54)
20 1kd8_B GABH BLL, GCN4 acid bas 95.1 0.015 2.9E-07 40.2 4.5 35 39-76 1-35 (36)
I should mention that those were not zero before the fix.
And some of the SS scores are still negative, both when the search is run online, and in my local version - try
>C9orf72
MSTLCPPPSPAVAKTEIALSGKSPLLAATFAYWDNILGPRVRHIWAPKTE
QVLLSDGEITFLANHTLNGEILRNAESGAIDVKFFVLSEKGVIIVSLIFD
GNWNGDRSTYGLSIILPQTELSFYLPLHRVCVDRLTHIIRKGRIWMHKER
QENVQKIILEGTERMEDQGQSIIPMLTGEVIPVMELLSSMKSHSVPEEID
IADTVLNDDDIGDSCHEGFLLNAISSHLQTCGCSVVVGSSAEKVNKIVRT
LCLFLTPAERKCSRLCEAESSFKYESGLFVQGLLKDSTGSFVLPFRQVMY
APYPTTHIDVDVNTVKQMPPCHEHIYNQRRYMRSELTAFWRATSEEDMAQ
DTIIYTDESFTPDLNIFQDVLHRDTLVKAFLDQVFQLKPGLSLRSTFLAQ
FLLVLHRKALTLIKYIEDDTQKGKKPFKSLRNLKIDLDLTAEGDLNIIMA
LAEKIKPGLHSFIFGRPFYTSVQERDVLMTF
Just to make sure we're comparing the same thing, what are the exact software (git tag) and database versions in the online tool?
@croth1, @martin-steinegger, any news on that?
SS scores can be negative. You could check the secondary structure alignment of these negatively scoring hits.
A 0 SS score can still occur when SS annotation types are mixed in the target database (e.g. if some HMMs have no SS annotation, or if some have only DSSP annotation and others only predictions).
I'm currently busy writing my thesis. I might fix the 0 score problem afterwards.
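One way to see whether a target database mixes annotation types is to group its entries by the SS records they carry. The sketch below assumes that annotations appear as header lines starting with `>ss_pred` (predicted) or `>ss_dssp` (DSSP) in each entry's text, and that the entries have already been read into a dictionary. The helper names `ss_annotation_types` and `group_by_annotation` are hypothetical.

```python
from typing import Dict, List, Set

SS_TAGS = ("ss_pred", "ss_dssp")  # assumed header names for predicted and DSSP annotation


def ss_annotation_types(text: str) -> Set[str]:
    """Return which SS annotation headers occur in one entry's text."""
    found: Set[str] = set()
    for line in text.splitlines():
        for tag in SS_TAGS:
            if line.startswith(">" + tag):
                found.add(tag)
    return found


def group_by_annotation(entries: Dict[str, str]) -> Dict[str, List[str]]:
    """Group entry names by their SS annotation types; 'none' means no SS record."""
    groups: Dict[str, List[str]] = {}
    for name, text in entries.items():
        key = "+".join(sorted(ss_annotation_types(text))) or "none"
        groups.setdefault(key, []).append(name)
    return groups
```

If the result contains more than one group, the database mixes annotation types and the zero-score case described above may apply.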
Hi,
When using both pdb and pfam:
hhsearch -d /home/ucgajhe/levine/databases/pdb70/pdb70 -ssm 4 -cpu 12 -o /home/ucgajhe/Scratch/Levine/results/test_YPR199C/YPR199C.0.ssw11.hhr -i /home/ucgajhe/Scratch/Levine/results/test_YPR199C/YPR199C.0.ss.a3m -v 2 -p 0 -cov 50 -ssw 0.11 -Z 5000 -d /home/ucgajhe/levine/databases/pfamA_30/pfam
we observe zero secondary structure scores for both PDB matches and PFAM matches:
but when running with PDB only, we get nonzero scores for all matches.
I note that the PDB database download includes SS data, but PFAM does not:
We note that matches with nonzero secondary structure scores are found in the web-search tool, but the downloadable version of hh-pfam does not have any SS info in it.
Most confusing of all, though, is why PDB matches become zero SS score when PFAM is present.
I think this might have something to do with the code in https://github.com/soedinglab/hh-suite/blob/master/src/hhviterbirunner.cpp
and this:
https://github.com/soedinglab/hh-suite/blob/master/src/hhhmm.cpp
which takes a minimum across the available data, so would result in zero SS for PDB when PFAM is present.
Any thoughts?