Upgrade dependency on psipred to avoid dependency on Legacy Blast

jamespjh commented 7 years ago

HH-suite dependency on psipred requires NCBI tools, including Makemat which is in legacy Blast but not Blast+.

It would make it easier when installing hh-suite if it could depend on the current version of psipred, which @DanBuchan tells me doesn't need the legacy blast.

jamespjh commented 7 years ago

The dependency in question is for the perl scripts in https://github.com/soedinglab/hh-suite/blob/master/scripts/HHPaths.pm

DanBuchan commented 7 years ago

Strictly speaking BLAST+ support is experimental for PSIPRED but we are using it in production on our web server and as best as we can tell the results are equivalent. They haven't been fully benchmarked to formally prove this equivalence which is why it remains experimental.

This PSIPRED and legacy BLAST dependency arises in the addss.pl script. I've attached the version of this script which I've monkey patched for BLAST+. Note the changes at lines 555, 558-560, 563-565 and 588

addss.zip

A more robust solution would choose which of the .mtx file creation steps to take depending on the blast version provided.
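One way to make that choice, sketched here in Python rather than the Perl of addss.pl, is to look at which executables the configured BLAST directory contains. The helper name `detect_blast_flavour` and the idea of passing a single bin directory are assumptions for illustration. Only the binary names are real: `blastpgp` and `makemat` ship with legacy BLAST, `psiblast` ships with BLAST+.

```python
import os


def detect_blast_flavour(bin_dir):
    """Return 'legacy', 'plus', or None for the BLAST install in bin_dir.

    Hypothetical helper: it only checks which executables are present.
    """
    def has(name):
        path = os.path.join(bin_dir, name)
        return os.path.isfile(path) and os.access(path, os.X_OK)

    # Legacy BLAST ships blastpgp and makemat; the .mtx file comes from makemat.
    if has("blastpgp") and has("makemat"):
        return "legacy"
    # BLAST+ ships psiblast and has no makemat, so the .mtx step must differ.
    if has("psiblast"):
        return "plus"
    return None
```

A caller could branch on the return value to pick the matching .mtx creation step, and fail with a clear message when it gets `None`.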

As a general comment the addss.pl script should probably deprecate versions of PSIPRED prior to 3.5 given the difference in prediction accuracy from the pre-v3.0 releases to today's v4.0

lukaszimmermann commented 7 years ago

Hello, is there a reason why one would employ BLAST+ rather than legacy BLAST for PSIPRED, except for installation convenience? Is it known that BLAST+ produces more accurate results for the SS prediction than legacy BLAST when used as a dependency for PSIPRED?

I support the fact that legacy BLAST should be deprecated, but as long as PSIPRED states (quoting the README):

Please see the README file in the BLAST+ subdirectory for more information on PSIPRED's support for BLAST+.

I feel quite reluctant to switch to BLAST+ here, especially if it is untested and considered experimental. Unless, of course, there is an impact on PSIPRED's function.

DanBuchan commented 7 years ago

Sorry for the delay; I have just returned from holiday and finished writing a paper. To clarify, I work on PSIPRED and the PSIPRED webserver in David Jones' group.

The principal reason for moving from legacy blast is that it is no longer supported by the NCBI, so it has received none of the bug fixes or improvements added to BLAST+ since 2012.

To our understanding BLAST and BLAST+ produce equivalent SS predictions for PSIPRED. But BLAST+ brings with it a number of useful features we make use of. As BLAST+ allows MSA input, you can run SS prediction over an MSA with BLAST+. This feature is available on our website (or with a trivial change to the runpsipredplus script).

With regard to moving to BLAST+, we recommend legacy blast only insofar as we guarantee it produces results with the sensitivity and selectivity quoted in the last PSIPRED paper. BLAST+ support is not untested, and the PSIPRED webserver has been using the BLAST+ version of psipred for 6+ years. It is for now unpublished, which is why it remains "experimental", although that is something of an academic point.

As I stated before the most robust solution would be for addss.pl to support both legacy and blast+ for PSIPRED.

lukaszimmermann commented 7 years ago

Thank you very much for the clarification.

I guess we will then also change to an adapted version of addss.pl which uses BLAST+. Given your comment, I no longer see the point in using legacy BLAST in addss.pl.

milot-mirdita commented 7 years ago

There is another issue for us: the template selection neural networks in HHpred were trained (a long time ago) based on the old psipred and blast versions. First we would have to benchmark whether the switch has any impact on HHpred performance. If it does have a negative impact, we would have to investigate and probably also bug a few former members of our group to retrain those on the output of the new versions.

We are, however, a bit short of manpower. Markus, the current maintainer of the HH-suite, is finishing up his PhD. I was the last one involved with HHpred and I will also leave the group soon. If you have any manpower to spare to evaluate the impact upon HHpred, we could upgrade these dependencies.

lukaszimmermann commented 7 years ago

I need to talk to our Toolkit Manager about this issue, who will return from vacation next week.

lukaszimmermann commented 7 years ago

However, wouldn't it be a good idea to retrain the neural networks (on BLAST+ and new PSIPRED) anyway and compare the performance to the current version of HHpred? I would be interested in the outcome. If the performance is comparable, we can simply upgrade then. I would not expect the performance to decrease.

milot-mirdita commented 7 years ago

Probably yes, however I am not sure who originally did the training. We would have to find out who it was (Armin?) and get them to redo it/document the process.

DanBuchan commented 7 years ago

FWIW it is a time and manpower issue that has also prevented us from proving that BLAST and BLAST+ produce equivalent PSIPRED outputs ;)
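For anyone who does find the time, a first equivalence check could be as simple as per-residue agreement between the two secondary-structure strings PSIPRED predicts for the same sequence. The following is a minimal Python sketch, not a benchmark protocol: the function name `ss_agreement` and both example predictions are made up, and it assumes the predictions have already been read into plain strings of states such as H, E and C.

```python
def ss_agreement(pred_a, pred_b):
    """Fraction of residues with the same state (e.g. H, E or C) in both predictions."""
    if len(pred_a) != len(pred_b):
        raise ValueError("predictions must cover the same sequence")
    if not pred_a:
        raise ValueError("predictions are empty")
    matches = sum(a == b for a, b in zip(pred_a, pred_b))
    return matches / len(pred_a)


legacy_ss = "CCHHHHHHCCEEEEECC"  # made-up prediction from a legacy BLAST run
plus_ss = "CCHHHHHCCCEEEEECC"    # made-up prediction from a BLAST+ run
print(f"agreement: {ss_agreement(legacy_ss, plus_ss):.3f}")
```

Averaging this over a large, representative set of sequences would give a rough picture; a formal proof of equivalence would still need the accuracy benchmark against known structures.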

meiermark commented 7 years ago

opened a project for this purpose

tamuanand commented 7 years ago

Hi

I would like to know if anybody has gotten the 'Build customized HHSuite Databases' to work?

If yes, I would really appreciate your inputs on

the workaround you have used for PsiPred -- which version of PsiPred did you use and what tweaks were required for the 'addss.pl' and/or any other scripts
where to obtain the dsspcmbi binary -- I did look up the dssp ftp site but could not find something that is named 'dsspcmbi'. Should one just download the latest and rename it as dsspcmbi for building custom database to work?

I myself have a thread seeking help related to the errors I encounter when building custom databases and likewise I see an earlier thread on the same topic.

I would appreciate any help on the above.

Thanks, Anand

tamuanand commented 5 years ago

> Strictly speaking BLAST+ support is experimental for PSIPRED but we are using it in production on our web server and as best as we can tell the results are equivalent. They haven't been fully benchmarked to formally prove this equivalence which is why it remains experimental.
>
> This PSIPRED and legacy BLAST dependency arises in the addss.pl script. I've attached the version of this script which I've monkey patched for BLAST+. Note the changes at lines 555, 558-560, 563-565 and 588
>
> addss.zip
>
> A more robust solution would choose which of the .mtx file creation steps to take depending on the blast version provided.
>
> As a general comment the addss.pl script should probably deprecate versions of PSIPRED prior to 3.5 given the difference in prediction accuracy from the pre-v3.0 releases to today's v4.0

I realize that this thread is very old

I recently saw that DanBuchan has updated runpsipred with the note "added experimental hhblits support" https://github.com/psipred/psipred/blob/master/runpsipred

My question: Does one need to change paths etc. in HHPaths.pm and/or addss.pl to use runpsipredplus, which uses BLAST+ instead of legacy blast?

Thanks in advance

DanBuchan commented 5 years ago

Well, a huge problem for us with BLAST+ is that the PSSMs produced are only output at 1 significant figure, whereas for legacy blast they have 3 significant figures. We've been toying with a number of hacks to get round this. PSIPRED includes a script, chkparse, which takes a BLAST+ PSSM and tries to interpolate the missing data and output a legacy blast format PSSM. This is ok but not great, and not sufficiently accurate for other SVM based tools we maintain.

Our current thinking for our new web server is to skip doing the sequence searching with PSIBLAST altogether. So we search with HHBlits, extract the alignment and use legacy blast over those sequences to output an old style PSSM. This has the benefit of working for all our methods but the huge drawback in that the NCBI have completely stopped distributing legacy blast. If you want to see how this works check the hhblits_psipred_hack.sh script and not the runpsipred script.

Everything about this is unsatisfactory. Our preferred solution would be one of:

a) We switch entirely to HHBlits and calculate the PSSM directly from the HHBlits alignments and rebenchmark/retrain ALL our methods. This would be the gold standard fix, but time and manpower are not as available as we'd like.

b) The BLAST developers add the option to increase the number of significant figures in the output PSSMs. We have put a feature/bug request in to have this added but have no idea when this might come about.

ahcm commented 5 years ago

Do the additional significant numbers really add information? My naive assumption would be that it’s just noise. Can you elaborate on this?

Thanks Andy

tamuanand commented 5 years ago

> Our current thinking for our new web server is to skip doing the sequence searching with PSIBLAST altogether. So we search with HHBlits, extract the alignment and use legacy blast over those sequences to output an old style PSSM. This has the benefit of working for all our methods but the huge drawback in that the NCBI have completely stopped distributing legacy blast. If you want to see how this works check the hhblits_psipred_hack.sh script and not the runpsipred script.

Can you point to where the hhblits_psipred_hack.sh script is available?

Thanks, Anand

DanBuchan commented 5 years ago

@ahcm Well it is something of a resolution/precision issue. old blast returned figures in the range 0-999 and blast+ returns a single integer 0-9.
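To make the resolution point concrete, here is a toy Python model. It is not the actual PSSM file format: it assumes a score normalised to [0, 1) and stored either as one of 10 integer levels (0-9) or as one of 1000 (0-999). The function names and the two sample scores are made up for illustration.

```python
def quantise(value, levels):
    """Map a value in [0, 1) to an integer bin in 0..levels-1."""
    return min(int(value * levels), levels - 1)


def restore(bin_index, levels):
    """Best guess for the original value: the midpoint of the bin."""
    return (bin_index + 0.5) / levels


def worst_error(levels, samples=10_000):
    """Largest round-trip error over an evenly spaced grid of inputs."""
    grid = (i / samples for i in range(samples))
    return max(abs(x - restore(quantise(x, levels), levels)) for x in grid)


a, b = 0.31, 0.38  # two made-up scores
print(quantise(a, 10) == quantise(b, 10))      # do they share a coarse bin?
print(quantise(a, 1000) == quantise(b, 1000))  # do they share a fine bin?
print(worst_error(10), worst_error(1000))
```

In this model the worst-case round-trip error shrinks in proportion to the number of levels, so two scores that are distinguishable at 0-999 can collapse into one value at 0-9. Whether that lost detail is signal or noise for a given predictor is the open question raised above.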

@tamuanand it is in the experimental directory in the current PSIPRED repo.

gnmcsbnfrmtcsclb commented 5 years ago

What is the current state of things now in June 2019? Which versions of psipred, its BLAST dependency, and accompanying DSSP version and corresponding download links should we be using?

soedinglab / hh-suite

Upgrade dependency on psipred to avoid dependency on Legacy Blast #36