soedinglab / hh-suite

Remote protein homology detection suite.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3019-7
GNU General Public License v3.0

Role of cstranslate in building a custom database? #250

Open spetti opened 3 years ago

spetti commented 3 years ago

I am building a custom database according to the instructions in the user guide. I would like to understand more specifically how cstranslate converts an alignment to an abstract state alphabet. In particular, does the translation of one specific alignment depend on the other alignments in the database?

For context, I am benchmarking hhblits in a way that requires building a custom database from both real protein sequences and randomly generated synthetic sequences. When I make a custom database (following the pipeline described in the user guide*) on a small set of 250 real sequences and 500 synthetic sequences, the database looks great and performs as expected. However, when I do the same thing for a set of 20,000 real sequences and 200,000 synthetic sequences, the performance on the resulting database is very poor. About 75% of the profile-profile pairs that correspond to true positives aren't being compared, presumably because of the prefilter.

Since the prefilter is not acting as expected and the user guide has an ominous warning ("Be careful with the parameters. Leaving anything out, might result in a database that looks superficially correct, but will perform very badly."), my guess is that the issue stems from cstranslate. Could the 200,000 alignments corresponding to the synthetic sequences degrade the performance of cstranslate for all the alignments? Any information on how cstranslate works would be much appreciated. (I am trying to avoid having to read the source code.) Any other possible explanations for the poor performance, or debugging suggestions, would also be much appreciated!

milot-mirdita commented 3 years ago

cstranslate maps each profile column (20 values) to one of 219 prototypical profile states (1 value). The prefilter can then do ungapped alignments of queries to these profile states, which is relatively fast while still being extremely sensitive, to uncover possible remote homology. If a query is accepted, it is further processed with progressively slower algorithms.
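To make the column-to-state mapping concrete, here is a minimal sketch of the general idea: each column is assigned the index of the best-matching prototype in a fixed state library. This is an illustration only, not cstranslate's actual code. The function name `translate_profile`, the log-likelihood style score, and the toy 3-letter alphabet are all assumptions made for the example; the real tool works with 20 amino acids and 219 states and may score columns differently.

```python
import math

def translate_profile(profile, states, eps=1e-9):
    """Map each profile column to the index of its best-scoring prototype state.

    profile: list of columns, each a list of residue probabilities.
    states:  list of prototype columns (hypothetical stand-in for a state library).
    The scoring function is an assumption chosen for illustration.
    """
    def score(column, state):
        # Expected log-probability of the column under the state; higher is better.
        return sum(p * math.log(q + eps) for p, q in zip(column, state))

    return [
        max(range(len(states)), key=lambda k: score(column, states[k]))
        for column in profile
    ]

# Toy 3-letter alphabet instead of 20 amino acids, 3 states instead of 219.
toy_states = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
toy_profile = [
    [0.7, 0.2, 0.1],
    [0.05, 0.05, 0.9],
    [0.2, 0.6, 0.2],
]
print(translate_profile(toy_profile, toy_states))  # [0, 2, 1]
```

In this sketch the result for one profile depends only on that profile and the fixed state library, not on any other profile being translated. Whether the same holds for cstranslate itself is worth confirming against its source or documentation.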

If you see the "ERROR: No abstract states provided!" error, you should upgrade to the latest release; I changed cstranslate so that it does the right thing by default. A separate problem in the old version was that, if you didn't provide the -b flag, it would produce output that only the old, unsupported HH-suite 2.x could read. I also dropped that flag recently.

I can't say how our tools will react to synthetic sequences. As a lot of assumptions about real proteins are built into the methods, it might go pretty badly.

In our benchmarks we usually use either shuffled or reversed sequences to keep the amino acid composition intact.
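As a rough illustration of that suggestion, the sketch below builds a shuffled and a reversed decoy from a sequence and checks that both keep the original amino acid composition. The function names and the example sequence are made up for this sketch; this is not the benchmark code used by the HH-suite authors.

```python
import random
from collections import Counter

def shuffled_decoy(sequence, seed=0):
    """Return a random permutation of the residues (seeded for reproducibility)."""
    residues = list(sequence)
    random.Random(seed).shuffle(residues)
    return "".join(residues)

def reversed_decoy(sequence):
    """Return the sequence read back to front."""
    return sequence[::-1]

# Arbitrary example sequence, not taken from any real benchmark set.
example = "MKTAYIAKQRQISFVKSHFSRQ"

for decoy in (shuffled_decoy(example), reversed_decoy(example)):
    # Both decoys contain exactly the same residues as the original.
    assert Counter(decoy) == Counter(example)
    assert len(decoy) == len(example)
```

Both kinds of decoy preserve length and composition. The reversed decoy also keeps each residue's immediate neighbours, only in mirrored order, while shuffling removes all positional structure.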

But it might just as well be this weird -b flag. Here you can find an invocation of cstranslate in one of our prebuilt databases: https://github.com/soedinglab/hhdatabase_cif70/blob/master/pdb70_cstranslate.sh