soedinglab / hh-suite

Remote protein homology detection suite.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3019-7
GNU General Public License v3.0

[Suggestion] Change the Wiki's recommendation for multi-domain proteins #267

Open apcamargo opened 3 years ago

apcamargo commented 3 years ago

In the Wiki it is stated:

For long sequences, it may therefore be of advantage to first search the PDB or the SCOP domain database and then to cut the query sequence into smaller parts on the basis of the identified structural domains. Pfam or CDD are - in our opinion - less suitable to determine domain boundaries.

I'm not sure PDB is a good choice for multi-domain proteins, though, as it contains some unprocessed polyproteins that will usually have lower E-values than each individual domain (e.g., https://www.rcsb.org/structure/2IJD).

Also, is there any specific reason for Pfam to be less suitable for boundaries? I've been using it together with SCOP and got good results.
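To make the Wiki's recommendation concrete, here is a minimal Python sketch of the "cut the query into smaller parts" step. It assumes you have already read the domain boundaries off a search against a structure-based database and entered them by hand as 1-based, inclusive coordinates. The names `split_by_domains` and `to_fasta`, the `pad` parameter, and the header format are hypothetical and not part of HH-suite.

```python
from typing import List, Tuple

# (label, start, end) with 1-based, inclusive coordinates on the query.
Domain = Tuple[str, int, int]


def split_by_domains(query: str, domains: List[Domain], pad: int = 0) -> List[Tuple[str, str]]:
    """Return one (header, subsequence) pair per domain, ordered by start.

    `pad` extends each segment by that many residues on both sides,
    clipped to the ends of the query. Boundaries are assumed to come
    from a prior search; this function does not detect domains itself.
    """
    pieces = []
    for label, start, end in sorted(domains, key=lambda d: d[1]):
        if not (1 <= start <= end <= len(query)):
            raise ValueError(f"domain {label!r} ({start}-{end}) is outside the query")
        lo = max(1, start - pad)
        hi = min(len(query), end + pad)
        # Convert 1-based inclusive coordinates to a Python slice.
        pieces.append((f"{label}_{lo}-{hi}", query[lo - 1:hi]))
    return pieces


def to_fasta(pieces: List[Tuple[str, str]], width: int = 60) -> str:
    """Format (header, sequence) pairs as FASTA text, wrapped at `width`."""
    lines = []
    for header, seq in pieces:
        lines.append(f">{header}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

Each resulting segment can then be written to its own file and searched separately. A small `pad` may help when the reported boundaries are slightly too tight, but that is a judgment call rather than something the Wiki prescribes.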

martin-steinegger commented 3 years ago

Thank you for the remark. Answer from @soeding:

"Many Pfam domain families were founded when no structure of any member was yet available. Oftentimes, the domain boundaries defined by sequence-based methods have been quite inaccurate, comprising fractions of a domain, or a domain and a half, etc. Pfam has historically been very slow in updating its family definitions to harmonize with the domain boundaries elucidated by protein structure determination. Therefore, Pfam is less suited to determine boundaries of structural/functional domains than CATH / SCOP / ECOD, which are based on the PDB."

apcamargo commented 3 years ago

Thanks! I got it now!

I managed to get the download links for the HHPred databases, so I can use SCOPe and ECOD now. Regardless, my only concern is that PDB contains some precursors and we shouldn't just expect the matches to be unit domains (unless you remove precursors and polyproteins from the database beforehand).
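As a rough illustration of removing precursors and polyproteins beforehand, the sketch below drops entries from a plain FASTA list of template sequences when the header mentions one of a few keywords. This is only a heuristic under stated assumptions: the keyword list, the function names `read_fasta` and `drop_flagged`, and the idea that such entries are annotated in their headers are all hypothetical, and entries without such annotation will slip through. It works on FASTA text, not on a built HH-suite database.

```python
from typing import Iterable, Iterator, Tuple

# Assumed header markers; real annotations may use different wording.
KEYWORDS = ("polyprotein", "precursor")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def drop_flagged(records: Iterable[Tuple[str, str]],
                 keywords: Tuple[str, ...] = KEYWORDS) -> Iterator[Tuple[str, str]]:
    """Keep only records whose header contains none of `keywords` (case-insensitive)."""
    for header, seq in records:
        lowered = header.lower()
        if not any(word in lowered for word in keywords):
            yield header, seq
```

For a file on disk, something like `list(drop_flagged(read_fasta(open("templates.fasta"))))` would give the retained records, where `templates.fasta` is a placeholder file name. A filter based on template length or on domain annotations from SCOPe or ECOD might be more reliable, but that would need data not discussed in this thread.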