julie-forman-kay-lab / IDPConformerGenerator

Build conformational representations of Intrinsically Disordered Proteins and Regions by a guided sampling of the protein torsion space
https://idpconformergenerator.readthedocs.io/
Apache License 2.0

Using Alpha Fold structures to construct the torsion database #241

Open joaomcteixeira opened 1 year ago

joaomcteixeira commented 1 year ago

When we first conceived IDPConfGen, we used Dunbrack's PISCES culled files as a list of non-redundant PDB files from which to generate IDPCG's database of observed torsion angles. However, the PISCES culled lists are updated constantly (which is good) as new PDB structures are released, forcing us to maintain a registry of versions. Moreover, we built all this logic and infrastructure before AlphaFold :wink:.

Now, we could use the AlphaFold Homo sapiens predicted proteome to build a reliable torsion angle database that has an honest distribution of observed torsion angles and is not biased by non-redundancy criteria or by which experimental structures are available. Besides, such a database will not need constant updates (until AF produces a new dataset). I believe AlphaFold structures are already devoid of structural inconsistencies, which would further improve the reliability of the database.

AlphaFold Homo sapiens database :point_right: https://alphafold.ebi.ac.uk/download#proteomes-section

The Homo sapiens dataset is already extensive, but we can consider expanding it later with other model organisms. We will need to take file size into account.

We cannot use every part of the structures that AlphaFold predicted, because they contain large disordered regions. But I think all residues with prediction scores (pLDDT) above 70 are reliable.
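A minimal sketch of what such a filter could look like, assuming the AlphaFold PDB files carry the per-residue pLDDT in the B-factor column (which is how the AlphaFold DB distributes them). The function name `confident_segments` and the `min_length` parameter are hypothetical and not part of IDPConfGen; the idea is only to keep contiguous stretches of confident residues, since torsion angles need neighbouring residues to be meaningful.

```python
def confident_segments(pdb_lines, min_plddt=70.0, min_length=3):
    """Return runs of consecutive residues with pLDDT above min_plddt.

    Each run is reported as a (chain, first_resnum, last_resnum) tuple.
    Runs shorter than min_length residues are discarded.
    """
    # One pLDDT value per residue: AlphaFold writes the same value on
    # every atom of a residue, so reading the CA atom is enough.
    residues = []
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]
            resnum = int(line[22:26])
            plddt = float(line[60:66])  # B-factor column holds pLDDT
            residues.append((chain, resnum, plddt))

    segments = []
    current = []  # (chain, resnum) pairs of the run being built

    def flush():
        # Store the current run if it is long enough.
        if len(current) >= min_length:
            segments.append((current[0][0], current[0][1], current[-1][1]))

    for chain, resnum, plddt in residues:
        confident = plddt > min_plddt
        contiguous = bool(current) and current[-1] == (chain, resnum - 1)
        if confident and (not current or contiguous):
            current.append((chain, resnum))
        else:
            # Run broken by a low-confidence residue, a numbering gap,
            # or a chain change: close it and maybe start a new one.
            flush()
            current = [(chain, resnum)] if confident else []
    flush()
    return segments
```

Calling it on the lines of one AlphaFold model, for example `confident_segments(open(path).read().splitlines())`, would give the residue ranges from which torsion angles could be extracted. The cutoff is strict (`>`), so residues at exactly 70 are excluded.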

If we do this, we can reliably distribute the torsion database to users.

The existing clients used to create a database will still be useful (we should maintain them) but would be much less relevant.

What do you think? Cheers,

menoliu commented 1 year ago

Great point! I also concur that we should not consider any residues with pLDDT <= 70, as those models do not use the power of deep MSA. However, shouldn't we also take conditionally folded conformers into consideration? I.e., some disordered proteins are represented as conditional folders, and those can be captured in AlphaFold... what should we do about them?

menoliu commented 1 year ago

@joaomcteixeira I've just spoken to Julie and she's not very enthusiastic about the idea of us using AF structures, as those predicted structures are not equivalent to experimental structures. However, I agree with Julie that we should at least update the database we're using with the latest PISCES/RCSB structures.

I was also thinking: the size of the database is somewhat related to the speed of building, because of the filtering process. If our database is very large (with the human proteome), wouldn't that slow down filtering for longer-ish IDPs? I think users who want to could still use AF structures in their own database :)