sokrypton / ColabFold

Making Protein folding accessible to all!
MIT License

High amount of clashes in multimer models #72

Open dennissv opened 3 years ago

dennissv commented 3 years ago

Hi, thanks for the update! I have had great use of ColabFold so far.

Using the new multimer models, I very often get a large number of clashes, which can be distracting when looking at interactions, and hiding them is time-consuming when going through many predictions. This only seems to happen when there are larger disordered regions, and if the disordered regions are very large, the proteins can collapse into a blob. One less extreme example is shown below (from your new multimer notebook):

[Image: cf_clashes_example]

The disordered region after the well-predicted domain is tangled and blocks the view of the interface. It also calls into question whether the interface is actually possible, given that the disordered region originates from that location.

This is something I've noticed in the AlphaFold notebooks, in our local AlphaFold setup, and now in your new multimer notebook. Your original code for complexes, however, seems to always respect stereochemistry. Is this just how the new multimer models handle uncertainty, or is something going wrong? And how are we supposed to interpret and work around this large number of clashes: hide and ignore them, or throw away the prediction as too low confidence?

Thanks for your help!
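
One way to hide the tangled regions in bulk, rather than by hand in a viewer, is to filter each model on per-residue confidence. AlphaFold-style PDB output stores pLDDT in the B-factor column, so a minimal sketch could look like the following. The function name `drop_low_confidence` and the cutoff of 50 are illustrative assumptions, not part of ColabFold.

```python
def drop_low_confidence(pdb_text: str, min_plddt: float = 50.0) -> str:
    """Return PDB text without atoms whose B-factor (pLDDT) is below min_plddt.

    The threshold of 50 is an assumed, illustrative cutoff.
    """
    kept = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            # PDB columns 61-66 (1-based) hold the B-factor, where
            # AlphaFold-style output stores the per-residue pLDDT.
            plddt = float(line[60:66])
            if plddt < min_plddt:
                continue
        # Non-atom records (TER, END, REMARK, ...) are passed through unchanged.
        kept.append(line)
    return "\n".join(kept) + "\n"
```

The filtered text can be written to a new file and opened in any viewer. This only hides low-confidence residues; it does not make the remaining interface more trustworthy.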

martin-steinegger commented 3 years ago

It has to do with how the models were trained. According to Richard Evans, both models, AF2 and AF2-multimer, were trained in two steps: (1) the initial weights were trained without any violation/clash losses, and (2) the models were then fine-tuned with clash losses.
However, step (2) made the multimer predictions worse, so AlphaFold-Multimer was not fine-tuned for as long as the original AF2 model, resulting in more clashes. When the multimer model is highly uncertain, it can put the chains on top of each other.
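
To get a rough number for how badly the chains overlap in a given prediction, one could count atom pairs from different chains that sit closer than a distance cutoff. The sketch below is not the violation loss used in training; the 2.0 Å cutoff and the function names are assumptions chosen for illustration, and the brute-force pair loop is only practical for small models.

```python
import math
from itertools import combinations


def parse_atoms(pdb_text):
    """Return a list of (chain_id, (x, y, z)) for every ATOM record."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            chain = line[21]  # PDB column 22: chain identifier
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            atoms.append((chain, xyz))
    return atoms


def count_interchain_clashes(pdb_text, cutoff=2.0):
    """Count atom pairs in different chains closer than `cutoff` angstroms.

    The 2.0 A default is an assumed, illustrative threshold.
    """
    clashes = 0
    for (chain_a, xyz_a), (chain_b, xyz_b) in combinations(parse_atoms(pdb_text), 2):
        if chain_a != chain_b and math.dist(xyz_a, xyz_b) < cutoff:
            clashes += 1
    return clashes
```

A prediction with many inter-chain clashes is a hint that the chains were placed on top of each other, as described above.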

zach-hensel commented 2 years ago

In the original AF2 model you don't get this, but you get a self-avoiding sphere that hopefully isn't still in the middle of another chain when you run out of recycles... The relaxation step can last indefinitely if that happens and you have relaxation enabled. The model isn't going to do what it wasn't trained (specifically or otherwise) to do. I figure there are lots of examples of things that are absent from the PDB or very underrepresented, where MD can be used for data augmentation.

Maybe it's possible to start without the predicted disordered domains and use that prediction as a template.
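
As a first step toward that idea, the likely disordered stretches could be located from the per-residue pLDDT of an initial prediction. Treating long runs of low pLDDT as disorder is a heuristic, not something the thread confirms; the function name, the threshold of 50, and the minimum run length of 10 below are all assumptions.

```python
def low_confidence_segments(plddt, threshold=50.0, min_length=10):
    """Return 0-based, half-open (start, end) ranges where pLDDT stays below
    `threshold` for at least `min_length` consecutive residues.

    Both defaults are assumed, illustrative values.
    """
    segments = []
    start = None
    for i, score in enumerate(plddt):
        if score < threshold:
            if start is None:
                start = i  # a low-confidence run begins here
        elif start is not None:
            if i - start >= min_length:
                segments.append((start, i))
            start = None
    # Close a run that extends to the end of the chain.
    if start is not None and len(plddt) - start >= min_length:
        segments.append((start, len(plddt)))
    return segments
```

The returned ranges mark candidate regions to leave out of a second run; whether the trimmed prediction works well as a template would still need to be tested.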