jasonkyuyim / se3_diffusion

Implementation for SE(3) diffusion model with application to protein backbone generation
https://arxiv.org/abs/2302.02277
MIT License

question about dataset filtering #16

Closed Wangchentong closed 1 year ago

Wangchentong commented 1 year ago

Hi,

  Thank you to you and your team for open-sourcing this impressive work; I really appreciate that it was released as open source. I'm interested in understanding more about your data filtering strategy.

  1. I noticed that you only trained the model on monomeric CIF files with max_len < 512. Would it be beneficial to also include single chains extracted from multimeric CIF files? This would increase the dataset size roughly 5x, to about 100k proteins, even though those structures may not hold in the monomeric state. (A sketch of the kind of chain-level filter I mean follows this list.)

  2. Have you considered expanding the dataset with representative cluster members from the high-confidence AFDB? If not, what is your opinion on this approach? (The same sketch below includes a pLDDT filter for this case.)
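
For concreteness, here is a minimal sketch of the kind of filtering I have in mind for both points, using Biopython. The 512-residue cutoff matches your setting; the per-chain extraction and the mean-pLDDT threshold (AFDB stores per-residue pLDDT in the B-factor column) are my own illustrative choices, not your actual pipeline:

```python
# Hypothetical filtering sketch, not the repo's actual data pipeline.
from Bio.PDB import MMCIFParser

MAX_LEN = 512      # same length cutoff as in the paper
MIN_PLDDT = 70.0   # illustrative "high confidence" threshold for AFDB entries

def eligible_chains(cif_path, check_plddt=False):
    """Yield IDs of chains that pass the length (and optional pLDDT) filters."""
    parser = MMCIFParser(QUIET=True)
    structure = parser.get_structure("entry", cif_path)
    model = next(structure.get_models())  # use the first model only
    for chain in model:
        # Keep standard amino-acid residues (hetero flag " " in Biopython).
        residues = [res for res in chain if res.id[0] == " "]
        if not 0 < len(residues) <= MAX_LEN:
            continue
        if check_plddt:
            # AFDB files store per-residue pLDDT in the B-factor column.
            plddts = [res["CA"].get_bfactor() for res in residues if "CA" in res]
            if not plddts or sum(plddts) / len(plddts) < MIN_PLDDT:
                continue
        yield chain.id
```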

Thank you for your help and I look forward to hearing back from you soon.

jasonkyuyim commented 1 year ago

Hi,

Thanks for the message.

  1. Yes, we only trained on monomers to simplify the application of our method. A follow-up work would seek to train on multimeric chains and tackle more interesting tasks (binder design, protein-protein interaction). Our paper was primarily concerned with the SE(3) diffusion methodology and its theoretical grounding, which was very non-trivial! We have found more data to lead to better results, so I am optimistic that more data in these regimes would help.
  2. The thought of using AFDB high-confidence proteins did cross my mind but was not a high priority for the reasons stated above; it was outside the scope of this work. I think it is a very interesting extension, as we've seen that AF2 distillation helps RFdiffusion tremendously. It will be interesting to see how a pure MSA-free generative model can learn from the same source of structures.

Happy to answer any more questions!

Wangchentong commented 1 year ago

Great! Actually, I have two more questions 😄

1. About pretraining with a structure prediction task: your lab has published a paper on protein structure prediction with a generative model (EigenFold), so it would be convenient to incorporate that task into FrameDiff (or to use ESM2 output rather than a single-sequence embedding). The power of structure-prediction pretraining has been demonstrated by RFdiffusion. Have you considered training a structure prediction model with FrameDiff as a starting point? (A rough sketch of the ESM2 idea follows this list.)

2. Have you considered extending your work to small-molecule binder generation?
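
To make point 1 concrete, here is a rough sketch of what I mean by swapping the single-sequence embedding for ESM2 output, using the public fair-esm package; treating the representations as per-residue node features for FrameDiff is my assumption, not something from your codebase:

```python
# Hypothetical sketch: per-residue ESM2 embeddings as conditioning features.
import torch
import esm  # pip install fair-esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # example sequence
_, _, tokens = batch_converter([("query", seq)])

with torch.no_grad():
    out = model(tokens, repr_layers=[33])

# Drop the BOS/EOS tokens: one 1280-dim vector per residue.
per_residue = out["representations"][33][0, 1 : len(seq) + 1]

# These [L, 1280] features could stand in for the single-sequence node
# embedding that the diffusion model would otherwise consume.
```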

Wangchentong commented 1 year ago

Besides, regarding "as we've seen that AF2 distillation in RFdiffusion helps it tremendously": is that in the RFdiffusion paper? It would be nice if you could explain it a little more 😊

jasonkyuyim commented 1 year ago

  1. I view RFdiffusion as an extension of FrameDiff where a pre-trained RoseTTAFold is trained with the SE(3) diffusion derived in FrameDiff. (I was involved in incorporating SE(3) diffusion into RFdiffusion.) There are some minor differences, like the rotation loss (a generic sketch follows below this list). I think additional investigation into the benefits of pre-training is important (as we stated in the paper). What I meant by AF2 distillation is that RoseTTAFold uses a form of distillation from the AFDB, so I think it plays a large part in why RFdiffusion works well.
  2. I am personally not working on this. I think it is a great application.
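
For context on the rotation loss point above, here is a generic geodesic loss on rotation matrices; it is only illustrative and not the exact training loss of either FrameDiff or RFdiffusion:

```python
# Illustrative geodesic rotation loss, not the exact loss of either model.
import torch

def rotation_geodesic_loss(R_pred: torch.Tensor, R_true: torch.Tensor) -> torch.Tensor:
    """Mean geodesic angle (radians) between batches of 3x3 rotation matrices."""
    # Relative rotation R_pred^T @ R_true.
    R_rel = torch.einsum("...ij,...ik->...jk", R_pred, R_true)
    # For a rotation by angle theta, trace(R) = 1 + 2*cos(theta).
    trace = R_rel.diagonal(dim1=-2, dim2=-1).sum(-1)
    cos_theta = ((trace - 1.0) / 2.0).clamp(-1.0 + 1e-6, 1.0 - 1e-6)
    return torch.acos(cos_theta).mean()
```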

Btw I'm extremely busy with end-of-semester matters, so my responses will likely be delayed.

Wangchentong commented 1 year ago

No worries, I really appreciate your reply! A final question: may I ask for your insight on what benefits a lightweight model like FrameDiff (without axial attention or an MSA transformer) can offer compared to RoseTTAFold for structure-prediction pretraining?

jasonkyuyim commented 1 year ago

I believe a lightweight model like FrameDiff is better for methods research. It's easier to train and analyze than the RoseTTAFold pipeline, which is not publicly trainable. Obviously, larger and more complex models will perform best, but uncovering the scaling principles will first be done with lightweight models.