PDB Data Preprocess scripts

pengzhangzhi commented 1 year ago

Hi Yim, I would like to contribute to one of your TODO items, namely "Set-up easily downloadable training data". To complete this task, we need a PDB data preprocessing script, since your original script only supports the .mmcif format. Building on your existing script data/process_pdb_files.py, I implemented a few more features so that it now does the same things as data/process_pdb_dataset.py, which only deals with .mmcif files. With the updated data/process_pdb_files.py, we can build protein structure datasets from raw PDB-format files. Using PDB files has many advantages. For example, many cleaned protein datasets, such as CATH, are based on the PDB format. I am also trying to reproduce the training using CATH data, to see whether a small but high-quality set of protein structures can reproduce results similar to se3 diffusion. Given OpenFold's finding that training on a small portion of structures can achieve performance competitive with AF2, I am confident about se3 diffusion. : ) I will share the training details, logs, and evaluation results. Stay tuned!
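To give a rough idea of what reading raw PDB files involves, here is a minimal sketch that pulls C-alpha coordinates per chain out of the fixed-column ATOM records. It is not the code in data/process_pdb_files.py; the names `extract_ca_coords` and `_atom_line` are hypothetical, and the sample coordinates are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Coord = Tuple[float, float, float]


def extract_ca_coords(pdb_lines: Iterable[str]) -> Dict[str, List[Coord]]:
    """Collect C-alpha coordinates per chain from PDB ATOM records.

    Hypothetical helper for illustration only. It relies on the
    fixed-column layout of the PDB format (0-based slices below).
    """
    chains: Dict[str, List[Coord]] = {}
    for line in pdb_lines:
        if line.startswith("ENDMDL"):
            break  # keep only the first model of a multi-model file
        if not line.startswith("ATOM  "):
            continue  # skip HETATM, TER, REMARK, and other records
        if line[12:16].strip() != "CA":
            continue  # atom name lives in columns 13-16
        if line[16] not in (" ", "A"):
            continue  # keep only the first alternate location
        chain_id = line[21]  # chain identifier, column 22
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        chains.setdefault(chain_id, []).append(xyz)
    return chains


def _atom_line(serial: int, name: str, res: str, chain: str,
               resseq: int, x: float, y: float, z: float) -> str:
    """Build a column-aligned ATOM record (made-up data for the demo)."""
    return (f"ATOM  {serial:>5} {name:<4} {res:>3} {chain}{resseq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


# Made-up records: two residues in chain A, one in chain B.
SAMPLE = [
    _atom_line(1, " N", "MET", "A", 1, 1.0, 2.0, 3.0),
    _atom_line(2, " CA", "MET", "A", 1, 2.0, 3.0, 4.0),
    _atom_line(3, " CA", "GLN", "A", 2, 5.5, 6.5, 7.5),
    _atom_line(4, " CA", "GLY", "B", 1, 10.0, 11.5, -12.25),
]

coords = extract_ca_coords(SAMPLE)
for chain_id, xyz_list in coords.items():
    print(chain_id, len(xyz_list))
```

For a real file you would pass the lines of an open file handle instead of `SAMPLE`. A full preprocessing script would also need to handle missing residues, non-standard residues, and insertion codes, which this sketch ignores.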

You can review the commit history to see the details of the changes I made.

Best, Zhangzhi

amorehead commented 1 year ago

@pengzhangzhi, in case you or anyone else finds it useful, Arian Jamasb (lead developer of Graphein) and I recently released a generic Python PDBManager class within Graphein that allows you to conveniently and powerfully curate ML-ready PDB file datasets directly from the RCSB PDB: https://github.com/a-r-j/graphein/blob/master/notebooks/creating_datasets_from_the_pdb.ipynb

jasonkyuyim commented 1 year ago

@pengzhangzhi This is great. Thank you! I'll take a look at the pull request near the tail end of this week. Looking forward to seeing what results you get on CATH.

@amorehead Thanks for the pointer!

jasonkyuyim / se3_diffusion

PDB Data Preprocess scripts #15