Closed by pengzhangzhi 1 year ago
@pengzhangzhi, in case you or anyone else finds it useful, Arian Jamasb (lead developer of Graphein) and I recently released a generic Python PDBManager
class within Graphein that allows you to conveniently and powerfully curate ML-ready PDB file datasets directly from the RCSB PDB: https://github.com/a-r-j/graphein/blob/master/notebooks/creating_datasets_from_the_pdb.ipynb
@pengzhangzhi This is great. Thank you! I'll take a look at the pull request near the tail end of this week. Looking forward to seeing what results you get on CATH.
@amorehead Thanks for the pointer!
Hi Yim, I would like to contribute to one of your TODOs, namely:
Set-up easily downloadable training data.
To complete this task, we need a PDB data preprocessing script, since your original script only supports the .mmcif format. Building on your existing script data/process_pdb_files.py, I implemented a few more features so that it does the same things as data/process_pdb_dataset.py, which only deals with .mmcif files. With the updated data/process_pdb_files.py, we can build protein structure datasets from raw PDB-format files. Using PDB files has many advantages. For example, many cleaned protein datasets, such as CATH, are based on the PDB format. I am also trying to reproduce the training using CATH data, to see whether a small but high-quality set of protein structures can reproduce results similar to those of se3 diffusion. Given OpenFold's finding that training on a small portion of structures can achieve performance competitive with AF2, I am confident about se3 diffusion. : ) I will share the training details, logs, and evaluation results. Stay tuned! You can review the commit history to see the details of the changes I made.
Best, Zhangzhi
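For anyone curious what reading raw PDB-format files involves, here is a minimal sketch using only the Python standard library. It is not the code in data/process_pdb_files.py; the function names parse_backbone and complete_residues are hypothetical and chosen only for illustration. It assumes well-formed fixed-column ATOM records, keeps only the first model, and keeps only the backbone atoms N, CA, C, and O.

```python
# Illustrative sketch only: not the repository's preprocessing script.
# Assumes standard fixed-column PDB ATOM records.

BACKBONE = ("N", "CA", "C", "O")


def parse_backbone(pdb_text):
    """Collect backbone coordinates from the first model of a PDB file.

    Returns {chain_id: {(res_name, res_seq, insertion_code): {atom: (x, y, z)}}}.
    """
    chains = {}
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break  # stop after the first model (relevant for NMR ensembles)
        if not line.startswith("ATOM  ") or len(line) < 54:
            continue  # skip HETATM, headers, and truncated records
        atom_name = line[12:16].strip()
        alt_loc = line[16]
        # Keep backbone atoms only, and only the primary alternate location.
        if atom_name not in BACKBONE or alt_loc not in (" ", "A"):
            continue
        residue_key = (line[17:20].strip(), int(line[22:26]), line[26].strip())
        coords = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        residues = chains.setdefault(line[21], {})
        residues.setdefault(residue_key, {})[atom_name] = coords
    return chains


def complete_residues(chains):
    """Drop residues that are missing any of the four backbone atoms."""
    return {
        chain_id: {
            key: atoms
            for key, atoms in residues.items()
            if all(name in atoms for name in BACKBONE)
        }
        for chain_id, residues in chains.items()
    }
```

To try it, read a file with open(path).read() and pass the text to parse_backbone. A real pipeline would also need to handle things this sketch ignores, such as non-standard residues, chain breaks, and per-residue metadata.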