jasonkyuyim / se3_diffusion

Implementation for SE(3) diffusion model with application to protein backbone generation
https://arxiv.org/abs/2302.02277
MIT License

errors in processing mmcif file #4

Closed pengzhangzhi closed 1 year ago

pengzhangzhi commented 1 year ago

The README says to use processed_pdb_dataset.py to process mmCIF files, but that file does not exist in the working directory.

python processed_pdb_dataset.py --mmcif_dir <pdb_dir> 

I found a similarly named file at data/process_pdb_dataset.py. I downloaded only part of the mmCIF data from the PDB by interrupting the download process.

...
bk/6bkl.cif.gz
bk/6bkm.cif.gz
^Crsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(644) [generator=3.1.2]
rsync error: received SIGUSR1 (code 19) at main.c(1447) [receiver=3.1.2]

I ran python process_pdb_dataset.py --mmcif_dir ../mmCIF/ to generate the data and got the following error.


Traceback (most recent call last):
  File "/root/se3_diffusion/data/process_pdb_dataset.py", line 306, in <module>
    main(args)
  File "/root/se3_diffusion/data/process_pdb_dataset.py", line 292, in main
    all_metadata = pool.map(_process_fn, all_mmcif_paths)
  File "/opt/anaconda3/envs/se3/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/opt/anaconda3/envs/se3/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Is this problem caused by incomplete mmCIF files, since I interrupted the download?

jasonkyuyim commented 1 year ago

Thanks for the issue! Yes, the correct script is process_pdb_dataset.py. I forgot a step in the data processing: you must unzip all the *.cif.gz files with the following command: gzip -d **/*.gz. I've updated the README. Please update here if you still have any issues.
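In case **/*.gz doesn't expand on your shell (bash only recurses with shopt -s globstar), here's a small Python equivalent; the ../mmCIF path is just an example:

import glob
import gzip
import os
import shutil

mmcif_dir = "../mmCIF"  # example path; point this at your download directory
for gz_path in glob.glob(os.path.join(mmcif_dir, "**", "*.cif.gz"), recursive=True):
    cif_path = gz_path[:-3]  # drop the trailing ".gz"
    with gzip.open(gz_path, "rb") as f_in, open(cif_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)  # write out the decompressed .cif
    os.remove(gz_path)  # optionally remove the archive afterwards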

pengzhangzhi commented 1 year ago

Thank you! It still doesn't work; no files get processed. I executed gzip -d mmCIF/**/*.gz and all the .gz files are extracted.

se3_diffusion/data# python process_pdb_dataset.py --mmcif_dir ../mmCIF/
/opt/anaconda3/envs/se3/lib/python3.9/site-packages/Bio/Data/SCOPData.py:18: BiopythonDeprecationWarning: The 'Bio.Data.SCOPData' module will be deprecated in a future release of Biopython in favor of 'Bio.Data.PDBData.
  warnings.warn(
Gathering mmCIF paths
100%|█████████████████████████| 1060/1060 [00:00<00:00, 10124.13it/s]
Processing 12944 files our of 13795
Files will be written to ./data/processed_pdb
Finished processing 0/12944 files

In addition, I am wondering whether downloading the whole PDB mmCIF archive is really necessary, since protein data is largely redundant. Maybe a portion of the data with careful filtering (sequence identity, length, resolution, etc.) would be enough?
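For example, a first-pass filter on the metadata CSV that process_pdb_dataset.py writes could look something like the sketch below; the column names seq_len and resolution are just placeholders, the real header may differ.

import pandas as pd

# Load the metadata CSV written by the processing script (path is an example).
metadata = pd.read_csv("./data/processed_pdb/metadata.csv")

# Hypothetical first-pass filter on chain length and resolution;
# column names are placeholders, check the actual CSV header.
filtered = metadata[
    metadata["seq_len"].between(60, 512)
    & (metadata["resolution"] <= 5.0)
]
filtered.to_csv("./data/processed_pdb/metadata_filtered.csv", index=False)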

pengzhangzhi commented 1 year ago

I turned on verbose mode, and the error becomes clearer:

Failed ../mmCIF/b8/1b8c.cif: Mdtraj failed with error The topology is loaded by filename extension, and the detected ".cif" format is not supported. Supported topology formats include ".pdb", ".pdb.gz", ".h5", ".lh5", ".prmtop", ".parm7", ".prm7", ".psf", ".mol2", ".hoomdxml", ".gro", ".arc", ".hdf5" and ".gsd".

Seems like the .cif extension is the problem. I extracted the files per your instructions, so I'm still confused... I'm wondering whether PDB files are supported by your preprocessing script?

jasonkyuyim commented 1 year ago

I've fixed this now https://github.com/jasonkyuyim/se3_diffusion/commit/08c1e98bf4be1f77b9b40ff4eeb64ddf960a953a

I think I must have installed MDtraj from source in order to absorb this commit https://github.com/mdtraj/mdtraj/issues/652

But MDtraj's latest release does not support mmcif. I hacked in writing to a temporary PDB file to calculate the MDtraj quantities. This now works for me but please raise any more issues! I'll try to address them asap. Thanks again for the beta testing ;)
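For reference, the idea is roughly the sketch below (simplified, not the exact code in the repo): parse the mmCIF with Biopython, write a temporary PDB file, and hand that to MDtraj.

import os
import tempfile

import mdtraj as md
from Bio.PDB import MMCIFParser, PDBIO

def mdtraj_from_mmcif(mmcif_path):
    # Parse the mmCIF with Biopython, then round-trip through a temporary
    # PDB file because released MDtraj versions cannot read .cif directly.
    structure = MMCIFParser(QUIET=True).get_structure("struct", mmcif_path)
    io = PDBIO()
    io.set_structure(structure)
    tmp = tempfile.NamedTemporaryFile(suffix=".pdb", delete=False)
    tmp.close()
    try:
        io.save(tmp.name)
        traj = md.load(tmp.name)
        dssp = md.compute_dssp(traj)   # per-residue secondary structure
        sasa = md.shrake_rupley(traj)  # solvent accessible surface area
    finally:
        os.remove(tmp.name)
    return dssp, sasa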

jasonkyuyim commented 1 year ago

Regarding your question about downloading all of PDB: agreed, downloading all of PDB is a big waste since we don't use most of it. I'm currently busy with other duties, but it is on my list to create a deduplicated, processed monomer dataset (https://files.wwpdb.org/pub/pdb/data/monomers/). I'm not sure how best to host it. If you (or anyone else) would like to help, please email me. Until then, I'm afraid you'll have to download the PDB and follow the preprocessing steps I used for the project. We thought it was more important to get a clean implementation out first and then gradually improve the code and data.

pengzhangzhi commented 1 year ago

Yep! You are very thoughtful! I am wondering if you could adapt your data preprocessing pipeline to PDB files, since a lot of great PDB datasets are available off-the-shelf.

jasonkyuyim commented 1 year ago

Sorry for the delay, I got around to adding a file: process_pdb_files.py. It is not a complete script; it is starter code showing how to process PDB files into the OpenFold/AF2 pickle format consumed by our model. It does not do any of the extra filtering or additional processing in process_pdb_dataset.py.

We originally worked with PDB files and then switched to mmCIF. I hope this is helpful; I imagine it won't be too hard to modify the script to work with SE(3) diffusion, I just don't have the time or the use case.
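To give a rough idea of the shape of the output, a heavily simplified sketch is below. The field names follow the AF2/OpenFold Protein convention (atom_positions, atom_mask, residue_index), the backbone-only atom mapping is just for illustration, and process_pdb_files.py itself is the reference.

import pickle

import numpy as np
from Bio.PDB import PDBParser

# atom37 slots for the backbone atoms only, for illustration; the full
# residue-dependent mapping lives in the OpenFold/AF2 residue constants.
BACKBONE_SLOTS = {"N": 0, "CA": 1, "C": 2, "O": 4}

def pdb_chain_to_pickle(pdb_path, chain_id, out_path):
    structure = PDBParser(QUIET=True).get_structure("struct", pdb_path)
    chain = structure[0][chain_id]  # first model, requested chain

    atom_positions, atom_mask, residue_index = [], [], []
    for res in chain:
        if res.id[0] != " ":  # skip heteroatoms and waters
            continue
        pos = np.zeros((37, 3))
        mask = np.zeros((37,))
        for name, slot in BACKBONE_SLOTS.items():
            if name in res:
                pos[slot] = res[name].get_coord()
                mask[slot] = 1.0
        atom_positions.append(pos)
        atom_mask.append(mask)
        residue_index.append(res.id[1])

    # The real starter script also stores aatype, b_factors, chain_index, etc.
    features = {
        "atom_positions": np.array(atom_positions),  # [num_res, 37, 3]
        "atom_mask": np.array(atom_mask),            # [num_res, 37]
        "residue_index": np.array(residue_index),    # [num_res]
    }
    with open(out_path, "wb") as f:
        pickle.dump(features, f)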

pengzhangzhi commented 1 year ago

Thanks!