learningmatter-mit / GenZProt

Inference script is not working #1

Closed: huhlim closed this issue 1 year ago

huhlim commented 1 year ago

The inference script does not work because the length of its "testset" is zero. This happens because the "nxyz_data" list in the "build_cg_dataset" function is empty.

SoojungYang commented 1 year ago

Hi @huhlim, thank you for letting us know!
For inference runs, the nxyz_data list is supposed to be empty because we don't have all-atom xyz information.
There was a typo in the README (cd script -> cd scripts), and the example topology file PED00055.pdb was missing from the ./data directory; I think this was the issue. We fixed the README and added the file. Could you pull the data file and try again?

huhlim commented 1 year ago

Hi @SoojungYang, thank you for the update! I ran the inference commands again but got an almost empty pkl file, which contained only array([], shape=(10, 0), dtype=float64). I think the problem is that the length of the testset variable is zero: https://github.com/learningmatter-mit/GenZProt/blob/81d5512681a953a2ce967e1258693f4c7c4f4ed0/scripts/inference.py#L152-L154
This is because the build_dataset function calls build_cg_dataset: https://github.com/learningmatter-mit/GenZProt/blob/81d5512681a953a2ce967e1258693f4c7c4f4ed0/scripts/inference.py#L43-L57
build_cg_dataset then creates a PyTorch dataset: https://github.com/learningmatter-mit/GenZProt/blob/81d5512681a953a2ce967e1258693f4c7c4f4ed0/GenZProt/datasets.py#L656-L676
Unfortunately, the "len" method of CGDataset is defined by the length of the nxyz_data list, which is empty for inference: https://github.com/learningmatter-mit/GenZProt/blob/81d5512681a953a2ce967e1258693f4c7c4f4ed0/GenZProt/data.py#L90-L101
I tried to fix the issue but could not. If I have misunderstood something, please let me know.
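To make the failure mode concrete, here is a minimal sketch in plain Python of a dataset whose length follows an all-atom list that is empty at inference time. The class `ToyCGDataset`, its field names, and the batching loop are hypothetical stand-ins for illustration, not the repository's actual CGDataset code.

```python
class ToyCGDataset:
    """Hypothetical stand-in: a dataset whose length follows the all-atom list."""

    def __init__(self, nxyz_data, cg_data):
        self.nxyz_data = nxyz_data  # all-atom coordinates (empty at inference)
        self.cg_data = cg_data      # coarse-grained coordinates (always present)

    def __len__(self):
        # Length is tied to the all-atom list, so it is 0 when that list is empty,
        # even though coarse-grained frames are available.
        return len(self.nxyz_data)

    def __getitem__(self, idx):
        return {"nxyz": self.nxyz_data[idx], "CG_nxyz": self.cg_data[idx]}


# Three coarse-grained frames, but no all-atom data (the inference case).
cg_frames = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]]
testset = ToyCGDataset(nxyz_data=[], cg_data=cg_frames)

# A loader that steps through range(len(dataset)) never yields a batch.
n_batches = sum(1 for _ in range(0, len(testset), 2))
print(len(testset), n_batches)
```

Under these assumptions the loop body never runs, which would be consistent with an output array that has no generated coordinates.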

SoojungYang commented 1 year ago

Hi @huhlim, you are right, there was an issue with data loading. I fixed it by adding the functions CG_dataset_inf and CG_collate_inf, and it should work properly now. I also updated the final inference output generation to provide both a numpy array and a pdb file for better readability (please check the updated README). I hope the problem is solved now! Again, thank you for your input, and I'm sorry for the confusion.
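One way an inference-only data path could look is sketched below in plain Python. This is an assumption-laden illustration, not the actual CG_dataset_inf or CG_collate_inf from the repository: the names `ToyCGInferenceDataset`, `toy_collate_inf`, and `iter_batches`, and the dictionary keys, are all hypothetical. The point is only that the dataset length comes from the coarse-grained frames, which do exist at inference time.

```python
class ToyCGInferenceDataset:
    """Hypothetical inference dataset: length follows the coarse-grained frames."""

    def __init__(self, cg_data):
        self.cg_data = cg_data

    def __len__(self):
        return len(self.cg_data)

    def __getitem__(self, idx):
        # No all-atom entry is returned, because none exists at inference time.
        return {"CG_nxyz": self.cg_data[idx]}


def toy_collate_inf(samples):
    """Hypothetical collate function: gathers CG frames from a list of samples."""
    return {
        "CG_nxyz": [s["CG_nxyz"] for s in samples],
        "num_frames": len(samples),
    }


def iter_batches(dataset, batch_size):
    """Yield collated batches, stepping through the dataset in order."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield toy_collate_inf([dataset[i] for i in range(start, stop)])


cg_frames = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]]
dataset = ToyCGInferenceDataset(cg_frames)
batches = list(iter_batches(dataset, batch_size=2))
print(len(dataset), [b["num_frames"] for b in batches])
```

With three frames and a batch size of 2, the sketch yields one full batch and one partial batch, so every frame reaches the model.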

huhlim commented 1 year ago

Thanks for the fix! It is working fine after correcting a typo, traj_to_into --> traj_to_info https://github.com/learningmatter-mit/GenZProt/blob/9bb8e57b256e6fee9854a9548b5d3f90b23f214b/scripts/inference.py#L123

huhlim commented 1 year ago

Sorry for reopening the issue. The inference script now generates outputs. However, both the N- and C-terminal residues are excluded from the generated structures.

SoojungYang commented 1 year ago

No worries! The N- and C-terminal residues are truncated because our algorithm requires the (i-1)th and (i+1)th C_alpha positions to locate the atoms of the ith residue. The topology of the generated pdb file is truncated accordingly. I added this clarification to the README. You can also refer to Appendix D.5 of the preprint. In future updates, we plan to include backmapping of the terminal residues.
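The indexing consequence of this requirement can be sketched as follows. The helper names `backmappable_residues` and `neighbor_vectors` are hypothetical and the code is not taken from GenZProt; it only shows which residues have both C_alpha neighbors, using 0-based indices.

```python
def backmappable_residues(num_residues):
    """Return 0-based indices i for which residues i-1 and i+1 both exist."""
    return list(range(1, num_residues - 1))


def neighbor_vectors(ca_positions, i):
    """Vectors from the C_alpha of residue i to its two sequence neighbors.

    Raises IndexError for terminal residues, which lack one neighbor.
    """
    if i < 1 or i > len(ca_positions) - 2:
        raise IndexError("residue %d has no C_alpha neighbor on one side" % i)
    prev_ca, this_ca, next_ca = ca_positions[i - 1], ca_positions[i], ca_positions[i + 1]
    to_prev = tuple(p - c for p, c in zip(prev_ca, this_ca))
    to_next = tuple(n - c for n, c in zip(next_ca, this_ca))
    return to_prev, to_next


# Five C_alpha atoms spaced along the x axis (illustrative coordinates only).
ca = [(float(k), 0.0, 0.0) for k in range(5)]
kept = backmappable_residues(len(ca))
print(kept, neighbor_vectors(ca, kept[0]))
```

For a chain of N residues, this leaves N - 2 residues in the output, which matches the truncated topology described above.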

huhlim commented 1 year ago

Thank you for the clarification!