smiles724 / GGNN_Meets_PLM

Question regarding how to extract protein representation from your pretrain model #3

Open Tizzzzy opened 1 month ago

Tizzzzy commented 1 month ago

Hi author, huge fan of your work. I am currently trying to apply your code to a downstream task. Specifically, I have a protein PDB file and want to extract the protein's latent representation using your pretrained model. Is this possible? If so, could you tell me which script I should run and which line of code stores the protein representation? Thank you so much for your time.

smiles724 commented 1 month ago

Hi, thanks for your interest in my work and for trying to adapt it to your own task. May I ask what kind of structure your PDB file contains? Is it a single-chain protein or a complex composed of multiple chains, and is any ligand present? Different modalities require different feature extractors.

Tizzzzy commented 1 month ago

Hi author, thanks for your reply. I have both single-chain proteins (without ligands) and multi-chain proteins (with ligands), and I would like to extract representations from the pretrained model for all of them. Is that possible?

smiles724 commented 1 month ago

Hi, that is an interesting question, and I now have a clearer picture of what you need.

For single-chain proteins without ligands, the most closely related task is Model Quality Assessment (MQA). You can use the PSRTransform in https://github.com/smiles724/GGNN_Meets_PLM/blob/259f0c526521427c0fa337c49e4abece3daf6467/gvp/atom3d.py#L144 to transform the loaded data. The corresponding model architecture is in https://github.com/smiles724/GGNN_Meets_PLM/blob/259f0c526521427c0fa337c49e4abece3daf6467/gvp/atom3d.py#L94 (see BaseModel).

For multi-chain complexes with ligands, the most closely related task is ligand-binding affinity (LBA) prediction. The data loader and model are in https://github.com/smiles724/GGNN_Meets_PLM/blob/259f0c526521427c0fa337c49e4abece3daf6467/gvp/atom3d.py#L282 and https://github.com/smiles724/GGNN_Meets_PLM/blob/259f0c526521427c0fa337c49e4abece3daf6467/gvp/atom3d.py#L349, respectively.
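
To decide which of the two routes above a given PDB file falls under, something like the following rough sketch may help. It is not part of the repository: the function names `summarize_pdb` and `suggest_pipeline` are hypothetical, and it assumes standard fixed-column ATOM/HETATM records. Any non-water HETATM residue is treated as a ligand, which will also flag ions and modified residues, so the result is only a hint.

```python
from typing import Iterable

# Residue names commonly used for water; treated as "not a ligand" here.
WATER_NAMES = {"HOH", "WAT", "DOD"}


def summarize_pdb(lines: Iterable[str]) -> dict:
    """Collect polymer chain IDs and non-water HETATM residue names."""
    chains, ligands = set(), set()
    for line in lines:
        record = line[0:6].strip()
        if record == "ENDMDL":
            break  # only inspect the first model of a multi-model file
        if record not in ("ATOM", "HETATM") or len(line) < 22:
            continue
        resname = line[17:20].strip()  # PDB columns 18-20
        chain = line[21]               # PDB column 22
        if record == "ATOM":
            chains.add(chain)
        elif resname not in WATER_NAMES:
            ligands.add(resname)
    return {"chains": sorted(chains), "ligands": sorted(ligands)}


def suggest_pipeline(summary: dict) -> str:
    """Map a summary to the task named in this thread (assumed heuristic)."""
    if summary["ligands"]:
        return "LBA"      # structure with a ligand
    if len(summary["chains"]) == 1:
        return "PSR"      # single chain, no ligand (PSRTransform route)
    return "unclear"      # e.g. multi-chain without ligand: not covered above
```

Usage would be along the lines of `suggest_pipeline(summarize_pdb(open(path)))`, where `path` is your own PDB file.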


Note that the data used in my paper were preprocessed and packed into the LMDB format for ease of reproduction (see Atom3D for more details). You will need to preprocess your PDB data into the same format before you can run the script.
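
As a starting point for that preprocessing, here is a minimal stdlib-only sketch that reads the ATOM/HETATM records of a PDB file into a per-atom column table. The function name `parse_pdb_atoms` and the column names are illustrative assumptions, not the repository's or Atom3D's actual schema. Packing the result into LMDB is not shown, so check the Atom3D documentation for the exact layout the transforms expect.

```python
from typing import Dict, Iterable, List

# Illustrative column names (assumption); adjust to the schema you target.
COLUMNS = ("hetero", "chain", "resname", "residue", "name", "element", "x", "y", "z")


def parse_pdb_atoms(lines: Iterable[str]) -> Dict[str, List]:
    """Parse fixed-column ATOM/HETATM records into a dict of column lists."""
    table: Dict[str, List] = {key: [] for key in COLUMNS}
    for line in lines:
        record = line[0:6].strip()
        if record == "ENDMDL":
            break  # keep only the first model
        if record not in ("ATOM", "HETATM") or len(line) < 54:
            continue  # skip other records and lines too short for coordinates
        name = line[12:16].strip()                 # PDB columns 13-16
        # Element symbol lives in columns 77-78; if absent, fall back crudely
        # to the first character of the atom name (ambiguous for e.g. "CA").
        element = line[76:78].strip() or name[:1]
        table["hetero"].append(record == "HETATM")
        table["chain"].append(line[21])            # column 22
        table["resname"].append(line[17:20].strip())
        table["residue"].append(int(line[22:26]))  # residue sequence number
        table["name"].append(name)
        table["element"].append(element)
        table["x"].append(float(line[30:38]))
        table["y"].append(float(line[38:46]))
        table["z"].append(float(line[46:54]))
    return table
```

The resulting dictionary can be handed to `pandas.DataFrame(table)` if a DataFrame is more convenient for the downstream transform.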