jyaacoub / MutDTA

Improving the precision oncology pipeline by providing binding affinity perturbation predictions on a priori identified cancer driver genes.

Improve performance of PDB parser #86

Closed jyaacoub closed 4 months ago

jyaacoub commented 6 months ago

The PDB parser takes a long time to parse 50 conformations (see #84). However, my parsing script is already about as optimized as it can be for Python code, so the next step should be to write the parser in C or Rust.
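For reference, here is a minimal sketch of what a single-pass parse of a multi-model PDB file can look like in pure Python. It is not the `Chain.get_all_models` implementation; the function name `parse_ca_models` and its `chain_id` parameter are hypothetical, and it assumes standard fixed-width `ATOM` records (atom name in columns 13-16, chain ID in column 22, coordinates in columns 31-54) with `MODEL` records separating the conformations.

```python
from typing import Dict, Iterable, List, Tuple

Coord = Tuple[float, float, float]


def parse_ca_models(lines: Iterable[str], chain_id: str = "A") -> Dict[int, List[Coord]]:
    """Collect C-alpha coordinates per model in one pass over PDB lines.

    Hypothetical helper for illustration only; it ignores altLocs,
    HETATM records, and anything other than MODEL/ATOM lines.
    """
    models: Dict[int, List[Coord]] = {}
    current = 1  # files without MODEL records are treated as one model
    for line in lines:
        record = line[:6]
        if record == "MODEL ":
            # The model serial number follows the record name.
            current = int(line[6:].split()[0])
        elif record == "ATOM  ":
            # Fixed-width slices: atom name 13-16, chain ID 22, x/y/z 31-54.
            if line[12:16].strip() == "CA" and line[21] == chain_id:
                xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
                models.setdefault(current, []).append(xyz)
    return models
```

Passing an open file handle (for example `with open(path) as fh: parse_ca_models(fh)`, where `path` is a placeholder) reads the file once, however many models it contains.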

Comparison of my parser vs prody.parsePDB:

image

CODE:

from prody import parsePDB
import numpy as np
from src.utils.residue import Chain, Ring3Runner
import logging
import time

logging.getLogger().setLevel(logging.INFO)
logging.getLogger('.prody').setLevel(logging.WARNING)

start_time = time.time()

# af_confs = '/cluster/home/t122995uhn/projects/data/pdbbind/alphaflow_io/out_pid_ln/1c5c.pdb'
pid = '2zq0'
af_confs = f'/cluster/home/t122995uhn/projects/data/pdbbind/alphaflow_io/out_pdb_MD-distilled/{pid}.pdb'
# pdb_fp = '/cluster/projects/kumargroup/jean/data/pdbbind/v2020-other-PL/1c5c/1c5c_protein.pdb'
pdb_fp = f'/cluster/projects/kumargroup/jean/data/pdbbind/v2020-other-PL/{pid}/{pid}_protein.pdb'
target_seq = Chain(pdb_fp).sequence

# Timing get_all_models
get_all_models_start = time.time()
chains = Chain.get_all_models(af_confs)
logging.info(f"get_all_models: {time.time() - get_all_models_start} seconds")

# Timing prody.parsePDB on the same file (ProDy model indices start at 1)
parse_pdb_start = time.time()
chains = [parsePDB(af_confs, subset='ca', chain='A', model=i) for i in range(1, 51)]
logging.info(f"parsePDB: {time.time() - parse_pdb_start} seconds")
print(len(target_seq))
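The two measurements above follow the same start/stop pattern, so they could be wrapped in a small reusable timer. This is only a sketch: the `timed` helper and the `results` dictionary are hypothetical names, and `time.perf_counter` is swapped in for `time.time` because it is a monotonic clock intended for measuring short durations.

```python
import logging
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Log how long the enclosed block takes, under its own label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        logging.info("%s: %.3f seconds", label, elapsed)
        if results is not None:
            # Keep the measurement so the runs can be compared afterwards.
            results[label] = elapsed


# Stand-in workload; in the script above this would be the parser call.
results = {}
with timed("sum_of_squares", results):
    total = sum(i * i for i in range(100_000))
```

Giving each block a distinct label keeps the log lines for the two parsers distinguishable.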