I am currently working on a project where I would like to grid multiple protein pockets (~10 000).
Using the Grid single molecule tutorial, I figured it would be quick to use CoordinateSet objects to create the grids, instead of using the ExampleProvider with (gnina)types files. However, when I trained a model with these grids as input, training ran into OOM errors.
A snapshot of my code is this one:
gmaker = molgrid.GridMaker()
grid_dims = gmaker.grid_dimensions(molgrid.defaultGninaReceptorTyper.num_types())
# pdb_ids is a list of ~16 pdb ids in a batch given to a model
mols = [next(pybel.readfile("pdb", f"../v2019-other-PL/{pdb_id}/{pdb_id}_pocket.pdb")) for pdb_id in pdb_ids]
coord_set = [molgrid.CoordinateSet(mol, molgrid.defaultGninaReceptorTyper) for mol in mols]
batch_dims = (len(coord_set), *grid_dims)
batch_grid = torch.zeros(batch_dims, dtype=torch.float32).to(self.device)
for i in range(batch_grid.shape[0]):
    gmaker.forward(coord_set[i].center(), coord_set[i], batch_grid[i])
I traced the memory leak back to the CoordinateSet initialization line. I wrote a script showing that the memory growth can be reproduced with the tutorial molecule: creating the CoordinateSet for the sdf 100 000 times takes 1.5 GB of RAM. That would be acceptable for a dataset of, say, 100 000 small molecules, but my dataset contains 10 000 protein pockets, which leads to much larger leaks that prevent me from running the model for multiple epochs.
import os

import matplotlib.pyplot as plt
import molgrid
import psutil
from molgrid.openbabel import pybel

memory_usages = []
for _ in range(100000):
    # sdf is the sdf molecule embedded as a string in the tutorial
    mol = pybel.readstring('sdf', sdf)
    coord_set = molgrid.CoordinateSet(mol)
    usage = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2
    memory_usages.append(usage)

plt.plot(memory_usages)
plt.title('Creating CoordinateSet from the OBMol for a single molecule 100 000 times')
plt.xlabel('Iteration')
plt.ylabel('Memory usage (MB)')
plt.savefig('memory_leak_coordinate_set.png', bbox_inches='tight')
A potential fix would be to store the CoordinateSets for all my protein pockets (in a Python dict or in pickle files), but I think it would be easier if we were able to compute them on the fly at each training iteration.
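A minimal sketch of that dict-caching option (the loader function is a hypothetical stand-in for the pybel + CoordinateSet construction above):

```python
class CoordinateSetCache:
    """Build each pocket's CoordinateSet once and reuse it across epochs."""

    def __init__(self, loader):
        # loader: pdb_id -> CoordinateSet (e.g. wrapping pybel.readfile
        # and molgrid.CoordinateSet); supplied by the caller
        self._loader = loader
        self._cache = {}

    def get(self, pdb_id):
        # construct on first access only, so the leaky initialization
        # runs once per pocket instead of once per training iteration
        if pdb_id not in self._cache:
            self._cache[pdb_id] = self._loader(pdb_id)
        return self._cache[pdb_id]
```

With ~10 000 pockets this trades a one-off memory cost for avoiding the repeated constructions that trigger the leak.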
Do you know if this memory leak can be fixed? If not, do you have another quick and easy way to grid proteins on the fly, or should I just store the CoordinateSets?
Definitely a bug, likely an interaction with the Python reference counting. The quickest workaround is to put your protein file names into a types file and read it in with ExampleProvider.
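A sketch of that workaround, assuming the two-column "label file" types layout from the tutorials (the placeholder label, file name, and paths are illustrative):

```python
def write_types_file(pdb_ids, path):
    """Write one example per line; libmolgrid types files start each
    line with a numeric label, so 0 is used as a placeholder here."""
    with open(path, "w") as f:
        for pdb_id in pdb_ids:
            f.write(f"0 {pdb_id}/{pdb_id}_pocket.pdb\n")

# Gridding then follows the ExampleProvider tutorial pattern
# (sketch, not verified against v0.5.1):
#   provider = molgrid.ExampleProvider(molgrid.defaultGninaReceptorTyper,
#                                      data_root="../v2019-other-PL")
#   provider.populate("pockets.types")
#   gmaker = molgrid.GridMaker()
#   dims = gmaker.grid_dimensions(provider.num_types())
#   batch = provider.next_batch(16)
#   grid = molgrid.MGrid5f(16, *dims)
#   gmaker.forward(batch, grid)
```

ExampleProvider handles the molecule parsing internally, so the per-iteration CoordinateSet construction that triggers the leak is avoided.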
Hi,
I use molgrid v0.5.1, installed with pip.
Thanks a lot!