gnina / libmolgrid

Comprehensive library for fast, GPU-accelerated molecular gridding for deep learning workflows
https://gnina.github.io/libmolgrid/
Apache License 2.0

Memory leak using CoordinateSet init with OBMol #73

Closed (bbaillif closed this issue 3 years ago)

bbaillif commented 3 years ago

Hi,

I am currently working on a project where I would like to grid multiple protein pockets (~10,000). Following the "Grid a single molecule" tutorial, I figured it would be quicker to create the grids from CoordinateSet objects directly, instead of using the ExampleProvider with (gnina)types files. However, when I trained a model with these grids as input, I ran into OOM errors.

A snippet of my code:

import molgrid
import torch
from molgrid.openbabel import pybel

gmaker = molgrid.GridMaker()
grid_dims = gmaker.grid_dimensions(molgrid.defaultGninaReceptorTyper.num_types())

# pdb_ids is a list of ~16 pdb ids forming one batch given to the model
mols = [next(pybel.readfile("pdb", f"../v2019-other-PL/{pdb_id}/{pdb_id}_pocket.pdb")) for pdb_id in pdb_ids]
coord_sets = [molgrid.CoordinateSet(mol, molgrid.defaultGninaReceptorTyper) for mol in mols]
batch_dims = (len(coord_sets), *grid_dims)
# this runs inside my model class, hence self.device
batch_grid = torch.zeros(batch_dims, dtype=torch.float32).to(self.device)

for i in range(batch_grid.shape[0]):
    gmaker.forward(coord_sets[i].center(), coord_sets[i], batch_grid[i])

I traced the memory leak back to the CoordinateSet initialization line. I wrote a script showing that the leak can be reproduced with the tutorial molecule. Creating the CoordinateSet for the sdf 100,000 times takes 1.5 GB of RAM, which would be acceptable if the dataset contained, e.g., 100,000 small molecules. However, my dataset contains 10,000 protein pockets, which leads to a much larger leak that prevents me from running the model for multiple epochs.

import molgrid
import os
import psutil
from molgrid.openbabel import pybel
import matplotlib.pyplot as plt

memory_usages = []
for _ in range(100000):
    # sdf is the sdf molecule embedded as a string in the tutorial
    mol = pybel.readstring('sdf', sdf)
    coord_set = molgrid.CoordinateSet(mol)

    # record the resident set size (in MB) of the current process
    usage = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2
    memory_usages.append(usage)

plt.plot(memory_usages)
plt.title('Creating CoordinateSet from the OBMol for a single molecule 100 000 times')
plt.xlabel('Iteration')
plt.ylabel('Memory usage (MB)')
plt.savefig('memory_leak_coordinate_set', bbox_inches='tight')

[Attached plot: memory usage (MB) vs. iteration, increasing steadily over 100,000 CoordinateSet creations]

A potential workaround would be to precompute and store the CoordinateSets for all my protein pockets (in a Python dict or in pickle files), but it would be more convenient to compute them on the fly at each training iteration.
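
For concreteness, the caching I have in mind would look something like this (rough sketch; get_pocket_path is a hypothetical helper that returns the pocket PDB path for a given pdb id):

import molgrid
from molgrid.openbabel import pybel

# Rough sketch of the workaround: compute each CoordinateSet once, reuse it afterwards.
coord_set_cache = {}

def get_coord_set(pdb_id):
    if pdb_id not in coord_set_cache:
        # get_pocket_path is a hypothetical helper returning the pocket PDB path
        mol = next(pybel.readfile("pdb", get_pocket_path(pdb_id)))
        coord_set_cache[pdb_id] = molgrid.CoordinateSet(mol, molgrid.defaultGninaReceptorTyper)
    return coord_set_cache[pdb_id]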

Do you know if this memory leak can be fixed? If not, do you have another quick and easy way to grid proteins on the fly, or should I just store the CoordinateSets?

I am using molgrid v0.5.1, installed with pip.

Thanks a lot

dkoes commented 3 years ago

Definitely a bug, likely in the interaction with the Python reference counting. The quickest workaround is to put your protein file names into a file and read it in with ExampleProvider.
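
Something along these lines (untested sketch; it assumes a pockets.types file listing one pocket PDB path per line, resolved relative to data_root):

import molgrid
import torch

# Untested sketch: read pocket file names from a types file with ExampleProvider
provider = molgrid.ExampleProvider(molgrid.defaultGninaReceptorTyper,
                                   data_root='../v2019-other-PL')
provider.populate('pockets.types')

gmaker = molgrid.GridMaker()
grid_dims = gmaker.grid_dimensions(provider.num_types())

batch_size = 16
batch_grid = torch.zeros((batch_size, *grid_dims), dtype=torch.float32, device='cuda')

# next_batch cycles through the populated examples; gridding happens here
batch = provider.next_batch(batch_size)
gmaker.forward(batch, batch_grid, random_translation=0.0, random_rotation=False)

ExampleProvider caches the parsed structures by default (cache_structs), so later epochs reuse them instead of re-reading the files.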

dkoes commented 3 years ago

Fix pushed to git.