gnina / libmolgrid

Comprehensive library for fast, GPU-accelerated molecular gridding for deep learning workflows
https://gnina.github.io/libmolgrid/
Apache License 2.0

Saving in-memory cache to disk #119

Closed jacobdurrant closed 5 months ago

jacobdurrant commented 6 months ago

I'm using molgrid to gridify pairs of receptor (PDB)/ligand (SDF) files in a directory. I'd prefer not to create .molcache2 files ahead of time. But I do use molgrid.ExampleProvider's cache_structs=True to create an in-memory cache during the first epoch. On the second epoch, it runs much faster thanks to the cache.

But is there a way to save this in-memory cache to disk, so if I retrain on the same dataset, I don't need to regenerate it even in the first epoch? And is it possible to save each individual protein/ligand pair to its own cache file from the in-memory cache?

Here's some code that hopefully explains further what I mean:

import molgrid
import time

# Let's use a custom atom typer
def typer(atom):
    """Typers an atom and returns a tuple of floats and a radius."""

    type_vec = None
    if hasattr(atom, "GetHeteroValence"):
        type_vec = [atom.GetAtomicNum(), atom.GetHeteroValence()]
    else:
        type_vec = [atom.GetAtomicNum(), atom.GetExplicitValence()]
    return (type_vec, 1.5)

t = molgrid.PythonCallbackVectorTyper(typer, 2, ["anum", "valence"])

# Create the data set, with cache_structs=True
dataset = molgrid.ExampleProvider(t, default_batch_size=16, cache_structs=True)

# Populate the dataset
dataset.populate("./out/train.types")

# Iterate through the (small) dataset once. Let's see how long it takes.
t1 = time.time()
dataset.next_batch()
dataset.next_batch()
dataset.next_batch()
print(time.time() - t1)

# It took 4.6 secs

# Now iterate again. It should take a shorter time, because cache_structs=True.
dataset.reset()
t1 = time.time()
dataset.next_batch()
dataset.next_batch()
dataset.next_batch()
print(time.time() - t1)

# It took 0.002 secs.

# It was much faster the second time, showing it was cached. But can I now save
# this in-memory cache to disk somehow, so I don't have to regenerate it every
# time I run the program (even if only on the first epoch)? Also, could I save
# the cache per input protein/ligand pair (rather than saving the whole dataset
# to a single file)?
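
As a generic illustration of the per-pair caching idea (this is plain Python with pickle, not molgrid API; the `compute_fn` stands in for whatever expensive typing/parsing step produces the typed data), one could cache each receptor/ligand pair to its own file keyed by the input filenames:

```python
import os
import pickle

def cache_path(receptor, ligand, cache_dir="cache"):
    """Derive a per-pair cache filename from the input file names."""
    os.makedirs(cache_dir, exist_ok=True)
    stem = f"{os.path.basename(receptor)}__{os.path.basename(ligand)}"
    return os.path.join(cache_dir, stem + ".pkl")

def get_typed_pair(receptor, ligand, compute_fn, cache_dir="cache"):
    """Return cached typed data if present; otherwise compute and cache it."""
    path = cache_path(receptor, ligand, cache_dir)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    data = compute_fn(receptor, ligand)  # the expensive typing step
    with open(path, "wb") as f:
        pickle.dump(data, f)
    return data
```

On the first run every pair is computed and written out; on subsequent runs each pair loads from its own cache file, so only new or changed pairs pay the typing cost.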

Thanks for your help with this, and for the great library!

dkoes commented 6 months ago

That's.. kinda the point of the molcaches.. https://github.com/gnina/scripts/blob/master/create_caches2.py

jacobdurrant commented 6 months ago

Hi David. Thanks so much for sending this link. I'll take a close look at it to see how I can implement it in my own code.

jacobdurrant commented 6 months ago

Hi David,

I took a look at the link you sent, and it seems that script is for generating a molcache2 file from gninatypes files. My understanding is that gninatypes files are binary files created using the gninatyper executable (i.e., not Python). I am not using gninatyper but my own custom PythonCallbackVectorTyper, which I apply to PDB and SDF files that I load on the fly.

Is there a way to use molgrid to create the equivalent binary types file so I can then create the molcache2 files using code similar to what you posted? Hope this makes sense.
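For background on what such a binary types file might look like, here is a hypothetical stdlib-only sketch that serializes typed atoms as fixed-size binary records (three float32 coordinates plus an int32 type index per atom). This mirrors a common understanding of the gninatypes layout for index types, but it is an assumption: the actual on-disk format (and whether vector types fit it at all) would need to be checked against the gnina source.

```python
import struct

# Hypothetical record layout: x, y, z as float32, type index as int32,
# little-endian. Not a confirmed gninatypes specification.
RECORD = struct.Struct("<fffi")

def write_types(path, atoms):
    """Write atoms, an iterable of (x, y, z, type_index) tuples, as binary records."""
    with open(path, "wb") as f:
        for x, y, z, t in atoms:
            f.write(RECORD.pack(x, y, z, t))

def read_types(path):
    """Read the binary records back into a list of (x, y, z, type_index) tuples."""
    with open(path, "rb") as f:
        data = f.read()
    return [RECORD.unpack_from(data, i * RECORD.size)
            for i in range(len(data) // RECORD.size)]
```

As the next comment in the thread notes, vector typing is not supported by the memory-mapped caches, so this index-type layout would not carry a PythonCallbackVectorTyper's multi-component type vectors as-is.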

Thanks.

Jacob

dkoes commented 6 months ago

Ah, sorry- you are right. I have not implemented support for vector typing in the memory mapped caches.

I think your original idea is most workable - expose the underlying cache and implement data serialization so it can be written out programmatically once you've gone through the dataset. I've added this with 5a642b14087be643f059cb6f7c9f3be8b16cf893

Could you test it out and make sure it behaves as expected?

dkoes commented 6 months ago
e = molgrid.ExampleProvider()
e2 = molgrid.ExampleProvider()

e.populate(fname)
e2.populate(fname)

# iterate over all the data to load it into the cache

e.save_mem_caches("foo.cache")

e2.load_mem_caches("foo.cache")
# e2 should now have all the data in memory despite not iterating through it

jacobdurrant commented 6 months ago

Hi David,

Thanks so much for your quick help with this. I spent some time yesterday trying to compile molgrid on the CRC and on one of my lab's computers. I ran into a number of challenging issues. Do you happen to have the setup already in place to easily compile it? Sorry for the trouble.

~Jacob

dkoes commented 6 months ago

I went ahead and pushed a new release (0.5.4). You should be able to pip update to get the latest changes.

jacobdurrant commented 5 months ago

Hi David. Just wanted to let you know that this works great. Thanks for all your help with this. This will speed things up a lot for me! ~Jacob