Closed jacobdurrant closed 5 months ago
That's.. kinda the point of the molcaches.. https://github.com/gnina/scripts/blob/master/create_caches2.py
Hi David. Thanks so much for sending this link. I'll take a close look at it to see how I can implement it in my own code.
Hi David,
I took a look at the link you sent, and it seems to be for generating a molcache2 file from gninatypes files. My understanding is that gninatypes files are binary files created using the gninatyper executable (i.e., not Python). I am not using gninatyper but my own custom PythonCallbackVectorTyper, which I apply to PDB and SDF files that I load "on the fly."
Is there a way to use molgrid to create the equivalent binary types file so I can then create the molcache2 files using code similar to what you posted? Hope this makes sense.
Thanks.
Jacob
Ah, sorry, you are right. I have not implemented support for vector typing in the memory-mapped caches.
I think your original idea is the most workable: expose the underlying cache and implement data serialization so it can be written out programmatically once you've gone through the dataset. I've added this in commit 5a642b14087be643f059cb6f7c9f3be8b16cf893
Could you test it out and make sure it behaves as expected?
import molgrid

e = molgrid.ExampleProvider()
e2 = molgrid.ExampleProvider()
e.populate(fname)
e2.populate(fname)
# iterate over all the data in e to load it into the cache
e.save_mem_caches("foo.cache")
e2.load_mem_caches("foo.cache")
# e2 should now have all the data in memory despite not iterating through it
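For retraining on the same dataset, the save/load pair above could be wrapped so the cache is built only when no cache file exists yet. This is a minimal sketch, not part of molgrid: `load_or_build_cache` and its `warm_up` argument are hypothetical names, and the only library calls assumed are `save_mem_caches` and `load_mem_caches` as shown in the snippet above.

```python
import os


def load_or_build_cache(provider, cache_path, warm_up):
    """Fill the provider's in-memory cache, reusing cache_path if it exists.

    provider:   any object exposing save_mem_caches(path) and
                load_mem_caches(path), e.g. an already populated
                molgrid.ExampleProvider (assumption: a molgrid build
                that includes these two methods).
    warm_up:    hypothetical callable that iterates once over all the
                data so that every example ends up in the cache.
    Returns True if the cache was loaded from disk, False if it was built.
    """
    if os.path.exists(cache_path):
        # A previous run already wrote the cache; skip the slow first pass.
        provider.load_mem_caches(cache_path)
        return True
    # No cache file yet: do the slow pass once, then write it out.
    warm_up(provider)
    provider.save_mem_caches(cache_path)
    return False
```

With molgrid, `provider` would be an ExampleProvider on which `populate(fname)` has already been called, as in the snippet above, and `warm_up` would be whatever loop you already run for your first epoch.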
Hi David,
Thanks so much for your quick help with this. I spent some time yesterday trying to compile molgrid on the CRC and on one of my lab's computers. I ran into a number of challenging issues. Do you happen to have the setup already in place to easily compile it? Sorry for the trouble.
~Jacob
I went ahead and pushed a new release (0.5.4). You should be able to upgrade with pip (pip install --upgrade molgrid) to get the latest changes.
Hi David. Just wanted to let you know that this works great. Thanks for all your help with this. This will speed things up a lot for me! ~Jacob
I'm using molgrid to gridify pairs of receptor (PDB)/ligand (SDF) files in a directory. I'd prefer not to create .molcache2 files ahead of time, but I do use molgrid.ExampleProvider's cache_structs=True option to create an in-memory cache during the first epoch. On the second epoch, it runs much faster thanks to the cache. But is there a way to save this in-memory cache to disk, so that if I retrain on the same dataset, I don't need to regenerate it even in the first epoch? And is it possible to save each individual protein/ligand pair to its own cache file from the in-memory cache?
Here's some code that hopefully explains further what I mean:
Thanks for your help with this, and for the great library!