gnina / scripts

How to create cache/types file for multiple receptors and ligands for PDBbind dataset? #51

Closed JonasLi-19 closed 1 year ago

JonasLi-19 commented 1 year ago

How do I create a cache file for multiple receptors and ligands (they are in gninatypes format)?

1. I know I can use gninatyper to convert ligands and receptors into gninatypes files, but I do not know exactly how to use create_caches2.py to generate a cache file for a ligand-receptor pair (each complex is in its own directory named by PDB ID, and there are definitely docked poses and crystal ligands). [BTW, considering the time it takes, is it necessary to convert them into a cache rather than just using a types file?]

2. I also wonder whether it is possible to generate a cache file (or a types file) for the whole dataset, like the ccv_*.types files, where a single train/test file contains thousands of complexes. Does this require a pipeline of several cross-validation scripts?

In particular, I have no idea how to add the RMSD and affinity to the types or cache file. Do they need to come from CSV files?

Here is the create_caches2.py description. It does not say what -fname is, and it does not mention how to add RMSD and affinity data to each line of the file:

'''Takes a bunch of types training files. First argument is what index the receptor starts on (ligands are assumed to be right after). Reads in the gninatypes files specified in these types files and writes out two monolithic receptor and ligand cache files in version 2 format. Version 2 is optimized for memory mapped storage of caches. keys (file names) are stored first followed by dense storage of values (coordinates and types). '''

dkoes commented 1 year ago

The caches are for faster training. It is much much more efficient to memory map a single large file than open and close many small files. The "types" file is a list of all the examples in your training set with their labels, like this: https://raw.githubusercontent.com/gnina/models/master/data/PDBBind2016/General_types/ccv_gen_norec_uff_0_test0.types

francoep commented 1 year ago

The cache is for the structures (atomic positions & atom types) of the receptor+ligand pairs. The labels of the pose quality and binding affinity are in the "types" file, as David said.

You absolutely can generate a cache for an entire set. We provide caches for CrossDocked2020, which has some 22.5 million poses within it. To generate a cache, you first need to generate the correct types files, which is its own pipeline and process. Then you can provide the types file(s) as input to create_caches2.py in order to make your own custom cache from your own custom types files.
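As a rough illustration of the "generate the types file first" step, here is a minimal Python sketch that turns a CSV of per-pose labels into types-file lines. Everything specific in it is an assumption rather than something stated in this thread: the CSV column names (`pdbid`, `pose`, `affinity`, `rmsd`), the directory and file naming, the 2 Å RMSD cutoff used for the pose label, and the column order `label affinity rmsd receptor ligand`. Check the column order against the example types file linked above before relying on it. The numbers in the example CSV are placeholders, not real PDBbind values.

```python
import csv
import io


def types_line(label, affinity, rmsd, rec_path, lig_path):
    """Format one example as a space-separated types-file line.

    Assumed column order: label affinity rmsd receptor ligand.
    """
    return f"{label} {affinity:.4f} {rmsd:.4f} {rec_path} {lig_path}"


def write_types(rows, out, rmsd_cutoff=2.0):
    """Write one line per pose and return the number of lines written.

    rows: iterable of dicts with the hypothetical keys
          pdbid, pose, affinity, rmsd.
    """
    count = 0
    for row in rows:
        pdbid = row["pdbid"]
        rmsd = float(row["rmsd"])
        # Assumed labeling rule: a pose within the cutoff counts as "good".
        label = 1 if rmsd < rmsd_cutoff else 0
        # Hypothetical layout: one directory per PDB ID.
        rec = f"{pdbid}/{pdbid}_rec.gninatypes"
        lig = f"{pdbid}/{row['pose']}.gninatypes"
        out.write(types_line(label, float(row["affinity"]), rmsd, rec, lig) + "\n")
        count += 1
    return count


# Placeholder label table; in practice this would be read from a CSV on disk.
example_csv = """pdbid,pose,affinity,rmsd
1a30,1a30_ligand,4.30,0.0
1a30,1a30_docked_3,4.30,3.71
"""

buf = io.StringIO()
count = write_types(csv.DictReader(io.StringIO(example_csv)), buf)
types_text = buf.getvalue()
print(types_text, end="")
```

The resulting file lists every pose of every complex with its labels, so one file can cover the whole dataset. The structures themselves stay in the gninatypes files (or in the cache built from them).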

JonasLi-19 commented 1 year ago

Thanks for your reply, I have learned a lot! Now I want to confirm: if I train the model on a types file, there is no need to change or remove the following lines in the model, right?

        ligmolcache: "LIGCACHE_FILE"
        recmolcache: "RECCACHE_FILE"

francoep commented 1 year ago

You need to modify those lines. As written, Caffe will try to find a file on your computer called LIGCACHE_FILE, which will fail, and the process will likely crash.

If you want to use a cache, you have to change LIGCACHE_FILE and RECCACHE_FILE to the name of the corresponding cache files on your machine. If you do NOT want to use a cache, then you can delete those lines from the caffe model file.

Note: there are two instances of these lines in our provided .model files -- one for the test and one for the train.
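For anyone who wants to script this edit, here is a minimal Python sketch that either fills in the two placeholders or drops the lines entirely, and handles every occurrence (so both the train and the test instances). The helper name `set_caches`, the cache file names, and the model fragment are hypothetical. The fragment is a simplified stand-in, not a copy of a real .model file.

```python
def set_caches(model_text, lig_cache=None, rec_cache=None):
    """Return model text with the cache placeholders resolved.

    If a cache path is given, the placeholder is replaced by that path.
    If it is None, the corresponding line is removed so no cache is used.
    """
    out = []
    for line in model_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("ligmolcache:"):
            if lig_cache is None:
                continue  # no ligand cache: drop the line
            line = line.replace("LIGCACHE_FILE", lig_cache)
        elif stripped.startswith("recmolcache:"):
            if rec_cache is None:
                continue  # no receptor cache: drop the line
            line = line.replace("RECCACHE_FILE", rec_cache)
        out.append(line)
    return "\n".join(out) + "\n"


# Simplified stand-in for a model file with a train and a test data section.
section = (
    'layer {\n'
    '  name: "data"\n'
    '        ligmolcache: "LIGCACHE_FILE"\n'
    '        recmolcache: "RECCACHE_FILE"\n'
    '}\n'
)
fragment = section + section

# Training directly from a types file, without caches:
no_cache = set_caches(fragment)

# Training with caches (hypothetical file names):
with_cache = set_caches(fragment, "lig.molcache2", "rec.molcache2")
print(with_cache, end="")
```

To apply it to a real file, read the file's contents into a string, pass the string through `set_caches`, and write the result to a new file so the original stays untouched.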