leeping / forcebalance

Systematic force field optimization.
Other
146 stars 75 forks source link

"Engine" concept - a potential major improvement #26

Closed leeping closed 11 years ago

leeping commented 11 years ago

I'd like to bring up an idea that's been rolling around in my head, which if successfully implemented should greatly improve the reusability of the code. However, I'm still puzzling over some design choices, so this is more of a future improvement.

Currently, "Target" classes are implemented in the ForceBalance code using the following inheritance structure:

Target (base class)
    Liquid
        Liquid_OpenMM
        Liquid_GMX
    BindingEnergy
        BindingEnergy_OpenMM
        BindingEnergy_TINKER
    AbInitio
        AbInitio_OpenMM
        AbInitio_GMX
        AbInitio_TINKER

The derived classes Liquid and BindingEnergy correspond to the physical quantities that we would like to fit using our force field. They contain code for reading the reference data files and calculating the objective function from the difference of the reference quantity and the simulated quantity. Since they don't evaluate the simulated quantities, these classes are independent of any MD software that is installed.

The next layer of derived classes Liquid_GMX are like an implementation layer. They contain code to execute the external MD code (or make API calls in the case of OpenMM). For instance, BindingEnergy_TINKER contains calls to the TINKER programs optimize and analyze, in order to optimize the geometries of complexes and calculate their energies with respect to the isolated monomers.

The problem that I want to address is the duplication of code in the implementation layer. For instance, both the classes Liquid_OpenMM and AbInitio_OpenMM contain calls to OpenMM to evaluate the potential energy on a set of structures. It might be possible to restructure the code such that performing a given operation (e.g. evaluating the potential energy on a set of structures) with a given MD code (e.g. OpenMM) only needs to be implemented once.

I have heard bad things about multiple inheritance, so I think a "composition" framework might be a good design. This would introduce a new class that represents the MD code, which I will call the Engine.

The Engine object should contain methods that carry out useful operations in the MD code, which need to be callable without knowledge of which code we're actually using. An early example is already implemented in https://github.com/leeping/forcebalance/blob/master/src/data/npt.py lines 181 through 803.

Once we have these Engine classes, then I imagine that we will no longer need two layers of derived classes for Target classes. Instead, the Target objects will contain Engine objects and call them as needed.

Two outstanding questions / related issues is whether the Engine class should contain any data, and how to handle file dependencies. For instance, in order to minimize the energy of a structure in any MD code (e.g. this would look like TinkerEngine.minimizeEnergy() in the proposed code), one needs to provide the structure. We can either store the structure as data in the Engine object (e.g. as a Molecule object), or the method needs to take the structure as an input argument.

The benefit of storing the structures is that we can count on the Molecule class to write the correct file format for any MD code to use. But where does it make sense to store the structures - in the Target object or the Engine object? There is also a cost associated with storing the structures because they need to be in memory over the course of the calculation, whereas many MD codes simply iterate over the files. For certain targets, it might not be possible to store the structures in memory (e.g. if they correspond to long MD runs.) In these instances, does it make more sense to remember just the file names?

Let me know what you think. I think that once the data storage question is cleared up, I can begin implementing this.

VijayPande commented 11 years ago

I would be tempted to store the structures in the Target class, since it's the base class, and therefore other derived classes would get access. I think it's ok for now to assume that the structures can fit in memory and one can later write code to generalize that to handle disk-based clusters. My guess is that not fitting in memory is pretty rare, right?

PS Curious what Robert thinks about this.

Thanks,

Vijay

Sent from my Phone. Sorry for the brevity or unusual tone.

On Sep 24, 2013, at 10:38 AM, Lee-Ping notifications@github.com wrote:

I'd like to bring up an idea that's been rolling around in my head, which if successfully implemented should greatly improve the reusability of the code. However, I'm still puzzling over some design choices, so this is more of a future improvement.

Currently, "Target" classes are implemented in the ForceBalance code using the following inheritance structure:

Target (base class) Liquid Liquid_OpenMM Liquid_GMX BindingEnergy BindingEnergy_OpenMM BindingEnergy_TINKER AbInitio AbInitio_OpenMM AbInitio_GMX AbInitio_TINKER The derived classes Liquid and BindingEnergy correspond to the physical quantities that we would like to fit using our force field. They contain code for reading the reference data files and calculating the objective function from the difference of the reference quantity and the simulated quantity. Since they don't evaluate the simulated quantities, these classes are independent of any MD software that is installed.

The next layer of derived classes Liquid_GMX are like an implementation layer. They contain code to execute the external MD code (or make API calls in the case of OpenMM). For instance, BindingEnergy_TINKER contains calls to the TINKER programs optimize and analyze, in order to optimize the geometries of complexes and calculate their energies with respect to the isolated monomers.

The problem that I want to address is the duplication of code in the implementation layer. For instance, both the classes Liquid_OpenMM and AbInitio_OpenMM contain calls to OpenMM to evaluate the potential energy on a set of structures. It might be possible to restructure the code such that performing a given operation (e.g. evaluating the potential energy on a set of structures) with a given MD code (e.g. OpenMM) only needs to be implemented once.

I have heard bad things about multiple inheritance, so I think a "composition" framework might be a good design. This would introduce a new class that represents the MD code, which I will call the Engine.

The Engine object should contain methods that carry out useful operations in the MD code, which need to be callable without knowledge of which code we're actually using. An early example is already implemented in https://github.com/leeping/forcebalance/blob/master/src/data/npt.py lines 181 through 803.

Once we have these Engine classes, then I imagine that we will no longer need two layers of derived classes for Target classes. Instead, the Target objects will contain Engine objects and call them as needed.

Two outstanding questions / related issues is whether the Engine class should contain any data, and how to handle file dependencies. For instance, in order to minimize the energy of a structure in any MD code (e.g. TinkerEngine.minimizeEnergy(), one needs to provide the structure. We can either store the structure as data in the Engine object (e.g. as a Molecule object), or the method needs to take the structure as an input argument.

The benefit of storing the structures is that we can count on the Molecule class to write the correct file format for any MD code to use. But where does it make sense to store the structures - in the Target object or the Engine object? There is also a cost associated with storing the structures because they need to be stored in memory over the course of the calculation, whereas many MD codes simply iterate over the files. For certain targets, it might not be possible to store the structures in memory (e.g. if they correspond to long MD runs.)

Let me know what you think. I think that once the data storage question is cleared up, I can begin implementing this.

— Reply to this email directly or view it on GitHub.

leeping commented 11 years ago

Vijay, thanks for your input.

I also think that storing structures in the Target object makes sense (or paths in the file system corresponding to structure files). For simple things like force matching, we are already storing the QM forces anyway, which is an array of similar size. On the other hand, for things like the solvation free energies, we might have a whole trajectory which could be impractical to store in memory.

The efficiency of storing structures in memory vs. on disk also depends on which software is being used to run the calculations. For example, TINKER and GROMACS work with files - if structures are stored in memory, they need to be saved to .arc or .gro prior to calling TINKER or GROMACS. We shouldn't have to write the file every time TINKER / GROMACS is called because the files are the same each time the program is called; we should only need to write it once at most (and that's to write the structures in a compatible file format). On the other hand, OpenMM can directly work with structures stored in memory, so it makes sense to load the whole structure when we can.

VijayPande commented 11 years ago

ok, the Target object could store it both in memory and disk and pass both a pointer and a file locator and let the implementations downstream use whichever they'd prefer?

Thanks,

Vijay

Sent from my Phone. Sorry for the brevity or unusual tone.

On Sep 24, 2013, at 11:00 AM, Lee-Ping notifications@github.com wrote:

Vijay, thanks for your input.

I also think that storing structures in the Target object makes sense (or paths in the file system corresponding to structure files). For simple things like force matching, we are already storing the QM forces anyway, which is an array of similar size. On the other hand, for things like the solvation free energies, we might have a whole trajectory which could be impractical to store in memory.

The efficiency of storing structures in memory vs. on disk also depends on which software is being used to run the calculations. For example, TINKER and GROMACS work with files - if structures are stored in memory, they need to be saved to .arc or .gro prior to calling TINKER or GROMACS. We shouldn't have to write the file every time TINKER / GROMACS is called because the files are the same each time the program is called; we should only need to write it once at most (and that's to write the structures in a compatible file format). On the other hand, OpenMM can directly work with structures stored in memory, so it makes sense to load the whole structure when we can.

— Reply to this email directly or view it on GitHub.

leeping commented 11 years ago

Good idea, that can certainly be done. :8ball:

tmartine68 commented 11 years ago

Seems to me that an engine class should not have data associated with it (neither explicitly nor by file reference). Perhaps you are instead thinking of a container class which contains the "task," engine, and molecular structures? -Todd

On Tue, Sep 24, 2013 at 11:40 AM, Lee-Ping notifications@github.com wrote:

Good idea, that can certainly be done. [image: :8ball:]

— Reply to this email directly or view it on GitHubhttps://github.com/leeping/forcebalance/issues/26#issuecomment-25031274 .

Todd Martinez Ehrsam and Franklin Professor of Chemistry Department of Chemistry Stanford University Stanford, CA 94305-5080 Todd.Martinez@stanford.edu

650-736-8860

rmcgibbo commented 11 years ago

My 2 cents:

I think this is an appropriate usecase for mixins, but composition would be would be fine too. Honestly they're not that different. The mixin API is basically multiple inheritance, but the mixin classes aren't designed to be standalone. Basically the trick is just not to go overboard with multiple inheritence. It's useful in limited doses. You want to be careful about formalizing the interaction between the base classes and the mixins. http://stackoverflow.com/questions/533631/what-is-a-mixin-and-why-are-they-useful

For dealing with molecule data that can either be in memory or on disk, the most elegant thing, IMO, would be a lazy Trajectory object. When you create one from a file on disk, it doesn't actually load the file into memory -- data is only loaded when you actually request it. This is basically how the python netcdf4/hdf5 codes work. The data is obviously cached in memory such that if you repeatedly request the same data, it only hits the disk once. The extension to dealing with different file types is that you can request a "view" of this trajectory as .gro/.arc/etc file. If necessary, the code saves the data to disk for you, but otherwise it can just hand you the filename. This might be over-engineered though -- @VijayPande's solution might be more practical.