openforcefield / openff-bespokefit

Automated tools for the generation of bespoke SMIRNOFF format parameters for individual molecules.
https://docs.openforcefield.org/bespokefit
MIT License

The future of BespokeFit #224

Open · Yoshanuikabundi opened this issue 1 year ago

Yoshanuikabundi commented 1 year ago

I'm opening this issue to lay out what I see as the challenges for BespokeFit and hopefully get some advice on clearing them up and planning the next several releases.

I think the next release is uncontroversial, as we've already had one crack at it: 0.2.0, which introduces support for versions of the Toolkit since 0.11. I think we want to release this as soon as we think it works, which is presumably either when ForceBalance makes a release fixing OpenFF interop, or when OpenFF ForceBalance makes its first release. I'd like to discuss what happens after that.

Testing - Making BespokeFit easier to develop

BespokeFit is an impressive piece of code, but I think we've accumulated substantial technical debt on it. Significant PRs commonly get bogged down with test failures that we can't diagnose and end up working around, fits fail even when unit tests pass, and changes that feel simple end up complicated. I've discussed integration tests elsewhere (#218); they are relatively simple to implement and will provide a huge improvement. However, if we want to be able to keep confidently releasing new features for BespokeFit, I think we also need to rethink how we're doing unit tests.

Documenting existing tests

Only about a quarter to a third of our tests have docstrings. BespokeFit's tests are pretty difficult to interpret by just reading the code. There's an average of about two comment lines for every three tests, tests often include unfamiliar (at least to me) things like HTTP POST requests, and they require a lot of knowledge about the structure of the software, which is itself undocumented (more on that later). It'd be great to get at least single-line docstrings for all tests, and detailed docstrings for the mock-ups and fixtures that are used repeatedly across tests.
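
For example, a minimal sketch of what I have in mind (the fixture and test names here are hypothetical, not ones that exist in the suite):

```python
import pytest


# Hypothetical fixture name, purely for illustration
@pytest.fixture()
def mock_forcebalance_settings():
    """A minimal optimizer settings dictionary shared by several worker tests.

    The values are chosen so that schema validation passes without ever
    invoking an external executable.
    """
    return {"optimizer": "ForceBalance", "max_iterations": 1}


def test_worker_returns_cached_result(mock_forcebalance_settings):
    """The optimizer worker should return a cached result, and not re-run the
    optimization, when the task hash is already present in the cache."""
    ...
```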

Enhanced mock-ups for external functions

Unit tests for BespokeFit are complicated by the need to call out to other software, which must then do complex and time-consuming calculations before returning. A unit test (as I understand it) is supposed to be a quick-but-comprehensive check of a small, isolated part of the codebase; if all the unit tests pass, then all the underlying functionality works, and so the code in aggregate should work as specified by the unit tests. When a single run of a function can take minutes, it becomes impossible to test things comprehensively in a practicably short time period. And even if it were fast, our unit test basically becomes an integration test for someone else's project.

At the moment, this is worked around by mocking all the expensive calculations. Instead of calling out to ForceBalance, for instance, we monkey patch subprocess.run() to return None. I think this is the right approach, but it could be improved by investing some time in making mock-ups that confirm the inputs they are given are what we expect, and then produce credible output. Some of our existing mock-ups do both, but most seem to do only one, the other, or neither. I think this might lead our tests to answer the question "does this invocation of this function work?" rather than "do all invocations of this function have the expected behavior?".
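
As a rough sketch of the kind of mock-up I mean (the ForceBalance invocation here is invented for illustration, not our actual call), something that both validates its inputs and returns a credible result might look like:

```python
import subprocess


def test_mocked_forcebalance_call(monkeypatch, tmp_path):
    """Sketch of a mock-up that both validates its inputs and returns credible output."""

    def mock_run(args, **kwargs):
        # Confirm the inputs we were given are what we expect ...
        assert args[0] == "ForceBalance"
        assert kwargs.get("cwd") == tmp_path
        # ... and produce a plausible completed process rather than just None.
        return subprocess.CompletedProcess(args, returncode=0, stdout=b"converged")

    monkeypatch.setattr(subprocess, "run", mock_run)

    # In a real test this call would happen inside the code under test.
    result = subprocess.run(["ForceBalance", "optimize.in"], cwd=tmp_path)
    assert result.returncode == 0
```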

Refactoring

I have a number of ideas for refactors to set up clearer responsibilities for different parts of the code base and generally flatten the call graph.

There might be good reasons that the code base is the way it is, but I think it might just be the result of time crunch to get the code ready for publication. But if any of this sounds like a bad idea, please let me know!

Moving logic out of workers

This might just be me not understanding the logic behind how the code is laid out, but I don't understand when code should be put in a worker (e.g. openff/bespokefit/executor/services/optimizer/worker.py) vs. when it should be put in its own module (e.g. openff/bespokefit/optimizers). It's clear to me that the worker provides an opportunity for code that runs regardless of which optimizer is being used, but the optimizers module also provides this opportunity, both in the base model and in plain functions in the base module.

What is clear to me is that when something goes wrong, or when I want to add a feature, or I want to understand how a configuration option works, at the moment I have to check at least two places: the worker and the model. I remember this meant an awful lot of jumping from source file to source file and two or three prototype fixes in completely different parts of the codebase when I wrote #193.

I'd like to move as much logic out of the workers as possible. Ideally, the different workers would share most of their code and consist of little more than the shared logic to:

  1. Determine whether to compute the data or retrieve it from cache
  2. Call the step module's compute function
  3. Store the produced data in the cache

And for all other logic to happen within the step module's compute function. This'll flatten the module hierarchy a lot, and move all optimizer logic to the optimizers module, fragmentation logic to the fragmentation module, and so forth. It'll hopefully also simplify testing, as the workers can be tested in aggregate and the logic for each step will be co-located.
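
To make the shape concrete, here's a very rough sketch (compute_hash, the cache object, and the compute functions are all placeholders, not existing code):

```python
def run_stage(task, compute, cache):
    """Generic worker logic shared by every stage.

    ``compute`` is the step module's compute function (fragmentation,
    QC generation, optimization, ...); everything stage-specific lives there.
    """
    # 1. Determine whether to compute the data or retrieve it from the cache
    key = task.compute_hash()
    cached = cache.get(key)
    if cached is not None:
        return cached

    # 2. Call the step module's compute function
    result = compute(task)

    # 3. Store the produced data in the cache
    cache.set(key, result)
    return result
```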

I think similar refactors may be possible in other parts of the codebase, but I haven't found any yet!

Simplification of configuration schemas

BespokeWorkflowFactory is a complex class whose philosophy seems to have wandered over the course of development. It is complex by nature and by design, and that's fine, but I think we can unify the philosophy to make it a little less complicated.

At the moment, I want to make the following changes in this and other schemas:

In general, I would like

New user-facing documentation

Docstrings for all modules, classes and functions would be great, as would descriptions for all environment variables.

Also, does anyone know why the environment variables are prefaced with BEFLOW_ and not BESPOKEFIT_ or something?

Better debugging

Debugging BespokeFit is a pain. I have two changes in mind that should help this:

  1. #199 is languishing with a confusing error message, but it's achievable. Being able to keep all the files we create in a fit, regardless of whether the fit succeeded, will be super useful for debugging.

  2. This is much more minor, but it would be good to bulk up the error reporting machinery to format things more clearly, rather than just reporting, e.g., the traceback as a string in a JSON record.

How to stage all this

I think 0.2.0 comes first. Then I think we at least need integration tests that check scientific validity before we can attempt major refactors, and improved debugging would be a big help. But apart from that I have no concrete plans yet.

I also have no plans for new features. But I think the above will make BespokeFit much easier to use and develop and provide a good base for new features. At the moment, the idea of adding features is... intimidating.

Dear reader, whether you got this far or you just skipped to the end, I would very much appreciate your thoughts!

mattwthompson commented 1 year ago

A unit test (as I understand it) ...

This follows the textbook definition, but our community doesn't really follow that definition. In principle there should be

Where possible I like to split things into tests/unit_tests vs other subdirectories. This is uncommon, though.

Mocking is powerful, but in my experience it makes it harder to follow what's going on, or what's intended to go on. I'm ambivalent on documenting self-contained unit tests (usually something that's hard to read should be split up, or it's not really a unit test), but in my experience tests with several mocked components would be much easier to understand with some more documentation or comments.

As a developer, I'd like to see better support for temporary files so I can, for example, tinker with the optimization step without needing to re-run the fragmentation and QC steps. I also wonder if the current workflow (collect inputs, fragment, QC, optimize) will change at some point in the future (e.g., a neural net mimics QC well enough that fragmentation is not necessary, or an extra step needs to be added to this workflow for reasons not currently imaginable to us).

jthorton commented 1 year ago

Hi @Yoshanuikabundi, thanks for spending the time looking into this! I'll try to respond to the bits I can help with, but overall I agree that a rework would make development easier, as a few PRs like chained tasks (#167) have been stuck for a while. You are correct that this was due to a time crunch, which meant the science part of the workflow I developed and the backend stuff Simon made were quickly pushed together (hence the random name changes between bespoke and beflow)!

There's an average of about two comment lines every three tests, tests often include unfamiliar (at least to me) things like HTTP post requests, and they require a lot of knowledge about the structure of the software which is also undocumented

These mocked tests are something Simon used a lot, as many functions in BespokeFit take a lot of time or rely on calling out to QCArchive, which can be unreliable. I mostly understand them now and can write some docstrings to explain at least what I think is going on, if that would help? I also agree that the use of mocked tests is inconsistent, and it would be great to come up with a pattern that we could apply to them all!

Moving logic out of workers

This has also confused me, and the issue is made more complex by the flexibility I have tried to offer in the workflow. The way I see it, logic should go into its own module where possible, as is the case for optimizers such as ForceBalance, which have an optimize method that does the work of the stage. Ideally, that schema would have some hash function; the worker would check for the hash in the Redis database, and if it is not present, launch the task, cache the ID, and return the complete/running task ID.

However, the fragmentation stage is more complicated than this: as @mattwthompson points out above, what if the user does not want to use fragmentation? So we made this stage optional, allowing users to set it to None in the workflow schema, but we still need to select the torsions to be driven based on the target torsion SMIRKS, and so this logic ended up in the worker.

Maybe now that we have more time we could come up with a better method, perhaps some NullFragmenter schema whose fragment function just mocks the output of the fragmentation stage, targeting the correct torsions? This would allow us to move all of the logic to that module and share the logic with the optimisation stage.
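
Roughly something like this sketch, perhaps (the method signature and return type are just placeholders, not the real fragmenter API; a real version would subclass the existing fragmenter schema base):

```python
from typing import List, Literal

from openff.toolkit.topology import Molecule


class NullFragmenter:
    """Sketch of a 'fragmenter' that performs no fragmentation.

    Its fragment method just mocks the output of the fragmentation stage,
    returning the parent molecule once per matched target torsion, so the
    torsion-selection logic can live here instead of in the worker.
    """

    type: Literal["NullFragmenter"] = "NullFragmenter"

    def fragment(self, molecule: Molecule, target_torsion_smirks: List[str]):
        fragments = []
        for smirks in target_torsion_smirks:
            # Assume the SMIRKS tags the two atoms of the central rotatable bond.
            for central_bond in molecule.chemical_environment_matches(smirks):
                # Each "fragment" is just the whole parent molecule paired with
                # the bond whose torsion should be driven.
                fragments.append((molecule, central_bond))
        return fragments
```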

Stages like QCGeneration were planned to be more complicated, as users could compose different methods: run pre-optimisations with xtb, then use ANI2x or QM to do single points on the final geometries, or even run a full torsiondrive from the pre-optimised starting geometry. This could also be moved to the module level and follow the pattern we want for the other stages, but I couldn't see how the cache could be made to work at the level of each individual task. For example, if a user ran an xtb torsiondrive and wanted single points with ANI2x and then with full QM, they should be able to reuse the xtb scan. I think extending this to more types of workflows will be difficult, but if we can make it work then we can move all or most of the logic out of the workers!

Simplification of configuration schemas

Yes, changing these seems fine to me if it makes them easier to understand and use. I think aiming to simplify the construction of the BespokeWorkflowFactory would be great, as this (along with the configurable stage settings) is the only part of the API users should ever interact with; we envisioned most would use the CLI to execute the workflow.

I think one big issue is how interlinked the settings are. For example, the target_templates are very specific to the ForceBalance optimizer, and it is not clear to me that other optimizers would have settings like this, so maybe they should be absorbed into that schema. In the same way, the QCGeneration method (torsiondrive, optimisation, hessian) is tied to the target and lives within the TargetSchema. This would make the issue of multiple layers worse, but it would make validation easier, as only recognised targets could be set on the ForceBalance optimizer?

Overall, we tried a few iterations of the layout and landed on this one, but it's probably not optimal. I would be interested to hear thoughts on how we could simplify things and make it easier to extend, as I have wanted to add other optimisers for a while now but know it's a big task...

I agree with the rest of your points and would love to see it become easier to add new features. I have plenty of ideas, and some even made it to PRs (#167), but it takes far too long to get them over the line.

mattwthompson commented 1 year ago

An example of why we might want to avoid mocking subprocess.run: it's used by other tools out there.

Yoshanuikabundi commented 1 year ago

Thanks for your input! Sorry it took me so long to notice that you'd replied.

@mattwthompson:

In principle there should be

I'm into that! I guess I blur the line between regression tests and the others a bit; if there's a bug that's missed by the tests, that seems to me to also be a bug in the existing tests, so the appropriate regression test is just a unit (or integration I guess) test of that behavior.

Mocking is powerful but in my experience makes it harder to follow what's going on, or what's intended to go on.

Yeah if we can find a way to test BespokeFit without mocking, I would love that - but it probably means essentially writing a test suite for the inputs we provide to other software. That would probably pay off, but it seems like an enormous amount of work given our current approach of "forward as many options on to our dependencies as possible".

I'm ambivalent on documenting self-contained unit tests (usually something that's hard to read should be split up or it's not really a unit test)

I'd like to at least see a description of the behavior the test is intended to check. If that description fits into the name of the function, that's fine, but I think that either means that it's a good unit test with a focused goal (yay!) or that the description is vague and the test covers a lot of functionality (like test_myfunc_works() for example). In practice, I've found that even simple tests benefit from comments that document the assumptions and logic that the test author is using, and docstrings that describe all of the checked behavior.

@jthorton:

(hence random name changes between bespoke and beflow)!

I guess we'll just have to rename the configuration options then! Big breaking change coming up, I sense.

I mostly understand them now and can write some doc strings to explain at least what I think is going on if that would help?

THAT WOULD BE AWESOME!!!

but we still need to select torsions to be driven based on the target torsion smirks and so this logic ended up in the worker.

This sounds like a problem with target_torsion_smirks to me. Most of the other settings are neatly scoped to the stage in which they're used, but target_torsion_smirks is needed at every stage. Logically it configures the optimization target, but that ripples backwards to affect QCGen and fragmentation. So maybe we just need to put it in the target selection schema, and have the fragmentation stage look there to choose the fragments that are needed - just as the QCGen stage looks there to choose what torsions to drive. Or maybe we create an initialization stage that computes everything for the later stages to do, or even move this logic into BespokeOptimizationSchema generation. In any case, I think we can get this logic out of the workers.
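
For example, something along these lines (TargetSelectionSchema is a made-up name, not an existing class):

```python
from typing import List

from pydantic import BaseModel, Field


class TargetSelectionSchema(BaseModel):
    """Hypothetical home for the torsion selection, so that the fragmentation
    and QCGen stages can both look here rather than re-deriving the selection
    logic in their workers."""

    target_torsion_smirks: List[str] = Field(
        ...,
        description="SMIRKS patterns matching the torsions the bespoke fit should target.",
    )
```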

target_torsion_smirks also stands out to me as a sore point in the configuration because it's unclear what will happen to it if we ever implement non-torsion targets and a user wants to use them and skip the torsion drive. So it's definitely one of the things I wanted to rethink.

Stages like the QCGeneration were planned to be more complicated as users could compose different methods

But the API doesn't actually support this at the moment, right? My gut feeling is that we shouldn't compromise the understandability of the codebase for a feature that we might develop in the future.

I couldn't see how the cache could be made to work on each individual task level

I'm not sure there's an API to actually make use of it, and I might be misunderstanding things... but does #76 support essentially arbitrary stage orderings? Each task could be its own QCGen stage. Then we'd get all of that for free, right? And #167 would just be adding an optional second QCGen stage.

this is the only part of the API along with the configurable stage settings that users should ever interact with as we envisioned most would use the CLI to execute the workflow.

OK, cool, this is great info. In that case I think we should make (most of?) the rest of the API private, and we can just create a user-friendly front-facing Pydantic model with the design we want that gets converted to the existing (private) configuration code behind the scenes. I guess that's basically what BespokeWorkflowFactory already is; let's lean into it. Then we can have one big breaking change, and the behind-the-scenes stuff can be simplified at our leisure.

I think one big issue is how interlinked settings are

Yeah agreed, as I mention with target_torsion_smirks above. The design of the new BespokeWorkflowFactory is obviously going to take some thinking!

Everything basically comes back around to targets, right? You could express the whole configuration as just a list of targets, where each target defines what part of the molecule it works on, how to fragment it, what QCGen to do on it, and how to optimize it. When the workflow factory is applied to a particular molecule, this all gets "compiled" into the stage model we have at the moment. And we could have other configuration classes that inherit from it, like say SMIRKSTorsionWorkflowFactory, which allows all those settings to be specified in one step by giving a SMIRKS pattern and a set of QCGen etc. settings that get applied to all targets, like we have at the moment. That'd let us separate the common case of optimizing torsions with the existing tools from a generalised configuration for any future possibilities.
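
Purely to make the idea concrete, a sketch along these lines (none of these classes exist; the field types are placeholders):

```python
from typing import List

from pydantic import BaseModel


class BespokeTarget(BaseModel):
    """One thing to fit: which part of the molecule it covers, how to fragment
    it, what QC data to generate for it, and how to optimize against it."""

    selection: str  # e.g. a SMIRKS pattern identifying the torsion or parameter
    fragmentation: dict = {}
    qc_generation: dict = {}
    optimization: dict = {}


class GeneralWorkflowFactory(BaseModel):
    """The whole configuration expressed as a list of targets, 'compiled' into
    the existing per-stage schemas when applied to a particular molecule."""

    targets: List[BespokeTarget] = []


class SMIRKSTorsionWorkflowFactory(GeneralWorkflowFactory):
    """Convenience subclass for the current common case: one set of
    fragmentation/QCGen/optimization settings applied to every torsion that
    matches the given SMIRKS patterns."""

    target_torsion_smirks: List[str] = []
    default_settings: dict = {}
```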

Or something. It feels like the sort of thing we would want to experiment with for a bit, which I now realise we can do by adding additional experimental configuration classes. So I guess I'll get around to experimenting with that and finding out how hard it is at some point in the future!

xperrylinn commented 1 year ago

Thanks for sharing this @Yoshanuikabundi! I'm new to Bespokefit and still learning the API. Your analysis provides some nice context by discussing strengths, opportunities for improvement, and potential future directions.

I'm chiming in because I'm starting to work on a master's program capstone project that is focused on integrating re-parameterization into the Materials Project HMD workflow stack. My advisor has decided that BespokeFit is a good option, and we'd like to pursue incorporating it into the workflow. There are a couple of approaches we are considering. The first is to see if we can use the BespokeFit API directly, or create a wrapper to encapsulate the functionality. The second is to explore whether BespokeFit could be configured to interface with pymatgen and atomate for the QC generation step of the workflow. You mentioned there are other places in the codebase that could be refactored. Has there been any thought on creating interfaces for QC generation?

mattwthompson commented 1 year ago

BespokeFit itself calls QCEngine to handle the heavy lifting of QC calculations. If you are looking to interface with QC directly you may want to look into what QCEngine handles and use it directly; it would make more sense to wrap BespokeFit if you have a whole end-to-end pipeline in mind.