openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
http://openforcefield.org
MIT License
318 stars 93 forks source link

Optionally preserve ambertools temporary directory #1931

Open stgeo opened 2 months ago

stgeo commented 2 months ago

Is your feature request related to a problem? Please describe. We've integrated part of the openff-toolkit into a closed-source project and so we need to distribute the AmberTools binaries separately to conform with the GPL. We had some issues when preparing the distribution (paths not set, missing libraries, etc.) and these were difficult to diagnose because the AmberTools wrapper deletes the temporary directory where antechamber is run regardless of whether the system calls were successful or not.

Describe the solution you'd like There should be an option to always keep the temporary directory and ideally the default behavior should be to keep it when antechamber returns a non-zero exit code. Not sure what implementation best fits the philosophy but I suppose it can either be an optional argument in the public methods of the AmberToolsToolkitWrapper (but so far these have the same call signature for all children of ToolkitWrapper), an extra method which configures the behavior, or some global variable.

There are some existing TODOs in the assign_partial_charges code hinting that this is a known issue:

# TODO: Add error handling if antechamber chokes

and

            # Check to ensure charges were actually produced
            if not os.path.exists(f"{tmpdir}/charges.txt"):
                # TODO: copy files into local directory to aid debugging?
                raise ChargeCalculationError(
                    "Antechamber/sqm partial charge calculation failed on "
                    "molecule {} (SMILES {})".format(
                        molecule.name, molecule.to_smiles()
                    )
                )

Describe alternatives you've considered As a workaround, we copied the entire ambertools_wrapper.py into our project and changed the assign_partial_charges method to always keep the temporary directory (diff below). We then manually re-register this patched version into the GLOBAL_TOOLKIT_REGISTRY. This is not ideal since we would need to port upstream changes there (or fork the whole openff-toolkit project to change these few lines).

--- ambertools_wrapper.py   2024-02-28 17:01:13.937524748 +0100
+++ ambertools_wrapper.py   2024-08-30 08:58:22.566703437 +0200
@@ -202,7 +202,8 @@
             )

         # Compute charges
-        with tempfile.TemporaryDirectory() as tmpdir:
+        # Note - we do not delete the temporary directory
+        if tmpdir := tempfile.mkdtemp(prefix='antechamber_'):
             net_charge = mol_copy.total_charge.m_as(unit.elementary_charge)
             # Write out molecule in SDF format
             # TODO: How should we handle multiple conformers?
@@ -212,51 +213,68 @@
             # Compute desired charges
             # TODO: Add error handling if antechamber chokes
             short_charge_method = charge_method["antechamber_keyword"]
-            subprocess.check_output(
-                [
-                    "antechamber",
-                    "-i",
-                    "molecule.sdf",
-                    "-fi",
-                    "sdf",
-                    "-o",
-                    "charged.mol2",
-                    "-fo",
-                    "mol2",
-                    "-pf",
-                    "yes",
-                    "-dr",
-                    "n",
-                    "-c",
-                    str(short_charge_method),
-                    "-nc",
-                    str(net_charge),
-                ],
-                cwd=tmpdir,
-            )
-            # Write out just charges
-            subprocess.check_output(
-                [
-                    "antechamber",
-                    "-dr",
-                    "n",
-                    "-i",
-                    "charged.mol2",
-                    "-fi",
-                    "mol2",
-                    "-o",
-                    "charges2.mol2",
-                    "-fo",
-                    "mol2",
-                    "-c",
-                    "wc",
-                    "-cf",
-                    "charges.txt",
-                    "-pf",
-                    "yes",
-                ],
-                cwd=tmpdir,
-            )
+            with open(f"{tmpdir}/antechamber_1.out", 'wb') as of:
+                try:
+                    output = subprocess.check_output(
+                        [
+                            "antechamber",
+                            "-i",
+                            "molecule.sdf",
+                            "-fi",
+                            "sdf",
+                            "-o",
+                            "charged.mol2",
+                            "-fo",
+                            "mol2",
+                            "-pf",
+                            "yes",
+                            "-dr",
+                            "n",
+                            "-c",
+                            str(short_charge_method),
+                            "-nc",
+                            str(net_charge)
+                        ],
+                        cwd=tmpdir,
+                        stderr=subprocess.STDOUT
+                    )
+                except subprocess.CalledProcessError as err:
+                    output = err.output
+                    raise err
+                finally:
+                    of.write(output)
+            with open(f"{tmpdir}/antechamber_2.out", 'wb') as of:
+                # Write out just charges
+                try:
+                    output = subprocess.check_output(
+                        [
+                            "antechamber",
+                            "-dr",
+                            "n",
+                            "-i",
+                            "charged.mol2",
+                            "-fi",
+                            "mol2",
+                            "-o",
+                            "charges2.mol2",
+                            "-fo",
+                            "mol2",
+                            "-c",
+                            "wc",
+                            "-cf",
+                            "charges.txt",
+                            "-pf",
+                            "yes",
+                        ],
+                        cwd=tmpdir,
+                        stderr=subprocess.STDOUT
+                    )
+                except subprocess.CalledProcessError as err:
+                    output = err.output
+                    raise err
+                finally:
+                    of.write(output)
+
             # Check to ensure charges were actually produced
             if not os.path.exists(f"{tmpdir}/charges.txt"):
                 # TODO: copy files into local directory to aid debugging?
mattwthompson commented 2 months ago

This is up to @j-wags but I'm a little down on this idea, or maybe I don't understand the motivation. Adding more and more keyword arguments to toolkit wrappers rapidly increases the complexity and makes the toolkit harder and harder to test and maintain, so IMHO there should be a high barrier to increasing API surface in these modules. This one in particular would probably only ever interact with AmberToolsToolkitWrapper, but would be passed through the rest of the machinery (Molecule, etc.).

If it's a matter of debugging installation issues, there are plenty of ways that can be handled externally to this. We have a script we sometimes use that reports information about the user's machine and setup, for example. If the installation issues stem from packaging ecosystems other than the one we support, then that's quickly of scope with what we handle. If there are modifications you want to make to how AmberTools is ultimately called, that can be done externally and the charges/conformers/etc. brought to the toolkit with the existing API. (charge_from_molecules, add_conformers, etc.) - these wrappers are meant to comprise a simple and uniform API, not something that goes deep into the nuts and bolts of each underlying tool.

stgeo commented 2 months ago

I see your point - I also don't think this belongs in the API. But each toolkit wrapper has a lot of unique (possibly private) functions outside the uniform API, so I think adding a toggle (e.g. property, possibly private) to AmberToolsToolkitWrapper which decides what to do with the intermediate files is a good option. Then one can instantiate the wrapper, set the toggle as desired, and register it in the ToolkitRegistry (or even get the existing one from registered_toolkits and just set the toggle in place).

Your suggestion to call AmberTools (or rather antechamber) myself and provide the charges manually would of course work, and we may even end up going that route for other reasons.

However, I still think that if a single python function call (in this case assign_partial_charges) produces multiple temporary files as a result of multiple system calls, which may fail not only due to installation issues, but also one of the many possible reasons that the AmberTools binaries may throw an error, it would be helpful to provide the user with a bit more than a generic exception. Giving them access to the files allows you to say "try to figure out what went wrong inside AmberTools by looking at these files" and wash your hands of the matter. There is even an existing TODO in the code with this exact sentiment - I forgot to mention it in the original post but I will add it now:

            # Check to ensure charges were actually produced
            if not os.path.exists(f"{tmpdir}/charges.txt"):
                # TODO: copy files into local directory to aid debugging?
                raise ChargeCalculationError(
                    "Antechamber/sqm partial charge calculation failed on "
                    "molecule {} (SMILES {})".format(
                        molecule.name, molecule.to_smiles()
                    )
                )

To me this looks like the issue was already on your table so I would just like to add my vote in favor :slightly_smiling_face:

j-wags commented 2 months ago

I think adding a toggle (e.g. property, possibly private) to AmberToolsToolkitWrapper which decides what to do with the intermediate files is a good option. Then one can instantiate the wrapper, set the toggle as desired, and register it in the ToolkitRegistry (or even get the existing one from registered_toolkits and just set the toggle in place).

Agree. I think that adding a new attribute like AmberToolsToolkitWrapper.temporary_file_directory would be a good solution here. It would default to None, which would give the current behavior, but if the user set it then the temp files would be kept. It wouldn't interfere with the standard interface, but would let power users customize whether temp files are kept.

I'd be happy to take a PR that does this! The test could run AM1BCC on something really small (like water) and ensure that some expected files are present in the specified directory.

stgeo commented 2 months ago

Thanks, I'll see what I can do about that PR. Since I haven't contributed before and am not familiar with the layout of the project and the requirements for PRs and tests, would you kindly point me towards the relevant guidelines? I've seen this: https://docs.openforcefield.org/projects/toolkit/en/stable/users/developing.html - is there anything more detailed?

j-wags commented 2 months ago

Sure! Unfortunately we don't have better resources on how to contribute, but a general outline for this one would be: