openforcefield / openff-bespokefit

Automated tools for the generation of bespoke SMIRNOFF format parameters for individual molecules.
https://docs.openforcefield.org/bespokefit
MIT License

The executor fails with multiple ligands #237

Open ognjenperisic opened 1 year ago

ognjenperisic commented 1 year ago

Dear members of the BespokeFit community,

I need your help/advice. I have a problem processing multiple ligands (ligand sets in parallel) with BespokeFit. Jobs often fail, usually at the optimization stage (see the error message below). A failed task can usually be finished if it is restarted, or if the ligand is processed as a single BespokeFit job. I tested BespokeFit with the fep-benchmark set and with proprietary ligands, mostly with the semiempirical xtb code and a few with psi4.

I followed @jthorton's advice given here previously (https://github.com/openforcefield/openff-bespokefit/issues/215).

I have a feeling that the issue has something to do with memory management. I gave the scripts I use to a colleague of mine, and his jobs also failed multiple times on a Linux machine he uses as a personal computer.

I would appreciate any advice you can share.

I run BespokeFit with 1 or 2 cores per QC compute worker (4 for psi4 jobs):

from openff.bespokefit.executor import BespokeExecutor, BespokeWorkerConfig

with BespokeExecutor(
    n_fragmenter_workers=12,
    n_optimizer_workers=40,
    n_qc_compute_workers=40,
    qc_compute_worker_config=BespokeWorkerConfig(n_cores=2),
) as executor:
    # Submit each optimization in the workflow to the executor
    ids = []
    for optimization_schema in workflow_schema:
        task_id = executor.submit(optimization_schema)
        ids.append(task_id)
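
The executor only accepts and runs tasks while the with block is open, so the script has to keep it alive until the submitted tasks finish. A minimal sketch of that pattern using wait_until_complete (same configuration as above; workflow_schema is assumed to be the list of optimization schemas being submitted):

from openff.bespokefit.executor import (
    BespokeExecutor,
    BespokeWorkerConfig,
    wait_until_complete,
)

with BespokeExecutor(
    n_fragmenter_workers=12,
    n_optimizer_workers=40,
    n_qc_compute_workers=40,
    qc_compute_worker_config=BespokeWorkerConfig(n_cores=2),
) as executor:
    # Submit every optimization, then block until each task has finished
    # (successfully or not); the executor shuts down when the block exits.
    ids = [executor.submit(schema) for schema in workflow_schema]
    results = [wait_until_complete(task_id) for task_id in ids]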

When the jobs are done, I extract the individual force fields with the following code:

import sys

import pandas as pd
from simtk import unit

from openff.bespokefit.executor import BespokeExecutor

optimization_id = sys.argv[1]
output = BespokeExecutor.retrieve(optimization_id=optimization_id)

print(output.status)
print(output.error)
print(optimization_id)
print("====================================================")

if output.error is None:
    # Write the bespoke force field for this optimization to an OFFXML file
    force_field = output.bespoke_force_field
    force_field.to_file("force_field_from_QMM_id_{0}.offxml".format(optimization_id))

    print(output.results.input_schema.target_torsion_smirks)

    # Tabulate the initial and refit torsion force constants (k1-k4) per SMIRKS
    parameter_data = []
    for i, (parameter, initial_values) in enumerate(
        output.results.input_schema.initial_parameter_values.items()
    ):
        for final_parameter, final_values in output.results.refit_parameter_values.items():
            if parameter.smirks == final_parameter.smirks:
                print(parameter.smirks)
                for term in range(1, 5):
                    k_before = initial_values[f"k{term}"].value_in_unit(unit.kilocalorie_per_mole)
                    k_after = final_values[f"k{term}"].value_in_unit(unit.kilocalorie_per_mole)
                    parameter_data.append(
                        [f"smirks_{i}_k{term}", k_before, k_after, k_after - k_before]
                    )

    df = pd.DataFrame(parameter_data, columns=["parameter", "before", "after", "change"])
    print(df)
    df.to_csv("optimized_parameters_id_{}.csv".format(optimization_id), index=False)

and merge them with

rm -f cffs.sh
echo "openff-bespoke combine --output \"combined_bespokefit_forcefield_for_cdk8_cmet_eg5_pfkfb3.offxml\"  \\" | tee cffs.sh

for filename in force*.offxml; do
    echo "$filename"
    echo "--ff \"${filename}\"               \\" >> cffs.sh
done

chmod +x cffs.sh
./cffs.sh
Error message:
{"type": "RuntimeError", "message": "ConvergenceFailure: The optimization failed to converge.\nNone", "traceback": "Traceback (most 
 recent call last):\n  File \"/home/ognjen/anaconda3/envs/bespokefit/lib/python3.8/site-packages/celery/app/trace.py\", line 451, in 
 trace_task\n    R = retval = fun(*args, **kwargs)\n  File                                                                           
 \"/home/ognjen/anaconda3/envs/bespokefit/lib/python3.8/site-packages/celery/app/trace.py\", line 734, in __protected_call__\n       
 return self.run(*args, **kwargs)\n  File                                                                                            
 \"/home/ognjen/anaconda3/envs/bespokefit/lib/python3.8/site-packages/openff/bespokefit/executor/services/optimizer/worker.py\", line
 74, in optimize\n    raise (\nRuntimeError: ConvergenceFailure: The optimization failed to converge.\nNone\n"}   
Hardware:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          94
On-line CPU(s) list:             0-93
Thread(s) per core:              1
Core(s) per socket:              94
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
Stepping:                        4
CPU MHz:                         2693.670
BogoMIPS:                        5387.34
Virtualization:                  VT-x
L1d cache:                       3 MiB
L1i cache:                       3 MiB
L2 cache:                        376 MiB
L3 cache:                        16 MiB
NUMA node0 CPU(s):               0-93
....
OS:
NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.5 LTS"
VERSION_ID="20.04"
j-wags commented 1 year ago

Thanks for writing in. I don't have an immediate workaround but @jthorton or @Yoshanuikabundi might chime in.

One rough tip might be to relax the convergence criteria to something less strict, like GAU_LOOSE, as I did here, though being academic-adjacent I have to include the disclaimer that "you'll then be using a method that differs from the published results" (though I doubt that matters in a real project).

Could you clarify whether the result retrieval fails for all molecules, or just the ones with convergence failures?

Since psi4/xtb/qcengine are upstream tools, we can't directly implement fixes for them, and if the convergence failures are indeed due to randomness stemming from memory availability, both successes and failures could be valid outputs from those tools for the same input.

So I think this will be a longer-term usability improvement on our end. There are a number of things that can cause a BespokeFit run to fail, and they fall on a spectrum from "controlled failures" (like raising an error because bonds broke in QM, or because elements are out of scope for downstream tools) to "uncontrolled failures" (like a single unconverged optimization bringing down the entire database).

Controlled failures are intentional and should come up in situations where we want to communicate that users are better off using the normal FF parameters than the outputs of BespokeFit. We may want to document these better so that people like you who build workflows know what to expect and can choose how to handle them.

So I think there are a few actions we can take here:

cc #224

gavinbas commented 1 year ago

Hey Jeff - thanks for the quick response!

I just want to add that this is currently a pretty big roadblock for us benchmarking/testing this at redesign. The reason we think it's a memory/thread-allocation problem is that, apart from the error message above, we've also frequently observed hung jobs and seg-faulted jobs that complete upon restarting. Additionally, if we provide a list of ~10 ligands it will fail, but splitting the 10 ligands into two folders with ~5 ligands each and running two instantiations concurrently on the same hardware leads to both completing without issue. These are not particularly strange ligands with poor geometries; they're from FEP benchmark sets with crystal or crystal-adjacent poses. The hardware we're testing on should also have plenty of resources (96 cores), but if there's a specific hardware config that's known to work we can probably get whatever is needed.

I wonder if there are some sub-optimal or hardware-agnostic celery calls that work in the test environment but fail or cause issues on our end? If there's a debugging protocol we could run, maybe that would tell us more. Regardless, in the current setup it takes enough time/babysitting that unfortunately we can't use it for any practical benchmarking. The few sets that we've completed show pretty significant changes and have provided interesting insights, so we're keen on getting this scaled up.

Thanks! Gavin

jthorton commented 1 year ago

Hi all,

I think this is an issue with the ForceBalance optimisation stage (going by the error message), and unfortunately the error reporting is minimal, as this stage does not often fail to converge. Ideally we want the ForceBalance log file to be printed or saved somewhere for better debugging; we could look into adding that.

If this is a memory issue, you could try setting max_memory in the BespokeWorkerConfig (see here). This is the maximum memory per core that a worker can use; if it is not provided, each worker can use all of the memory. It's best to set this for the qc_compute_workers, as they can use a lot of memory and could cause other jobs to fail.
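
For example, a minimal sketch of capping the QC workers (the value of max_memory here is only illustrative, and I believe it is interpreted as memory per core; please check the BespokeWorkerConfig documentation for the exact meaning and units):

from openff.bespokefit.executor import BespokeExecutor, BespokeWorkerConfig

with BespokeExecutor(
    n_fragmenter_workers=1,
    n_optimizer_workers=4,
    n_qc_compute_workers=8,
    # Limit each QC worker to 2 cores and (assumed) ~4 GB per core so the
    # QC stage cannot exhaust the machine's memory.
    qc_compute_worker_config=BespokeWorkerConfig(n_cores=2, max_memory=4),
) as executor:
    ...  # submit and wait on tasks as usual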

There should also be some log files in a directory called bespoke-executor, which might give more information on what's gone wrong.

ognjenperisic commented 1 year ago

Thank you @jthorton. For my runs, the main problem is the ForceBalance stage. The fragmentation and qc-generation (XTB) phases are straightforward and finish in an hour or so (once the qc-generation failed, but a repeated run produced the data). During the ForceBalance phase, however, a job will either finish or fail. Some jobs finish in an hour or so, but some take many hours (once it took more than 12 hours), and sometimes the ForceBalance routine appears to enter an infinite loop and runs forever (I have a job that has been running for two days now).

Here is a typical error message:

openff-bespoke executor watch --id 67

────────────── OpenFF Bespoke ───────────────────

[✓] fragmentation successful
[✓] qc-generation successful
[x] optimization failed

 {"type": "RuntimeError", "message": "ConvergenceFailure: The optimization failed to converge.\nNone", "traceback": "Traceback (most 
 recent call last):\n  File \"/home/ognjen/anaconda3/envs/bespokefit/lib/python3.8/site-packages/celery/app/trace.py\", line 451, in 
 trace_task\n    R = retval = fun(*args, **kwargs)\n  File                                                                           
 \"/home/ognjen/anaconda3/envs/bespokefit/lib/python3.8/site-packages/celery/app/trace.py\", line 734, in __protected_call__\n       
 return self.run(*args, **kwargs)\n  File                                                                                            
 \"/home/ognjen/anaconda3/envs/bespokefit/lib/python3.8/site-packages/openff/bespokefit/executor/services/optimizer/worker.py\", line
 74, in optimize\n    raise (\nRuntimeError: ConvergenceFailure: The optimization failed to converge.\nNone\n"}    

I checked the bespoke-executor directory and did not find much. The above message appears in the celery-optimizer.log file. The message suggests that the issue may be with the celery tool.

ognjenperisic commented 1 year ago

I had a brainstorming session on this issue with colleagues at redesign, and we came to the conclusion that we would like to try new strategies in an attempt to avoid the executor/celery/redis issues. So I have a few questions/proposals:

  1. We can run a separate executor on each molecule and then join the outputs into a single FF (via the script posted above). This approach is less efficient, as some of the QC/ForceBalance calculations will be repeated (if molecules share fragments), but it will, I guess, be more stable.

  2. Can we run the fragmenter as a separate task and save its output as a set of structures (e.g., a set of sdf files)? We could then deduplicate those fragments to obtain a minimal set (if molecules share some fragments). That final set could be processed with an independent executor for each fragment (an independent copy of the redis database/celery per fragment). Finally, we could combine the outputs into a single FF.

  3. Can we use the bespokefit environment to run each step separately, without invoking the executor? The idea is to run the fragmenter first, then the QC code, and then ForceBalance, without relying on the executor/celery/redis.

Yoshanuikabundi commented 1 year ago

Hi @ognjenperisic,

The new version of BespokeFit (0.2.1) includes an option to preserve more information from the ForceBalance run - try running the Executor with the BEFLOW_KEEP_TMP_FILES environment variable set.

You can definitely run the fragmenter directly! Something like

import openff.bespokefit.executor.services._settings # Workaround for bug I just discovered
from openff.bespokefit.workflows import BespokeWorkflowFactory
from openff.fragmenter.fragment import WBOFragmenter
from openff.toolkit import Molecule

parent = Molecule.from_smiles("CNC(=O)C1=NN=C(C=C1NC2=CC=CC(=C2OC)C3=NN(C=N3)C)NC(=O)C4CC4")

workflow_factory = BespokeWorkflowFactory()
fragmenter = workflow_factory.fragmentation_engine # or instantiate WBOFragmenter directly
target_torsion_smirks = workflow_factory.target_torsion_smirks

fragmentation_result = fragmenter.fragment(parent, target_bond_smarts=target_torsion_smirks)

fragment_molecules = [fragment.molecule for fragment in fragmentation_result.fragments]

See https://docs.openforcefield.org/projects/fragmenter/en/stable/api.html
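
On your point 2, a rough sketch (not an official workflow; the file naming is illustrative) of deduplicating the fragments from several parents by canonical SMILES and writing each unique fragment to an SDF file:

# fragment_molecules collected from one or more fragmentation_result objects,
# as in the snippet above
unique_fragments = {}
for fragment in fragment_molecules:
    # Use the canonical isomeric SMILES as a key to drop duplicate fragments
    smiles = fragment.to_smiles(isomeric=True, explicit_hydrogens=False, mapped=False)
    unique_fragments.setdefault(smiles, fragment)

for i, fragment in enumerate(unique_fragments.values()):
    # Make sure there is a conformer so the SDF contains 3D coordinates
    if fragment.n_conformers == 0:
        fragment.generate_conformers(n_conformers=1)
    fragment.to_file(f"fragment_{i}.sdf", file_format="sdf")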

Unfortunately I'm not aware of any good way to run individual stages. You could try running the celery workers directly, but they're not documented and take raw json strings as input so I wouldn't recommend it.

ognjenperisic commented 1 year ago

Hi @Yoshanuikabundi ,

thank you very much for your assistance. I managed to run BespokeFit (with the xtb code) on individual ligands in parallel by setting the BEFLOW_OPTIMIZER_KEEP_FILES, BEFLOW_GATEWAY_PORT, and BEFLOW_REDIS_PORT environment variables. Here is how I did it. First, I prepared a setup file (parameters.yaml) like this:

BEFLOW_OPTIMIZER_KEEP_FILES: True
BEFLOW_GATEWAY_PORT: 8000
BEFLOW_REDIS_PORT: 6363
directory: /home/ognjen/...../ligands
ligand_filename: ligands.sdf
ligand_index: 5

The ligand_index variable has a double role: it is used to read a ligand from the sdf file, and it is also added as an offset to the BEFLOW_GATEWAY_PORT and BEFLOW_REDIS_PORT values so that each ligand gets its own ports; see below:

import os
import sys

import yaml

try:
    with open("parameters.yaml", "r") as file:
        params = yaml.load(file, Loader=yaml.FullLoader)

    filename = params["ligand_filename"]
    del params["ligand_filename"]
    print(filename)

    ligand_index = params["ligand_index"]
    del params["ligand_index"]
    print(ligand_index)

    directory = params["directory"]
    del params["directory"]
    print(directory)

except FileNotFoundError:
    print("The file parameters.yaml does not exist!")
    sys.exit()

# Offset the ports by the ligand index so each executor gets its own
# gateway and redis instance.
params["BEFLOW_REDIS_PORT"] += ligand_index
params["BEFLOW_GATEWAY_PORT"] += ligand_index

for k, v in params.items():
    print(k, v)
    os.environ[k] = str(v)

The environment variables are set before declaring the factory.

I prepare a directory for each ligand and then copy the script and parameters.yaml files to each of the directories. In each directory, I just change the ligand_index in the .yaml file and start the executor. This method enables running multiple executors/redis databases on the same machine at the same time. It also enables better control of the execution, but the drawback is that calculations are repeated if some segments are shared among the ligands.