openforcefield / openff-bespokefit

Automated tools for the generation of bespoke SMIRNOFF format parameters for individual molecules.
https://docs.openforcefield.org/bespokefit
MIT License

Parallelisation with --n-qc-compute-workers #166

Open xiki-tempula opened 2 years ago

xiki-tempula commented 2 years ago

I'm running openff-bespokefit with 12 workers on a 96-core machine using the command `openff-bespoke executor run --file "lig.sdf" --workflow "default" --output "lig.json" --output-force-field "lig.offxml" --n-qc-compute-workers 12 --qc-compute-n-cores 8 --default-qc-spec psi4 B3LYP-D3BJ DZVP`, but I noticed that only 5 workers are being used (only 5 psi4 processes are running). How is the parallelisation set up, so that I can better allocate the resources? Thank you.

jthorton commented 2 years ago

Hi @xiki-tempula, good question! This is a tricky one, and it's hard to know in advance how best to split the resources, as this depends on the number of torsiondrive tasks produced for the molecule. Currently each worker consumes one torsiondrive task at a time, so the fact that you have 5 active tasks probably means the molecule produces 5 torsiondrives. In that case it would be better to decrease the number of workers but give each of them more cores.
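For your case that could look something like the following (a sketch only; it assumes the molecule really does produce 5 torsiondrives, and the best core count per worker depends on how well psi4 scales for your system):

```shell
# One worker per torsiondrive task: 5 workers x 16 cores = 80 of the
# 96 cores in steady use, instead of 12 workers of which only 5 are
# ever active.
openff-bespoke executor run --file "lig.sdf" \
    --workflow "default" \
    --output "lig.json" \
    --output-force-field "lig.offxml" \
    --n-qc-compute-workers 5 \
    --qc-compute-n-cores 16 \
    --default-qc-spec psi4 B3LYP-D3BJ DZVP
```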

We can also add some parallelisation within the torsiondrive tasks themselves by performing multiple constrained optimisations simultaneously, which is controlled via an environment variable (note that there are a lot of environment variables in bespokefit; see here). The important one is `BEFLOW_QC_COMPUTE_WORKER_N_TASKS`, which controls how many parallel optimisations each worker can run within a torsiondrive. Running `export BEFLOW_QC_COMPUTE_WORKER_N_TASKS=2` before a run (you can also set this in your bashrc) would allow each worker to run up to 2 optimisations at a time.
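That is:

```shell
# Set in the same shell session before launching the executor
# (or add to ~/.bashrc to make it permanent):
export BEFLOW_QC_COMPUTE_WORKER_N_TASKS=2
```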

I would also look at adding `--qc-compute-max-mem` to your run command; this controls how much memory per core the workers can use. By default bespokefit will try to give every worker access to all of the machine's memory, which can lead to segfaults.

xiki-tempula commented 2 years ago

@jthorton Thanks for the explanation. I have been using TorsionDrive in my research and have done some dihedral parameterisation myself. TorsionDrive uses Work Queue from cctools for parallelisation. I set up a worker with `work_queue_worker --cores=96`, and I modified the torsiondrive source code slightly so that it submits an 8-core job whenever a job is ready. With this setup, for a molecule with 6 dihedrals, I can spawn 12 jobs x 8 cores at the start and then dynamically keep all 96 cores occupied. I wonder if bespokefit could have a similar setup, where we set the maximum number of cores and bespokefit dynamically fills all the available capacity?
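For reference, the Work Queue side of that setup is just a single pooled worker (the host and port below are placeholders for wherever the torsiondrive server is listening; 9123 is the Work Queue default):

```shell
# One Work Queue worker owning all 96 cores; the (locally modified)
# torsiondrive submits 8-core tasks into it as they become ready, so
# up to 12 constrained optimisations run at once and the pool stays full.
work_queue_worker --cores=96 localhost 9123
```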

With regard to `BEFLOW_QC_COMPUTE_WORKER_N_TASKS`: I ran `export BEFLOW_QC_COMPUTE_WORKER_N_TASKS=2`, but it seems that still only 5 8-core jobs were running at the same time, which is the same behaviour as with `export BEFLOW_QC_COMPUTE_WORKER_N_TASKS=1`.

jthorton commented 2 years ago

I like that idea! Users could supply a total number of cores to the `openff-bespoke executor run` entry point, and it would spin up N workers with X tasks per worker to best get through the jobs. I'll definitely look into adding this feature!

> With regard to `BEFLOW_QC_COMPUTE_WORKER_N_TASKS`.

I think you might be running into a settings choice we hardcoded here: we found that Psi4 gave good performance with 8 cores, so we only divide a worker's cores between tasks if each task can have at least 8 cores. You would therefore need to give each worker 16 cores to have both tasks running. Maybe we should remove this hard limit and let users decide. I also think I am using the wrong torsiondrive procedure here and need to change it to our custom parallel version. I'll make a PR to fix these two issues!
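Concretely (a sketch, not a tested recipe): with your original 8-core workers, 8 / 2 = 4 cores per task falls below the hardcoded floor, so the split never triggers and you see one optimisation per worker. A configuration that clears the floor on your 96-core machine would be:

```shell
# 5 workers x 16 cores = 80 cores; with N_TASKS=2 each of a worker's
# two simultaneous optimisations gets 16 / 2 = 8 cores, meeting the
# hardcoded minimum.
export BEFLOW_QC_COMPUTE_WORKER_N_TASKS=2
openff-bespoke executor run --file "lig.sdf" \
    --workflow "default" \
    --output "lig.json" \
    --output-force-field "lig.offxml" \
    --n-qc-compute-workers 5 \
    --qc-compute-n-cores 16 \
    --default-qc-spec psi4 B3LYP-D3BJ DZVP
```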

xiki-tempula commented 1 year ago

Hi, with regard to the parallelisation: I wonder whether this is currently done at the fragment level, where each fragment occupies a worker; at the torsion level, where each torsion scan occupies a worker; or at the TorsionDrive level, where TorsionDrive attempts a forward and a backward drive, so each torsion spawns at least two workers?