openmm / openmm-torch

OpenMM plugin to define forces with neural networks

Multi-GPU support? #125

Open jchodera opened 8 months ago

jchodera commented 8 months ago

How can we best support parallelization of ML potentials across GPUs?

We're dealing with models that are small enough to be replicated on each GPU, and only O(N) data (positions, box vectors) needs to be sent and O(N) data (forces) accumulated. Models like ANI should be trivially parallelizable across atoms.

peastman commented 8 months ago

OpenMM's infrastructure for parallel execution can in principle be applied to any Force. Internally it creates a separate ComputeContext for each device, and a separate copy of the KernelImpl for each one. All of them get executed in parallel, and any energies and forces they return are summed.
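For context, this is the same mechanism a user triggers today by listing more than one device in the platform properties: OpenMM then creates a ComputeContext per device and splits the work of each standard Force between them. A minimal sketch, assuming a CUDA build and two visible GPUs (the tiny stand-in System is just a placeholder):

```python
import openmm as mm
from openmm import unit

# Stand-in System; in practice this comes from your topology / force field.
system = mm.System()
for _ in range(10):
    system.addParticle(1.0)

integrator = mm.LangevinMiddleIntegrator(300*unit.kelvin, 1/unit.picosecond,
                                         0.002*unit.picoseconds)
platform = mm.Platform.getPlatformByName('CUDA')
# A comma-separated DeviceIndex makes OpenMM create one ComputeContext per device
# and sum the energies and forces computed on each of them.
properties = {'DeviceIndex': '0,1'}
context = mm.Context(system, integrator, platform, properties)
```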

The challenge is figuring out what each of those KernelImpls should do when it gets invoked. For many Forces this is simple. With most bonded forces, we can just divide up the bonds between GPUs, with each one computing a different subset. NonbondedForce is a bit more complicated, but we have ways of doing it.

What would TorchForce do? It doesn't know anything about the internal structure of the model. It just gets invoked once, taking all coordinates as inputs and producing the total energy as output. So the division of work would have to be done inside the model itself. We could pass in a pair of integers telling it how many devices it was executing on, and the index of the current device. The model would have to decide what to do with those inputs such that each device would do a similar amount of work, and the total energy would add up to the correct amount.
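To make the proposal concrete, here is a purely hypothetical sketch of what such a model could look like. The extra `num_devices` and `device_index` inputs are *not* part of the current TorchForce API; they are the two integers suggested above, and the per-atom energy is a placeholder:

```python
import torch

class ShardedEnergyModel(torch.nn.Module):
    """Hypothetical sketch: each device evaluates only its own slice of atoms,
    so the per-device energies sum to the total energy."""

    def forward(self, positions, num_devices: int, device_index: int):
        n = positions.shape[0]
        # Contiguous slice of atoms assigned to this device.
        chunk = (n + num_devices - 1) // num_devices
        start = device_index * chunk
        end = min(start + chunk, n)
        my_positions = positions[start:end]
        # Placeholder per-atom energy; a real model would run its network here.
        return torch.sum(my_positions**2)
```

This kind of partition is only straightforward for strictly atom-additive models; anything with interactions between atoms would also need to see the neighbors of its slice.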

RaulPPelaez commented 8 months ago

Perhaps this would be something for NNPOps. We could provide drop-in implementations of selected models there that would be multi-GPU aware. This would need to be done on a model-by-model basis. I will leave these here for reference:
https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html#torch.nn.DataParallel
https://pytorch.org/docs/stable/multiprocessing.html
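For reference, torch.nn.DataParallel replicates a module on each visible GPU, scatters the input along dimension 0 (here, atoms), and gathers the outputs. A minimal sketch, assuming a toy atom-additive energy so that splitting atoms and summing the per-device partial energies gives the right total:

```python
import torch

class ToyAdditiveEnergy(torch.nn.Module):
    # Toy per-atom-additive energy, so splitting atoms across GPUs is valid.
    def forward(self, positions):
        # Return a 1-element tensor so DataParallel can concatenate the outputs.
        return torch.sum(positions**2).unsqueeze(0)

model = ToyAdditiveEnergy().cuda()
# Replicates the module on each GPU and scatters dim 0 (atoms) across them.
parallel_model = torch.nn.DataParallel(model)

positions = torch.randn(10000, 3, device='cuda')
energy = parallel_model(positions).sum()  # sum the per-device partial energies
```

Note that DataParallel is a Python-level wrapper, so embedding it inside a scripted model saved for TorchForce would take extra care; this only illustrates the idea of splitting atoms across devices.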

xiaowei-xie2 commented 1 month ago

Hi, I was wondering whether there is a way to run REMD (ReplicaExchangeSampler) with TorchForce on multiple GPUs?

peastman commented 1 month ago

It should work exactly like any other force. Replica exchange is implemented at a higher level, using multiple Contexts for the replicas. It doesn't care how the forces in each Context are computed.
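A minimal sketch of that point: each replica lives in its own Context, and each Context can be pinned to its own GPU through the DeviceIndex platform property. The tiny stand-in System below is a placeholder for the System that actually contains your TorchForce:

```python
import openmm as mm
from openmm import unit

# Stand-in System; in practice this is the System containing your TorchForce.
system = mm.System()
for _ in range(10):
    system.addParticle(1.0)

platform = mm.Platform.getPlatformByName('CUDA')
contexts = []
for device in ('0', '1'):  # one replica per GPU
    integrator = mm.LangevinMiddleIntegrator(300*unit.kelvin, 1/unit.picosecond,
                                             0.002*unit.picoseconds)
    # Each Context is an independent replica, pinned to its own device.
    contexts.append(mm.Context(system, integrator, platform, {'DeviceIndex': device}))

# A replica-exchange driver (e.g. openmmtools' ReplicaExchangeSampler) then
# propagates each Context independently and periodically swaps states between them.
```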

xiaowei-xie2 commented 1 month ago

Oh nice! Could you provide a simple example of how to do this? I came across this issue https://github.com/choderalab/openmmtools/issues/648, but could not figure out how to do it exactly.

peastman commented 1 month ago

I suggest asking on the openmmtools repo. The question isn't related to this package.

xiaowei-xie2 commented 1 month ago

Ok, I will do that. Thank you!