openmm / openmm

OpenMM is a toolkit for molecular simulation using high performance GPU code.

Compile kernels in parallel? #3107

Open tristanic opened 3 years ago

tristanic commented 3 years ago

Just wondering... is there any in-principle reason preventing ContextImpl::initialize() from running the initialization of each kernel in parallel? If it's possible, then std::async would make the actual implementation fairly straightforward... could possibly shave a couple of seconds off startup time for machines with a reasonable number of cores (which I realise is quite trivial for almost every application other than my own).
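Roughly the kind of thing I have in mind (just a sketch: a hypothetical Kernel type stands in for whatever ContextImpl actually loops over, and it assumes each per-kernel initialization is independent, which is of course the real question):

#include <future>
#include <vector>

// Hypothetical stand-in for the real per-kernel initialization.
struct Kernel {
    void initialize() { /* expensive setup, e.g. kernel compilation */ }
};

// Fan the per-kernel initialization out over std::async tasks, then wait for all of them.
void initializeKernelsInParallel(std::vector<Kernel*>& kernels) {
    std::vector<std::future<void>> tasks;
    for (Kernel* k : kernels)
        tasks.push_back(std::async(std::launch::async, [k] { k->initialize(); }));
    for (auto& t : tasks)
        t.get();    // blocks until done; also rethrows any exception from a worker thread
}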

peastman commented 3 years ago

Parallelization at that high level probably wouldn't be practical. There's a lot of code in there that doesn't make any effort to be thread safe. In any case, most of it takes negligible time. Compiling the GPU kernels is usually the only expensive part. Ideally the kernel compiler would use multiple threads automatically, but that may not be the case.

What GPU and platform are you using? On CUDA we cache compiled kernels, and that helps a lot to reduce startup time. Some OpenCL implementations also cache kernels, but not all of them.

tristanic commented 3 years ago

What GPU and platform are you using?

All of them. :) I try to make ISOLDE as accessible as possible, so it defaults to whatever requires the least prior knowledge from the user. On Linux it defaults to CUDA if there's a compatible device and OpenCL otherwise; on the Mac it's OpenCL; on Windows it's OpenCL by default with CUDA as a user-switchable option (since Windows doesn't ship a compiler by default, and I don't want to require people to install Visual Studio). The caching of kernels doesn't really help in this case, since essentially every simulation has a different configuration from the last (the topology is constructed around an arbitrary user-defined selection of atoms from the larger model).

peastman commented 3 years ago

This would be quite challenging to do, and would have a very high risk of creating bugs. There's a lot of code that runs in one thread and isn't designed to be thread safe. That includes code from plugins, because they also have initialize() methods, and asking developers of third party plugins to guarantee their code is thread safe just isn't going to work!

tristanic commented 3 years ago

OK, understood. Seemed worth asking, anyway. In a serious ISOLDE session the user may end up starting and stopping a few hundred short simulations over the course of a working day, so even a 3-4 second startup per sim starts to add up to some serious time!

peastman commented 3 years ago

Here is the best I've come up with. The typical pattern is that in initialize() a class calls compileProgram() to get a ComputeProgram (this is where the actual compilation happens), then creates the kernels from it and sets up their arguments. For example,

https://github.com/openmm/openmm/blob/d8ef57fed6554ec95684e53768188e1f666405c9/platforms/common/src/CommonKernels.cpp#L7371-L7379

In some cases, though, it creates the kernels in initialize(), but waits until execute() to fully initialize them. For example,

https://github.com/openmm/openmm/blob/d8ef57fed6554ec95684e53768188e1f666405c9/platforms/common/src/CommonKernels.cpp#L5272-L5290

We could create an asynchronous version of compileProgram(). It would return a future object from which you could later retrieve the ComputeProgram. A class could choose to use it, requesting the compilation in initialize() and then waiting until execute() to retrieve the compiled program, create the kernels, and initialize their arguments. Nothing would be automatically parallelized, but if at least some classes chose to use the asynchronous mechanism that would get you some overlap in compilation.
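Roughly this shape (a sketch only; the names and signatures are illustrative, not the actual OpenMM API):

#include <future>
#include <string>

// Illustrative stand-in for the compiled-program object.
struct ComputeProgram { /* ... */ };

// Asynchronous variant: kick off compilation and hand back a future.
std::future<ComputeProgram> compileProgramAsync(const std::string& source) {
    return std::async(std::launch::async, [source] {
        ComputeProgram program;
        // ... invoke the OpenCL/CUDA compiler on 'source' here ...
        return program;
    });
}

struct SomeKernel {
    std::future<ComputeProgram> pendingProgram;
    void initialize(const std::string& source) {
        pendingProgram = compileProgramAsync(source);       // request compilation, don't block
    }
    void execute() {
        if (pendingProgram.valid()) {
            ComputeProgram program = pendingProgram.get();   // block only now
            // ... create the kernels from 'program' and set their arguments ...
        }
        // ... launch the kernels ...
    }
};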

This assumes that all compilers are thread safe, and allow you to compile multiple kernels for the same context on different threads at the same time. I don't know whether that's guaranteed by either the OpenCL or CUDA specs. We'd need to check.

tristanic commented 3 years ago

That sounds promising! I've done quite a bit of playing around with the std::async library, and (as long as it's called with the std::launch::async option) it seems remarkably robust against oversubscription of cores. On my MacBook Air (2 physical cores, 4 hyperthreads) I've tried splitting a task across as many as 18 threads and still got a 2.5-fold speed-up over the single-threaded version.

Found this about OpenCL: https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clSetKernelArg.html (courtesy of https://community.amd.com/t5/opencl/kernel-compiling-corruption-when-concurrent-compiling-enabled/m-p/259001#M24606):

All OpenCL API calls are thread-safe except clSetKernelArg. clSetKernelArg is safe to call from any host thread, and is safe to call re-entrantly so long as concurrent calls operate on different cl_kernel objects. However, the behavior of the cl_kernel object is undefined if clSetKernelArg is called from multiple host threads on the same cl_kernel object at the same time.

So in theory it sounds like OpenCL would be fine? Not quite as certain about CUDA, but... https://developer.nvidia.com/blog/boosting-productivity-and-performance-with-the-nvidia-cuda-11-2-c-compiler/

To mitigate the increase in build time arising from these multiple compilation passes, starting with the CUDA 11.2 release, the CUDA C++ compiler supports a new --threads command-line option (-t for short) to spawn separate threads to perform independent compilation passes in parallel. If multiple files are compiled in a single nvcc command, -t compiles the files in parallel. The argument determines the number of independent helper threads that the NVCC compiler spawns to perform independent compilation steps in parallel.

... which of course isn't the same thing as running multiple separate calls to nvcc. But it certainly sounds hopeful.
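If I'm reading that OpenCL wording right, something like the following should be legal: build two programs for the same context on separate threads, with each thread calling clSetKernelArg only on its own cl_kernel (sketch only, all error checking omitted):

#include <CL/cl.h>
#include <future>

// Build a program and prepare one kernel, entirely on the calling thread.
// Each thread gets its own cl_program and cl_kernel, so the clSetKernelArg
// calls never touch the same kernel object.
static cl_kernel buildAndPrepare(cl_context ctx, cl_device_id dev,
                                 const char* source, const char* name, cl_mem arg0) {
    cl_program prog = clCreateProgramWithSource(ctx, 1, &source, nullptr, nullptr);
    clBuildProgram(prog, 1, &dev, "", nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(prog, name, nullptr);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &arg0);
    return kernel;
}

// Usage sketch (sources and buffers are placeholders):
// auto fa = std::async(std::launch::async, buildAndPrepare, ctx, dev, srcA, "kernelA", bufA);
// auto fb = std::async(std::launch::async, buildAndPrepare, ctx, dev, srcB, "kernelB", bufB);
// cl_kernel kernelA = fa.get(), kernelB = fb.get();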

peastman commented 3 years ago

That's good to know. It sounds like this could be a viable approach.

peastman commented 3 years ago

I tried implementing this. It works, but it doesn't seem to help at all. I only moved a few kernel compilations to separate threads, but that ought to be enough to see some difference if it were going to help. I tested with CUDA and OpenCL, on Linux and Mac, with NVIDIA, AMD, and Intel GPUs. I haven't observed any reduction in context creation time in any case.

If you want to try it out, the code is in the async branch at https://github.com/peastman/openmm.

jchodera commented 3 years ago

In a serious ISOLDE session the user may end up starting and stopping a few hundred short simulations over the course of a working day, so even a 3-4 second startup per sim starts to add up to some serious time!

How do these simulations differ? Is it just in a few different restraints that differ from simulation to simulation? Could they be coded as some sort of general force where parameters change? Or could we just recompile only the few kernels that change each time and cache the rest?
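For instance, if the per-simulation differences could be expressed as per-particle parameters of a CustomExternalForce, new values can be pushed into the existing Context without recompiling anything. A sketch using the C++ API (the energy expression and parameter names are just an example):

#include "OpenMM.h"
using namespace OpenMM;

// Sketch: retarget one harmonic position restraint in place.
// 'force' was added to the System with energy
// "k*((x-x0)^2+(y-y0)^2+(z-z0)^2)" and per-particle parameters k, x0, y0, z0.
void retargetRestraint(CustomExternalForce& force, Context& context, int term,
                       int particle, double k, double x0, double y0, double z0) {
    force.setParticleParameters(term, particle, {k, x0, y0, z0});
    force.updateParametersInContext(context);   // cheap update, no kernel recompilation
}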

peastman commented 3 years ago

For CUDA, we already cache the compiled kernels. And NVIDIA's OpenCL compiler automatically caches them.

jchodera commented 3 years ago

I wonder why it's taking ~4s to compile the minimally-modified new kernels for @tristanic, then. Presumably, it's just a few modified restraints. Perhaps these need to be moved to a separate CustomForce? Or will they automatically get coalesced with other larger kernels for speed?

tristanic commented 3 years ago

@peastman Thanks very much for trying. I'm surprised it didn't help any.

@jchodera Here's what happens in ISOLDE when a user starts a simulation:

  1. The user selects one or more atoms and presses "play"
  2. The selection is expanded out to whole residues, then backward and forward along each chain by (typically) three residues from each residue selected in step 1.
  3. The selection is expanded further, to encompass all residues with at least one atom within 5 Å of any atom in the selection from step 1. This becomes the set of mobile atoms in the simulation.
  4. A further selection is made, again selecting all residues approaching within 5 Å of the selection from step 3. This becomes the shell of fixed atoms around the mobile region.

Everything outside of the selections above simply isn't in the simulation. So, except in the special cases of a very small model (where the user just runs the whole thing every time) or of repeated simulations started and stopped from exactly the same selection, the overall topology is different every time. Since the number of atoms and interactions is hard-coded into each kernel, I think that means almost every kernel needs to be recompiled each time.

I wonder why it's taking ~4s to compile the minimally-modified new kernels for @tristanic, then.

Well, 3-4 seconds is the total startup time, and Context initialisation is about 3/4 to 4/5 of that. I do find that OpenCL initialisation is on the order of 1 s faster, but that comes with the downside of being a bit slower in the simulation itself (not a problem for small systems on good GPUs, since the rate becomes graphics-limited anyway, but noticeable on bigger ones). Startup might well be faster on more modern hardware than my Linux workstation (which is still powerful, but definitely reaching the end of its service life)... I'm guessing having the scratch directory on an SSD would be particularly helpful here.



jchodera commented 3 years ago

@tristanic : If you're only simulating solute atoms, the simulation should still be blazingly fast even if you create a System containing all the solute atoms and a simple cutoff-dependent interaction potential. You could then create and maintain a single Context and freeze the inactive atoms in a CustomIntegrator, simply updating per-DOF parameters so that they no longer move (assuming you're only doing dynamics, not minimization).

There may even be a way to update the masses to zero out particles that should not move during the simulation without recompiling all the kernels.

This may not be a good choice for enormous systems, but for most PDB files, I imagine it could work well: The first time you compile the kernels, there's a bit of a delay (or perhaps compile them in the background on a separate thread when you first show the system to your user?) and then keep the Context open until the session is closed or the topology is modified.
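Concretely, the per-DOF freezing could look something like this (a sketch using the C++ API; untested, and it assumes plain velocity-Verlet dynamics with no constraints on the frozen atoms):

#include "OpenMM.h"
#include <vector>
using namespace OpenMM;

// Sketch: velocity-Verlet CustomIntegrator with a per-DOF "mobile" flag.
// Setting mobile = 0 for a particle stops it moving without touching the
// System, so the Context and its compiled kernels can be reused.
CustomIntegrator* makeFreezableIntegrator(double stepSize) {
    CustomIntegrator* integrator = new CustomIntegrator(stepSize);
    integrator->addPerDofVariable("mobile", 1.0);
    integrator->addUpdateContextState();
    integrator->addComputePerDof("v", "v + mobile*0.5*dt*f/m");
    integrator->addComputePerDof("x", "x + mobile*dt*v");
    integrator->addComputePerDof("v", "v + mobile*0.5*dt*f/m");
    return integrator;
}

// When the user changes the selection, freeze everything outside it:
void setMobileSelection(CustomIntegrator& integrator, int numParticles,
                        const std::vector<int>& mobileAtoms) {
    std::vector<Vec3> mobile(numParticles, Vec3(0, 0, 0));
    for (int i : mobileAtoms)
        mobile[i] = Vec3(1, 1, 1);
    integrator.setPerDofVariableByName("mobile", mobile);
}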

tristanic commented 3 years ago

@jchodera A few wrinkles make that approach difficult. Yes, it's true that on a decent GPU everything stays fast up to fairly large sizes (on my Titan Xp around 1,000 residues is where things start to drag a bit). But:

For the above reasons, I think the current approach of on-the-fly generation of new task-specific simulations solves more problems than it causes. I have thought about a few of the ideas you suggest, though (e.g. including a few uninitialised, hidden water molecules in each simulation so the user can add them without an expensive reinitialisation)... definitely some room for exploration there!



peastman commented 3 years ago

Since the number of atoms and interactions is hard-coded into each kernel, I think that means almost every kernel needs to be recompiled each time.

It's not hardcoded in all of them, only in some performance-critical ones. A lot of kernels don't change, so it can use the cached versions of them even when the system is changing.
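The performance-critical ones bake system-dependent constants into the source when it's assembled, along these lines (paraphrasing the pattern in CommonKernels.cpp, not an exact quote), so their cache entries are specific to the system; the rest compile from system-independent source and hit the cache every time:

// Paraphrased: system-size constants become preprocessor defines in the kernel
// source, so a different atom count means different source and a different cache key.
std::map<std::string, std::string> defines;
defines["NUM_ATOMS"] = cc.intToString(cc.getNumAtoms());
defines["PADDED_NUM_ATOMS"] = cc.intToString(cc.getPaddedNumAtoms());
ComputeProgram program = cc.compileProgram(source, defines);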

Well, 3-4 seconds is the total startup time, and Context initialisation is about 3/4 to 4/5 of that.

Have you tried profiling to see what it's doing during that time? Compiling kernels might or might not be the actual bottleneck.

on those little GPUs simulation size becomes much more critical - not just for performance reasons, but because the Mac OpenGL driver tends to fail pretty dramatically when too much memory is used by OpenCL.

Have you tried using the CPU platform? On some computers with low end GPUs, it may have similar performance and it should eliminate the memory problems. Context creation will also be much faster.
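You can request it explicitly when creating the context rather than letting OpenMM pick the fastest platform (sketch using the C++ API; the same call exists in Python):

#include "OpenMM.h"
using namespace OpenMM;

// Sketch: force the CPU platform instead of the default platform selection.
Context* createCpuContext(System& system, Integrator& integrator) {
    Platform& platform = Platform::getPlatformByName("CPU");
    return new Context(system, integrator, platform);
}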

tristanic commented 3 years ago

Have you tried profiling to see what it's doing during that time? Compiling kernels might or might not be the actual bottleneck.

I've only profiled in Python using cProfile to date, so can only go as far as saying that the big time sink is Context.__init__. Afraid I'm still not really familiar with the ins and outs of profiling compiled code... although I suppose it's about time I learned.

Have you tried using the CPU platform? On some computers with low end GPUs, it may have similar performance and it should eliminate the memory problems. Context creation will also be much faster.

Yep - I've tried it. OpenCL performance is still quite a bit faster than the CPU under realistic use scenarios (i.e. with a density map included as a potential based on a Continuous3DFunction). Some quick timings on my MacBook Air for my usual "moderately-sized crystal structure" model, 3io0 (229 residues), in "quick" mode (no implicit solvent, nonbonded forces cut off at 0.9 nm, updating the graphics every 50 time steps):

OpenCL, with map: 170 time steps/s
CPU, with map, map recalculation disabled: 120 time steps/s
CPU, with map, map recalculation enabled: 50 time steps/s
OpenCL, no map: 170 time steps/s
CPU, no map: 195 time steps/s

The map recalculation is specific to crystallographic datasets, and is another reason why using the CPU platform is a bit of a no-go for ISOLDE. It's a rather computationally heavy task, which I've parallelised on the CPUs - this tends to work out quite nicely with the simulation running on the GPU, but leads to heavy contention when both simulation and map calcs are running on the CPUs. Anyway, (perhaps surprisingly?) the Continuous3DFunction map force seems to come at negligible cost on the GPU, but takes a large chunk of the time on the CPU platform.

tristanic commented 3 years ago

Here's a list of the Force objects in a typical ISOLDE simulation. Quite a few extra on top of your typical equilibrium sim...

sh = session.isolde.sim_handler

sys = sh._system

sys.getForces()
Out[3]: 
[<simtk.openmm.openmm.HarmonicBondForce; proxy of <Swig Object of type 'OpenMM::HarmonicBondForce *' at 0x000002844D3E3570> >,
 <simtk.openmm.openmm.HarmonicAngleForce; proxy of <Swig Object of type 'OpenMM::HarmonicAngleForce *' at 0x000002844D3CA540> >,
 <simtk.openmm.openmm.PeriodicTorsionForce; proxy of <Swig Object of type 'OpenMM::PeriodicTorsionForce *' at 0x000002844C7AC930> >,
 <simtk.openmm.openmm.NonbondedForce; proxy of <Swig Object of type 'OpenMM::NonbondedForce *' at 0x000002844C7ACC30> >,
 <simtk.openmm.openmm.CMAPTorsionForce; proxy of <Swig Object of type 'OpenMM::CMAPTorsionForce *' at 0x000002844D3D4150> >,
 <simtk.openmm.openmm.CustomExternalForce; proxy of <Swig Object of type 'OpenMM::CustomExternalForce *' at 0x000002844CB18C60> >,
 <simtk.openmm.openmm.CustomExternalForce; proxy of <Swig Object of type 'OpenMM::CustomExternalForce *' at 0x00000284495C57E0> >,
 <simtk.openmm.openmm.CustomBondForce; proxy of <Swig Object of type 'OpenMM::CustomBondForce *' at 0x000002844C7083C0> >,
 <simtk.openmm.openmm.CustomBondForce; proxy of <Swig Object of type 'OpenMM::CustomBondForce *' at 0x000002844DD8F030> >,
 <simtk.openmm.openmm.CustomTorsionForce; proxy of <Swig Object of type 'OpenMM::CustomTorsionForce *' at 0x000002844DD8F090> >,
 <simtk.openmm.openmm.CustomTorsionForce; proxy of <Swig Object of type 'OpenMM::CustomTorsionForce *' at 0x000002844DD8F210> >,
 <simtk.openmm.openmm.CustomCompoundBondForce; proxy of <Swig Object of type 'OpenMM::CustomCompoundBondForce *' at 0x000002844DD8F2D0> >,
 <simtk.openmm.openmm.CustomGBForce; proxy of <Swig Object of type 'OpenMM::CustomGBForce *' at 0x000002844DD8F2A0> >]

tristanic commented 3 years ago

Did a little profiling with operf using the two attached test cases (system.xml, before.pdb, testsystem.py: complete model; system2.xml, before2.pdb, testsystem2.py: system generated from a partial selection). Deleted ~/.nv/ComputeCache before running (using ChimeraX as the Python interpreter, OpenMM version 7.5.0.dev-94d7225). Used two systems to test how much of an effect the caching has.

operf chimerax-daily --nogui --exit testsystem.py
opreport -l

CPU: Intel Sandy Bridge microarchitecture, speed 3.8e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name               app name                 symbol name
83891    22.1362  libOpenMM.so.7.5         ChimeraX                 /usr/libexec/UCSF-ChimeraX-daily/lib/libOpenMM.so.7.5
56295    14.8545  libnvidia-compiler.so.465.19.01 ChimeraX                 /usr/lib64/libnvidia-compiler.so.465.19.01
53555    14.1315  libnvidia-ptxjitcompiler.so.465.19.01 ChimeraX                 /usr/lib64/libnvidia-ptxjitcompiler.so.465.19.01
34074     8.9911  no-vmlinux               ChimeraX                 /no-vmlinux
15216     4.0150  libc-2.17.so             ChimeraX                 _int_free

operf chimerax-daily --nogui --exit testsystem2.py
opreport -l

CPU: Intel Sandy Bridge microarchitecture, speed 3.8e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name               app name                 symbol name
40386    18.8460  libnvidia-compiler.so.465.19.01 ChimeraX                 /usr/lib64/libnvidia-compiler.so.465.19.01
38237    17.8432  libnvidia-ptxjitcompiler.so.465.19.01 ChimeraX                 /usr/lib64/libnvidia-ptxjitcompiler.so.465.19.01
20449     9.5425  libOpenMM.so.7.5         ChimeraX                 /usr/libexec/UCSF-ChimeraX-daily/lib/libOpenMM.so.7.5
16384     7.6455  no-vmlinux               ChimeraX                 /no-vmlinux

Still pretty coarse profiling, but it does indicate that the majority of startup time is indeed going into the JIT compiler. I guess the good news there is that this is a problem that will reduce over time, with faster processors and more common use of SSDs - I'm using an SSD here, but my CPU (Xeon E5-2687W) is well behind the cutting edge and definitely approaching its retirement age.

testcase.tar.gz

jchodera commented 3 years ago

Are there any compiler options to muck with that might eliminate some expensive compiler steps at the cost of producing less optimized code?

dmclark17 commented 3 years ago

CUDA 11.5 enables partial concurrency for NVRTC compilation and PTX compilation. It may be interesting to retest the async branch. @peastman, how are you measuring the context creation time?

dmclark17 commented 1 year ago

CUDA 12.1 added the --split-compile option which allows for some compilation steps to be run in parallel: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#split-compile-number-split-compile. The option should work with NVRTC as well.

I don't think this will benefit modules like the nonbonded kernel since there is only one non-inlined function. However, modules like findInteractingBlocks may see a benefit.

peastman commented 1 year ago

It sounds like this doesn't require any changes in our code beyond adding the flag? And it would still only compile one module at a time, but it would be faster?

The documentation lists that option as experimental. I assume that means we probably shouldn't use it in production quite yet?

dmclark17 commented 1 year ago

Yep, this would still be one module at a time. If the module can be split into multiple compilation units, it should be faster. The only code change would be adding the flag.
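If NVRTC accepts the flag in the same spelling as nvcc (the docs suggest it should), it's just one more entry in the option list already passed to nvrtcCompileProgram. Sketch:

#include <nvrtc.h>

// Sketch: pass --split-compile alongside the existing compile options.
nvrtcResult compileModule(nvrtcProgram prog) {
    const char* opts[] = {
        "--use_fast_math",
        "--split-compile=4",   // let independent compilation steps run in parallel
    };
    return nvrtcCompileProgram(prog, 2, opts);
}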

I'll run some tests with the feature enabled, but we can also hold off until it is no longer marked experimental.