openmm / openmm

OpenMM is a toolkit for molecular simulation using high performance GPU code.

is CUDA just messing with us? #3342

Open lukeczapla opened 2 years ago

lukeczapla commented 2 years ago

Hi guys,

So I was running my Python scripts on MSKCC's LSF system and accidentally left the platform set to OpenCL when I submitted to their generic NVIDIA nodes (usually GTX 1080, I believe), and guess what? OpenCL (not CUDA) is giving me the better benchmark. I'd run these scripts by both John Chodera and one of the LSF experts over here and nobody found any flaws in them. Have any of you tried this for 'ye old generic MD run', the kind of thing you'd run in mixed precision mode? I have both OpenCL and CUDA flags for this in the script already, so it's an apples-to-apples comparison. I thought LSF was messing with me and giving me bad numbers like 46 ns/day and all kinds of random stuff, but with the platform set to OpenCL I'm consistently getting 65 ns/day on a generic "lx" node they have over there, where CUDA was chugging along at maybe 40 ns/day with a lot of random noise in the benchmarks too.
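For reference, the platform part of the script boils down to something like this (a minimal sketch; mixed precision on both sides so it stays apples to apples, and `topology`/`system`/`integrator` are whatever the script builds):

```python
from openmm import Platform
from openmm.app import Simulation

platform = Platform.getPlatformByName('OpenCL')   # or 'CUDA'
properties = {'Precision': 'mixed'}               # same precision on both platforms
simulation = Simulation(topology, system, integrator, platform, properties)
```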

So is this CUDA thing all just a gimmick, or does anybody else have benchmarks showing the opposite of what I'm seeing, i.e., that you should always run CUDA on NVIDIA hardware (even when the lower benchmark is dragging you down)? It's pretty much the same language apart from a couple of fancy flags, so maybe OpenMM just likes OpenCL better? That's what my data currently suggests, and I only "messed up" when submitting two new jobs to stumble onto this great fact that's kicking me up a notch.

I had read the same thing in independent articles comparing various programs, so I'm not totally surprised either.

lukeczapla commented 2 years ago

Also great news for all of us here too poor for an NVIDIA GPU: I got about 21 ns/day on a cheap AMD RX 580 in comparison to the GTX 1080 or RTX 2080 (or whatever better puppy is over there), and I see identical output in the "top" command here and on the LSF regardless of whether it's CUDA or OpenCL.

lukeczapla commented 2 years ago

To be exactly factual, it is a GeForce RTX 2080 Ti.

lukeczapla commented 2 years ago

I will just come up with little tag lines for this... like OpenMM + OpenCL = BOOM

It's actually probably important, but I guess you don't gain any points for calling an AI/ML group a bunch of idiots because they hadn't even heard of the OpenCL compiler, while you guys out here have been doing this since the copyright dates in my .h files (around 2009), back when we were trying it out at Uppsala University too. So there could definitely be reverse cases, but is it really worth the effort to optimize for CUDA if OpenCL starts out with a little head start?

peastman commented 2 years ago

In most benchmarks, the CUDA platform is about 30% faster than OpenCL when running on the same GPU. I'd be very curious to know what calculations you're doing where it's the other way around!
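If you can share the inputs, the cleanest check is to time the identical System on both platforms in one script, something along these lines (a sketch; `topology`, `system`, `integrator`, and `positions` are whatever your setup builds):

```python
import time
from openmm import Platform, XmlSerializer
from openmm.app import Simulation

integrator_xml = XmlSerializer.serialize(integrator)   # fresh integrator copy per platform
for name in ('CUDA', 'OpenCL'):
    platform = Platform.getPlatformByName(name)
    sim = Simulation(topology, system, XmlSerializer.deserialize(integrator_xml),
                     platform, {'Precision': 'mixed'})
    sim.context.setPositions(positions)
    sim.step(1000)                                     # warm-up: kernel compilation
    t0 = time.time()
    sim.step(5000)
    elapsed = time.time() - t0
    print(name, 5000 * 2e-6 * 86400 / elapsed, 'ns/day')  # assumes a 2 fs timestep
```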

lukeczapla commented 2 years ago

Yeah, please take a look. It's the generic Langevin integrator with standard PME w/ force switching and all the other features of a standard MD simulation, not some GB-based thing that others suggested might do better on OpenCL. Here is the link if you want to see the code; it's adapted from Wonpil Im's original scripts to integrate everything new into the existing simple script they had: https://github.com/lukeczapla/kinases_r_us

I have added an RST file (an XML file with things like box vectors, positions, and velocities), gzipped up to save space, so anyone can run it through the .csh script with openmm_run.py by rewinding cnt back to 1 in run_lilac.csh and starting from "step 0".
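(For anyone restarting from it: the RST file is just a serialized OpenMM State, so after the gzip it loads with something like this sketch, where 'step0.rst.gz' is a stand-in for the actual filename.)

```python
import gzip
from openmm import XmlSerializer

# Load the gzipped State XML (positions, velocities, box vectors) and
# push it into an existing Simulation to restart from "step 0".
with gzip.open('step0.rst.gz', 'rt') as f:   # stand-in filename
    state = XmlSerializer.deserialize(f.read())
simulation.context.setState(state)
```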

So yup, kinases_r_us, a silly name, but I was able to really have fun with the scripts. It supports prmtop/inpcrd from AMBER, so you can build either force field with PSF/prm/pdb or prmtop/inpcrd; it's all just preprocessing the inputs beforehand, and nothing slows down the "run" itself besides the loop collecting aMD data every 10 steps (which seems to be what the simulation.py in the miniconda3 folder is doing as well). Feel free to take all the "junk" out, because I think it stands up with the original systems in many ways. The benchmark is a short "regular MD" 20 ns 'equilibration' with nothing fancy (i.e., post-constraints). It's not truly equilibration; it's just to estimate ⟨U_dihedral⟩, the mean total dihedral energy, for determining E and alpha for an aMD run. For true equilibration of any complex system it's hard to do anything but throw away a chunk at the beginning, since the start (usually a crystal structure) is very different from real solution conditions.


I'd been looking for some competing code, of course, to benchmark OpenMM on the same systems, but dropping the acceleration, dropping CVForce tying its force group to the integrator, and dropping the CV measurements for histogram reweighting really didn't get me much back: maybe 19.5 ns/day vs. 21 ns/day. And it's really the 19.5 ns/day at home that compares to these faster (65, 65, 64, 62 ns/day) benchmarks, because those are running the whole package unedited. Nobody had any idea what was going on with the lower and fluctuating CUDA benchmarks on their systems; I could never break about 50 ns/day, and only got there on a lucky machine like a Tesla V100.

If you find a bug or bad pattern that left me far behind in the dust, let me know. (Putting CUDA back is super simple: the platform = OpenCL line in run_lilac.csh just gets replaced with the platform = CUDA line.)

lukeczapla commented 2 years ago

> In most benchmarks, the CUDA platform is about 30% faster than OpenCL when running on the same GPU. I'd be very curious to know what calculations you're doing where it's the other way around!

My system here would be:

- Abl1 kinase with CHARMM36m (w/ CMAP) protein parameters, TIP3P water, standard (CHARMM22/27) K+ and Cl- ions
- PME / LJ 12-6 with 10 Å to 12 Å force switching (pretty standard for these)
- Monte Carlo barostat at 1 atm (average box size about (78.6 Å)^3)
- 2 fs timestep, ~49,000 particles

Then some extras that locally dragged the generic Langevin integrator MD down from 21 ns/day to 19.5 ns/day:

- LangevinVRORVAMDForceGroupIntegrator (a single CustomIntegrator, adapted by looking at John Chodera's more costly openmmtools version and AMDForceGroupIntegrator)
- CVForce (excluded from the integration force groups but used to collect the aMD collective variables for histogram reweighting)
- Python lists for all the colvars and deltaV (from the aMD integrator), stored every 10 steps but output every 5000 steps along with the StateGroupReporter data
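In bare OpenMM calls, the nonbonded/barostat part of that setup looks roughly like this (a sketch; the real scripts build everything from the CHARMM files, the 300 K temperature is an assumption, and if I remember right OpenMM's built-in switch is a potential switch, while the CHARMM-style force switch in Wonpil Im's scripts goes through a CustomNonbondedForce):

```python
from openmm import MonteCarloBarostat, LangevinIntegrator, NonbondedForce
from openmm.unit import atmosphere, femtoseconds, kelvin, nanometer, picosecond

for force in system.getForces():                     # 'system' built from the CHARMM files
    if isinstance(force, NonbondedForce):
        force.setNonbondedMethod(NonbondedForce.PME)
        force.setCutoffDistance(1.2 * nanometer)     # 12 A cutoff
        force.setUseSwitchingFunction(True)
        force.setSwitchingDistance(1.0 * nanometer)  # switching starts at 10 A

system.addForce(MonteCarloBarostat(1 * atmosphere, 300 * kelvin))  # assumed temperature
integrator = LangevinIntegrator(300 * kelvin, 1 / picosecond, 2 * femtoseconds)
```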

So I would say this is a valid benchmark, but it could be converted into a simpler form, maybe an XML file that stores the integrator, forces, and system in one? Is there a line to output that in Python?
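(Digging a bit, I don't see a single combined file in the standard API, but XmlSerializer seems to handle each piece separately, so maybe the whole benchmark fits in three small XMLs, something like this sketch:)

```python
from openmm import XmlSerializer

with open('system.xml', 'w') as f:
    f.write(XmlSerializer.serialize(system))         # particles + all forces
with open('integrator.xml', 'w') as f:
    f.write(XmlSerializer.serialize(integrator))     # CustomIntegrator included
simulation.saveState('state.xml')                    # positions, velocities, box
```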

lukeczapla commented 2 years ago

BTW, on CPU usage: yes, I allocate a bunch of hyperthreaded cores (4, to be exact) with OPENMM_CPU_THREADS=4, but CPU usage of python in the "top" command is only 10% of a single core without my mods and 20% with them (probably the list-append collection every 10 steps in Python; a C-style buffer with just a double pointer would technically speed it up, or even better, do all of it on the GPU and never stop, which is likely how NAMD can do it every 1-2 steps). Those values show up both on my Linux desktop (Ubuntu 20.04 with an AMD RX 580 GPU) and in top on the LSF nodes with the fancier GPUs.
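One cheap experiment to take the Python list appends out of the picture: preallocate a numpy block and flush it once per report interval, something like this (a sketch; `cv_force` is the CVForce and 'deltaV' is a stand-in for whatever the integrator calls its boost global):

```python
import numpy as np

block = np.empty((500, 2))                        # 5000 steps at one sample per 10
for i in range(500):
    simulation.step(10)
    block[i, 0] = cv_force.getCollectiveVariableValues(simulation.context)[0]
    block[i, 1] = integrator.getGlobalVariableByName('deltaV')  # stand-in name
np.save('block.npy', block)                       # one flush per 5000 steps
```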

lukeczapla commented 2 years ago

@peastman if nobody really has a competing benchmark or has tried this out yet, I'd definitely appreciate any advice on starting up again with the C++ interface and the development repo. Saving a bunch of collective variables on the GPU doesn't seem like a particularly difficult coding problem, but without anyone to point me at something it could be a nightmare time vampire. I'd really like to see what happens when you save it every single timestep, because that's how NAMD did it; I ran that for 200 ns and still got halfway decent PMFs with less noise. It's counterintuitive because yes, the snapshot barely moves in 2 or 4 fs, but the deltaV (the difference between the real U_dihedral,total and the boosted U*_dihedral,total) can move quite a lot. John was skeptical, but it seems he's never run an aMD simulation before.
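For the every-step version, reading a CustomIntegrator global at least avoids the full State fetch on the Python side (a sketch; 'deltaV' again a stand-in name, and truly keeping it on the GPU would need the C++/plugin work I'm asking about):

```python
import numpy as np

nsteps = 5000
trace = np.empty(nsteps)                  # one deltaV sample per timestep
for i in range(nsteps):
    simulation.step(1)
    trace[i] = integrator.getGlobalVariableByName('deltaV')   # stand-in name
```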

(200 ns was "fairly decent" back in 2012-2013, but now if I run for 2 microseconds at the same 1 point per 10 steps, I'll explore a lot more and have many more points to factor into this complex -kT ln(sum(exp(beta*deltaV))) for the PMF by bin; more points means it's more likely you can even present 2D PMFs.)
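(Spelled out, the per-bin reweighting I mean is something like this numpy sketch, with x the binned colvar samples and dV the matching boost energies:)

```python
import numpy as np

def pmf_1d(x, dV, beta, nbins=50):
    """Per-bin aMD reweighting: PMF(bin) = -kT * ln(sum(exp(beta * dV)))."""
    edges = np.linspace(x.min(), x.max(), nbins + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, nbins - 1)   # bin index per sample
    pmf = np.full(nbins, np.nan)
    for b in range(nbins):
        w = np.exp(beta * dV[idx == b])
        if w.size:
            pmf[b] = -np.log(w.sum()) / beta
    return pmf - np.nanmin(pmf)           # shift so the lowest bin is zero
```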

lukeczapla commented 2 years ago

Our LSF HPC expert says that with these new cards with 16 GB of memory or more, hardly anyone he sees on his console ever really uses it all, so even the AMD RX 580 with 8 GB might easily push a million+ particles while reporting data and flushing a buffer every 5000 integration steps... seems feasible, right?