Closed. jchodera closed this issue 7 years ago.
I installed OpenMM on Titan with Python 2.7, and it was able to recognize CUDA:
from simtk import openmm
print([openmm.Platform.getPlatform(index).getName() for index in range(openmm.Platform.getNumPlatforms())])
['Reference', 'CUDA', 'OpenCL']
However, when I run the benchmark.py file I get the following error:
Deserializing simulation...
Traceback (most recent call last):
File "benchmark.py", line 38, in
It sounds like you installed OpenMM 6.3.1 via conda successfully, then!
Let me see if I can update the serialized version for OpenMM 6.3.1. I forgot that this wasn't backwards-compatible. Will do this now.
Also, I may be able to get the latest OpenMM conda package built for cuda 7.5, but more on that soon.
@jdakka : I've updated the serialized XML files in the PR (https://github.com/radical-collaboration/MSKCC/pull/2) to work with OpenMM 6.3.1. Give those a try.
Looks like it is running the benchmark OK interactively. I submitted a PBS script using 8 nodes for 46648 atoms.
> Looks like it is running the benchmark OK interactively.
Great!
> I submitted a PBS script using 8 nodes for 46648 atoms.
While OpenMM does support splitting a single system across multiple GPUs, it does so very inefficiently. Our use case is much closer to running N independent (or weakly-coupled) simulations on N GPUs, so if your test simply ran the same benchmark on each GPU, that should be close to what we want to estimate overall throughput. You should in principle only need to request one thread-slot per GPU, though we can use the other thread-slots available on each node for online analysis and updating of the dynamic workload balance in the future.
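Running N independent replicas on N GPUs usually comes down to giving each task its own CUDA_VISIBLE_DEVICES, so each process sees exactly one GPU. A minimal sketch of such a launcher helper; the function name and arguments are illustrative, not part of OpenMM or benchmark.py:

```python
import os

def replica_env(task_index, gpus_per_node):
    """Return an environment for one replica, pinned to a single GPU.

    Hypothetical helper: apart from CUDA_VISIBLE_DEVICES itself, nothing
    here is part of OpenMM or the benchmark script.
    """
    env = dict(os.environ)
    # Each replica sees exactly one GPU; within the process, OpenMM's
    # CUDA platform then uses device 0 of the visible set.
    env["CUDA_VISIBLE_DEVICES"] = str(task_index % gpus_per_node)
    return env

# Example: four replicas on a node with four GPUs
envs = [replica_env(i, gpus_per_node=4) for i in range(4)]
print([e["CUDA_VISIBLE_DEVICES"] for e in envs])  # ['0', '1', '2', '3']
```

Each returned dict would then be passed to whatever spawns the benchmark process (e.g. subprocess or the batch launcher) for that replica.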
Tagging in @pgrinaway here too.
I'm also still working on building a more recent OpenMM conda package for CUDA 7.5, but no luck getting this to work yet.
@jchodera: quick remark: in the README you specified that the simulation finished rather quickly:
completed 5000 steps in 26.521 s : performance is 32.578 ns/day
I’ve been running this benchmark for 45 minutes and it’s still stuck on the benchmarking step. I noticed that it uses the “Reference” platform instead of CUDA like you had in the example. Is there a way to have the integrator point to the “right” platform? I assumed it would have picked up CUDA, since it is available, as I showed previously:
print([openmm.Platform.getPlatform(index).getName() for index in range(openmm.Platform.getNumPlatforms())])
['Reference', 'CUDA', 'OpenCL']
The Reference platform is a slow single-threaded double-precision implementation, and would take hours (or days!) to complete the benchmark. We need to get either CUDA (fastest) or OpenCL to run here.
It normally tries CUDA first, then OpenCL, then CPU, then Reference, so I'm not quite sure what the problem is if you find all are available. Can you get the benchmark to run interactively on a node with a GPU where it lists the CUDA platform as available? It should only take a minute or two to run if it's using CUDA.
I think it must be that some kernels are not available for the CUDA platform for some reason, though I'm not sure why. I'll add a few more debug lines to see if we can identify what is going on.
I suspect this may indicate we need to install OpenMM 7.1.1 from source, however.
The info I was referring to is from the interactive session. I added in the platforms as the first print statement. I’m curious to see how OpenMM 7.1.1 works. Let me check both versions on another cluster.
jdakka@titan-batch6:~/mskcc/MSKCC/abl-imatinib-benchmark> python benchmark.py
['Reference', 'CUDA', 'OpenCL']
Deserializing simulation...
System contains 46648 atoms.
Using platform "Reference".
So I tested the script against Xstream. They have OpenMM 6.3.1 as a module compiled against CUDA 7.0. The benchmark was able to correctly latch onto CUDA.
[xs-jdakka@xs-0005 ~/mskcc/MSKCC/abl-imatinib-benchmark]$ python benchmark.py
['Reference', 'CPU', 'CUDA', 'OpenCL']
Deserializing simulation...
System contains 46648 atoms.
Using platform "CUDA".
Initial potential energy is -141208.484 kcal/mol
Warming up integrator to trigger kernel compilation...
Benchmarking...
I tested OpenMM 7.1.1 through conda installation on Xstream and it only references CUDA if CUDA 8.0 is loaded.
That matches what we expect. The CUDA platform unfortunately needs to be linked against a specific version of CUDA; in this case, the 7.1.x release is built for CUDA 8.0, which is the current stable release of CUDA (released 5 Apr 2016, over a year ago).
I've been fighting with our docker build system to try to compile a CUDA 7.5 conda build for you to try on Titan, but I haven't had any luck so far.
You should at least be able to get the OpenMM 7.1.1 conda package to run the OpenCL platform on Titan, provided OpenCL libraries are installed; this doesn't require OpenMM to be linked against a particular CUDA version. It's ~25% slower, but not unusably slow. I'm not sure why it fails to run your system, however.
You can try to force a particular platform by changing
context = openmm.Context(system, integrator)
to
# Try to force the OpenCL platform
platform = openmm.Platform.getPlatformByName('OpenCL')
context = openmm.Context(system, integrator, platform)
@jdakka : I think I've managed to solve the issues with building a conda version of the latest (git head) OpenMM against CUDA 7.5. Hopefully will get it posted in the next few hours.
@jdakka : Success! (I hope!)
Give this a try and see if you find the CUDA platform is usable on systems with CUDA 7.5:
conda install --yes -c omnia/label/cuda75 openmm==7.2.0
If so, I can also update the benchmark input files.
If you need to forcibly remove the conda-installed openmm, you can use
# Remove openmm
conda remove --yes openmm
# Clean the package cache
# You might have to omit the `s` from `-tipsy` if you don't have `conda-build` installed
conda clean -tipsy
I tried again, but no luck: I specified the CUDA platform in the code and it spits out this error (I've also included the module list and conda list in case you see anything that is missing):
jdakka@titan-batch5:~/mskcc/MSKCC/abl-imatinib-benchmark> python benchmark.py
['Reference', 'CUDA', 'OpenCL']
Deserializing simulation...
Traceback (most recent call last):
File "benchmark.py", line 43, in
#
blas 1.1 openblas conda-forge
ca-certificates 2017.4.17 0 conda-forge
certifi 2017.4.17 py27_0 conda-forge
fftw3f 3.3.4 2 omnia
libgfortran 3.0.0 1
ncurses 5.9 10 conda-forge
numpy 1.12.1 py27_blas_openblas_200 [blas_openblas] conda-forge
openblas 0.2.19 2 conda-forge
openmm 7.2.0 py27_0 omnia/label/cuda75
openssl 1.0.2k 0 conda-forge
pip 9.0.1 py27_0 conda-forge
python 2.7.13 1 conda-forge
readline 6.2 0 conda-forge
setuptools 33.1.1 py27_0 conda-forge
sqlite 3.13.0 1 conda-forge
tk 8.5.19 1 conda-forge
wheel 0.29.0 py27_0 conda-forge
zlib 1.2.11 0 conda-forge
Well, the good news is that it seems we've compiled against the correct CUDA 7.5 libraries since the CUDA platform is available, but the bad news is that it is not launching in a way that provides access to a GPU.
In an interactive session, are you able to run nvidia-smi and see that the GPU resource you requested appears? You might also check to see if CUDA_VISIBLE_DEVICES is set to something.
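Both checks can also be scripted so they are easy to rerun inside a batch job. A small sketch, assuming only that nvidia-smi may or may not be on the node's PATH:

```python
import os
import subprocess

def gpu_visibility():
    """Report CUDA_VISIBLE_DEVICES and any GPUs that nvidia-smi can see."""
    info = {"CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES")}
    try:
        # `nvidia-smi -L` lists one line per visible GPU.
        out = subprocess.run(["nvidia-smi", "-L"],
                             capture_output=True, text=True)
        info["gpus"] = out.stdout.strip().splitlines()
    except FileNotFoundError:
        # No CUDA driver tools on this node's PATH.
        info["gpus"] = None
    return info

print(gpu_visibility())
```

If CUDA_VISIBLE_DEVICES comes back unset (None) or the GPU list is empty inside the job, the launch configuration rather than OpenMM is the likely culprit.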
We'll need @pgrinaway's help to debug from here, I imagine. @pgrinaway: Have you managed to get Titan access too?
I don't think @pgrinaway has Titan access.
@pgrinaway: Here is a starting point: https://www.olcf.ornl.gov/kb_articles/user-account-requests/
Please use CSC230 as project ID. Let me or @jdakka know if there is a problem.
The interactive nodes I requested are compute nodes which have GPUs yet it doesn’t appear to recognize the GPU. Trying to run other CUDA scripts to see where the issue is.
> I don't think @pgrinaway has Titan access.
That's correct. I was originally relying on access through SiTx's proposal, which is still in review (they told me it would be about a month before I received a token). There's not much I can do about that one yet.
> Please use CSC230 as project ID. Let me or @jdakka know if there is a problem.
Ok. Is it permissible to have multiple accounts for the same person? I will be doing calculations under another account as well.
Short answer: yes, you can have multiple accounts for the same person.
I'm not 100% sure how Titan/OLCF binds accounts to allocations/projects. On XSEDE you have one account (bound to you), and that account can be used with N allocations/projects. At OLCF, until last year, you would get an account for each project/allocation you were on, but either way you were allowed to join more than one project.
Hope this helps.
@jchodera I figured out the issue but OpenMM 7.2/cuda75 won't install with the command you provided. I was able to install OpenMM 6.3 and OpenMM 7.1 but during execution it latched to the CPU instead, even though it recognized CUDA as well.
PackageNotFoundError: Package not found: Conda could not find '
@jdakka @jchodera Maybe it is better to take this problem to the OpenMM mailing list, rather than have @jchodera troubleshoot?
No need, @pgrinaway and I are the right people to support this, though I was stuck in a meeting all day. Give me a moment to find the issue.
Here's the syntax to use:
conda remove --yes openmm
conda clean -plti --yes
conda install --yes -c omnia/label/cuda75 openmm
If this doesn't install OpenMM 7.2 from the cuda75 label, let me know what it prints as output.
Thanks so much for being our hands and eyes in working through the OpenMM benchmarking issues!
@jchodera it installs the package, but I'm not seeing the correct version that you have, nor the cuda75 label...
conda install -c omnia/label/cuda75 openmm
Fetching package metadata ...............
Solving package specifications: .
Package plan for installation in environment /ccs/proj/csc230/mskcc/miniconda/envs/venv:
The following NEW packages will be INSTALLED:
openmm: 7.1.1-py27_0 omnia
I think I know what happened: the package must have been routed to the wrong label and then overwritten by our nightly dev builds. Let me fix that; it will take about an hour. Apologies again!
OK, try this:
conda remove --yes openmm
conda clean -plti --yes
conda install -c omnia/label/dev --yes openmm-cuda75
Success! Quite a few environment-setup hurdles, but it's working. (I'll write a formal set of instructions in the README for Titan specifically.)
(venv) jdakka@titan-batch8:/lustre/atlas/proj-shared/csc230/mskcc/MSKCC/abl-imatinib-benchmark> aprun -n1 python benchmark.py
['Reference', 'CPU', 'CUDA', 'OpenCL']
Deserializing simulation...
System contains 46648 atoms.
Using platform "CUDA".
Initial potential energy is -141208.481 kcal/mol
Warming up integrator to trigger kernel compilation...
Benchmarking...
completed 5000 steps in 22.567 s : performance is 38.286 ns/day
Final potential energy is -141316.976 kcal/mol
Application 14507233 resources: utime ~49s, stime ~6s, Rss ~350156, inblocks ~315692, outblocks ~69481
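For reference, the reported ns/day follows directly from the step count, wall-clock time, and the integrator timestep. A quick check, assuming a 2 fs timestep (an assumption, but one that reproduces the number above):

```python
# Derive the benchmark's reported throughput from its raw numbers.
steps = 5000
timestep_fs = 2.0        # assumed integrator timestep (fs)
wall_s = 22.567          # wall-clock time reported by the benchmark

simulated_ns = steps * timestep_fs * 1e-6   # fs -> ns: 0.01 ns simulated
ns_per_day = simulated_ns * 86400.0 / wall_s
print(round(ns_per_day, 3))  # 38.286, matching the benchmark output
```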
Huzzah!
If we want to benchmark the systems mentioned in the NAMD issue here too, I can add those as well.
@jchodera -- you're my hero! truly impressed.
@jdakka : I'd give OpenMM 6.3.1 a try first, since I believe that is built against CUDA 7.5. After you've installed miniconda Python, you can
If you run our benchmark script, you should see it mention that it is using the CUDA platform. If it says OpenCL or CPU, it's not able to use your CUDA libraries for some reason. You can check which platforms are available with the platform-listing snippet shown earlier (my mac doesn't have CUDA available).
If you need to build OpenMM from source for CUDA 7.5, you should be able to follow the instructions here on compiling OpenMM from source using CUDA 7.5 installed on Titan. Be sure to pay attention to the dependencies.
It's best to install miniconda Python first anyway, since we can use that to easily install other dependencies for our scripts if needed. (We've tried to minimize dependencies in this initial benchmark script, but future elaborations will require more conda-installable dependencies.)
I've tried to save you some pain by building a conda-installable OpenMM built against CUDA 7.5, but no luck so far.