Open wiederm opened 1 year ago
In your code AFAIK you are using the CPU implementation, is this intended?
It's not intended to use the CPU implementation... I thought that using `implementation="nnpops"` defaults to the GPU (that's how nnpops is invoked in openmm-ml when calling `createSystem`: https://github.com/openmm/openmm-ml/blob/c3d8c28eb92bf5c4b16efb81ad7a44b707fc5907/openmmml/models/anipotential.py#L89). If I run the script on my machine, the GPU is used, but I get results similar to @wiederm's (`torchani` and `nnpops` being equally fast).
Here is a minimal example, similar to the one provided by @wiederm:
```python
import sys

from openmm import LangevinIntegrator, unit, Platform
from openmm.app import Simulation, StateDataReporter
from openmmml import MLPotential
from openmmtools.testsystems import WaterBox

# constants which might be modified by the user
step = 1000

waterbox = WaterBox(box_edge=15 * unit.angstrom)
nnp = MLPotential("ani2x")
platform = Platform.getPlatformByName("CUDA")
prop = dict(CudaPrecision="mixed")

for implementation in ("nnpops", "torchani"):
    print(f"Implementation: {implementation}")
    ml_system = nnp.createSystem(waterbox.topology, implementation=implementation)
    simulation = Simulation(
        waterbox.topology,
        ml_system,
        LangevinIntegrator(300 * unit.kelvin, 1 / unit.picosecond, 1 * unit.femtosecond),
        platform,
        prop,
    )
    simulation.context.setPositions(waterbox.positions)

    # Production
    if step > 0:
        print("\nMD run: %s steps" % step)
        simulation.reporters.append(
            StateDataReporter(
                sys.stdout,
                reportInterval=100,
                step=True,
                time=True,
                potentialEnergy=True,
                speed=True,
                separator="\t",
            )
        )
        simulation.step(step)
```
On my GPU, an RTX 2080 Ti, I get this:
Implementation: nnpops
MD run: 1000 steps
#"Step" "Time (ps)" "Potential Energy (kJ/mole)" "Speed (ns/day)"
100 0.10000000000000007 -20461978.001629103 0
200 0.20000000000000015 -20462133.848855495 6.8
300 0.3000000000000002 -20462153.789688706 6.79
400 0.4000000000000003 -20462202.823631693 6.79
500 0.5000000000000003 -20462257.760451913 6.79
600 0.6000000000000004 -20462329.421256337 6.79
700 0.7000000000000005 -20462362.9969222 6.8
800 0.8000000000000006 -20462488.402703974 6.8
900 0.9000000000000007 -20462532.231097963 6.8
1000 1.0000000000000007 -20462481.48763666 6.8
Implementation: torchani
MD run: 1000 steps
#"Step" "Time (ps)" "Potential Energy (kJ/mole)" "Speed (ns/day)"
100 0.10000000000000007 -20456285.324413814 0
200 0.20000000000000015 -20451616.878087416 2.48
300 0.3000000000000002 -20445519.385244645 2.49
400 0.4000000000000003 -20438851.384950936 2.19
500 0.5000000000000003 -20431004.40918439 2.13
600 0.6000000000000004 -20426584.870540198 2.18
700 0.7000000000000005 -20415840.214279402 2.22
800 0.8000000000000006 -20411478.48251822 2.24
900 0.9000000000000007 -20409822.772401713 2.26
1000 1.0000000000000007 -20402172.29296462 2.27
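For reference, taking the final reported speeds from the two runs above, the relative speedup works out to roughly 3x on this card:

```python
# Final reported speeds (ns/day) from the RTX 2080 Ti runs above
nnpops_speed = 6.8
torchani_speed = 2.27

speedup = nnpops_speed / torchani_speed
print(f"NNPOps speedup over TorchANI: {speedup:.1f}x")
```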
Note, however, that the GPU utilization I am seeing for the torchani implementation is low (under 30%), whereas NNPOps is using 100%. Which GPU are you running on?
I'm using an RTX 2060. If I use your script, I get this output:
Warning on use of the timeseries module: If the inherent timescales of the system are long compared to those being analyzed, this statistical inefficiency may be an underestimate. The estimate presumes the use of many statistically independent samples. Tests should be performed to assess whether this condition is satisfied. Be cautious in the interpretation of the data.
Implementation: nnpops
/scratch/data/johannes/miniconda3/envs/openmmml-test/lib/python3.10/site-packages/torchani/__init__.py:55: UserWarning: Dependency not satisfied, torchani.ase will not be available
warnings.warn("Dependency not satisfied, torchani.ase will not be available")
Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.
/scratch/data/johannes/miniconda3/envs/openmmml-test/lib/python3.10/site-packages/torchani/resources/
MD run: 1000 steps
#"Step" "Time (ps)" "Potential Energy (kJ/mole)" "Speed (ns/day)"
100 0.10000000000000007 -20461915.171353746 0
200 0.20000000000000015 -20462175.772429183 3.8
300 0.3000000000000002 -20462196.65597004 3.77
400 0.4000000000000003 -20462139.43374104 3.78
500 0.5000000000000003 -20462263.351597134 3.79
600 0.6000000000000004 -20462383.616304602 3.79
700 0.7000000000000005 -20462338.201707978 3.79
800 0.8000000000000006 -20462476.36408945 3.79
900 0.9000000000000007 -20462493.882426914 3.79
1000 1.0000000000000007 -20462619.096036546 3.8
Implementation: torchani
/scratch/data/johannes/miniconda3/envs/openmmml-test/lib/python3.10/site-packages/torchani/resources/
MD run: 1000 steps
[W BinaryOps.cpp:594] Warning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (function operator())
#"Step" "Time (ps)" "Potential Energy (kJ/mole)" "Speed (ns/day)"
100 0.10000000000000007 -20461319.38262451 0
200 0.20000000000000015 -20461412.834623236 3.64
300 0.3000000000000002 -20461563.888187505 3.62
400 0.4000000000000003 -20461594.997226883 3.65
500 0.5000000000000003 -20461698.649372432 3.68
600 0.6000000000000004 -20461768.927100796 3.68
700 0.7000000000000005 -20461759.561995145 3.7
800 0.8000000000000006 -20462005.177399218 3.7
900 0.9000000000000007 -20462068.944122538 3.71
1000 1.0000000000000007 -20462172.986246087 3.7
For `nnpops` I see a GPU utilization of 100%, and for `torchani` around 30%. These are the packages in my environment:
cudatoolkit 11.4.2 h7a5bcfd_11 conda-forge
nnpops 0.3 cuda112py310h8b99da5_1 conda-forge
openmm 8.0.0 py310h5728c26_0 conda-forge
openmm-ml 1.0 pyhd8ed1ab_0 conda-forge
openmm-torch 1.0 cuda112py310hb8f62fa_0 conda-forge
openmmtools 0.21.4 pyhd8ed1ab_0 conda-forge
pytorch 1.12.1 cuda112py310he33e0d6_201 conda-forge
torchani 2.2.2 cuda112py310h98dee98_6 conda-forge
My GPU is much more powerful, and yet you see higher torchani speeds than I do. I am clueless as to why; let's see if the others have some insights. @raimis @peastman @sef43 Any ideas?
This is what I get on an RTX 3090. NNPOps is faster, as expected, but torchani is slower than in all of the above:
Implementation: nnpops
MD run: 1000 steps
#"Step" "Time (ps)" "Potential Energy (kJ/mole)" "Speed (ns/day)"
100 0.10000000000000007 -20461951.441185422 0
200 0.20000000000000015 -20462188.18474654 9.89
300 0.3000000000000002 -20462210.048553117 9.92
400 0.4000000000000003 -20462229.56560606 9.91
500 0.5000000000000003 -20462397.756919313 9.92
600 0.6000000000000004 -20462263.190097418 9.91
700 0.7000000000000005 -20462424.72236422 9.91
800 0.8000000000000006 -20462420.394422207 9.91
900 0.9000000000000007 -20462443.257273544 9.9
1000 1.0000000000000007 -20462578.838789392 9.9
Implementation: torchani
MD run: 1000 steps
#"Step" "Time (ps)" "Potential Energy (kJ/mole)" "Speed (ns/day)"
100 0.10000000000000007 -20456325.17729549 0
200 0.20000000000000015 -20449428.474614438 0.685
300 0.3000000000000002 -20439698.984288476 0.97
400 0.4000000000000003 -20431854.536180284 1.13
500 0.5000000000000003 -20424569.579427063 1.22
600 0.6000000000000004 -20421131.122765917 1.29
700 0.7000000000000005 -20415078.246097725 1.34
800 0.8000000000000006 -20411468.562179044 1.37
900 0.9000000000000007 -20405688.49984586 1.38
1000 1.0000000000000007 -20401371.05406119 1.4
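The way the torchani speed column above climbs from 0.685 toward 1.4 ns/day suggests a one-off warm-up cost (e.g. TorchScript compilation on the first steps) being amortized: as far as I can tell, `StateDataReporter` reports the average speed since the first report, so a single slow interval drags every subsequent reading down. A toy sketch with assumed (not measured) per-interval wall times:

```python
# Toy model of cumulative speed reporting: one slow warm-up interval,
# then constant per-interval cost. Wall times below are hypothetical.
interval_ps = 0.1                 # simulated time per reporting interval
wall_s = [60.0] + [10.0] * 8      # assumed wall-clock seconds per interval

elapsed_ps = 0.0
elapsed_s = 0.0
speeds = []
for dt in wall_s:
    elapsed_ps += interval_ps
    elapsed_s += dt
    # cumulative average speed in ns/day, as a reporter would show it
    speeds.append(round((elapsed_ps / 1000.0) / elapsed_s * 86400.0, 3))

print(speeds)  # slowly rising toward the steady-state speed
```

The sequence rises monotonically toward the steady-state speed (0.864 ns/day here) without ever reaching it within the run, which matches the shape of the torchani numbers above.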
On my ancient GTX 1080 Ti:
MD run: 1000 steps
#"Step" "Time (ps)" "Potential Energy (kJ/mole)" "Speed (ns/day)"
100 0.10000000000000007 -20461986.05720992 0
200 0.20000000000000015 -20462011.271196663 4.32
300 0.3000000000000002 -20462185.580720104 4.29
400 0.4000000000000003 -20462218.965465758 4.29
500 0.5000000000000003 -20462187.70901094 4.28
600 0.6000000000000004 -20462401.649187673 4.29
700 0.7000000000000005 -20462252.494809993 4.29
800 0.8000000000000006 -20462291.34049955 4.29
900 0.9000000000000007 -20462193.367134728 4.29
1000 1.0000000000000007 -20462424.46822126 4.29
Implementation: torchani
MD run: 1000 steps
#"Step" "Time (ps)" "Potential Energy (kJ/mole)" "Speed (ns/day)"
100 0.10000000000000007 -20461523.953313816 0
200 0.20000000000000015 -20461520.8654142 1.72
300 0.3000000000000002 -20461375.710345387 1.64
400 0.4000000000000003 -20461691.991264936 1.7
500 0.5000000000000003 -20461734.920769025 1.8
600 0.6000000000000004 -20461817.857133113 1.83
700 0.7000000000000005 -20461817.208004408 1.89
800 0.8000000000000006 -20462153.94805858 1.92
900 0.9000000000000007 -20462099.346131183 1.96
1000 1.0000000000000007 -20462146.46586435 1.96
I have tested your script with two modifications (5K steps, write frequency set to 200 steps) on an RTX 3070 (not the same machine where I posted my initial data), and I get the following:
Implementation: nnpops
MD run: 5000 steps
#"Step" "Time (ps)" "Potential Energy (kJ/mole)" "Speed (ns/day)"
200 0.20000000000000015 -20462058.886696402 0
400 0.4000000000000003 -20462261.770402238 5.13
600 0.6000000000000004 -20462289.659775756 5.13
800 0.8000000000000006 -20462297.250888396 5.13
1000 1.0000000000000007 -20462514.469884958 5.13
1200 1.1999999999999786 -20462481.41815422 5.13
1400 1.3999999999999566 -20462505.46846665 5.13
1600 1.5999999999999346 -20462626.978850227 5.13
1800 1.7999999999999126 -20462635.409385815 5.13
2000 1.9999999999998905 -20462620.65219273 5.13
2200 2.1999999999998687 -20462562.552356746 5.13
2400 2.3999999999998467 -20462679.17831285 5.13
2600 2.5999999999998247 -20462662.771694366 5.13
2800 2.7999999999998026 -20462858.939390674 5.13
3000 2.9999999999997806 -20462769.659467954 5.13
3200 3.1999999999997586 -20462989.99891446 5.13
3400 3.3999999999997366 -20462942.58059461 5.13
3600 3.5999999999997145 -20463033.362214305 5.13
3800 3.7999999999996925 -20462924.417510215 5.13
4000 3.9999999999996705 -20463011.34567156 5.13
4200 4.199999999999737 -20462983.498863857 5.13
4400 4.399999999999804 -20463055.54401257 5.13
4600 4.599999999999871 -20463034.46516973 5.13
4800 4.799999999999938 -20463039.68574196 5.13
5000 5.000000000000004 -20463034.924004197 5.13
Implementation: torchani
MD run: 5000 steps
#"Step" "Time (ps)" "Potential Energy (kJ/mole)" "Speed (ns/day)"
200 0.20000000000000015 -20461516.526204765 0
400 0.4000000000000003 -20461600.373352136 4.85
600 0.6000000000000004 -20461745.880840868 4.85
800 0.8000000000000006 -20461985.43875364 4.82
1000 1.0000000000000007 -20462241.704375446 4.76
1200 1.1999999999999786 -20462383.980617914 4.73
1400 1.3999999999999566 -20462447.03686768 4.74
1600 1.5999999999999346 -20462761.832365412 4.75
1800 1.7999999999999126 -20462865.32426318 4.77
2000 1.9999999999998905 -20462951.05244407 4.78
2200 2.1999999999998687 -20462853.98548076 4.79
2400 2.3999999999998467 -20463030.63675009 4.77
2600 2.5999999999998247 -20463056.779047217 4.75
2800 2.7999999999998026 -20463190.083291873 4.76
3000 2.9999999999997806 -20463250.281372238 4.77
3200 3.1999999999997586 -20463391.37891716 4.78
3400 3.3999999999997366 -20463425.523587838 4.78
3600 3.5999999999997145 -20463549.743160456 4.79
3800 3.7999999999996925 -20463630.800368927 4.79
4000 3.9999999999996705 -20463500.444433052 4.8
4200 4.199999999999737 -20463560.039706334 4.8
4400 4.399999999999804 -20463650.1258757 4.8
4600 4.599999999999871 -20463781.969737258 4.81
4800 4.799999999999938 -20463734.84937812 4.81
5000 5.000000000000004 -20463774.00930356 4.81
Given torchani's low GPU utilization, CPU performance may be playing a role here; perhaps your original RTX 2060 machine has a particularly powerful CPU. The others can provide more insight, but reporting only every 100 steps may also be hiding potential gains on some systems. In essence, I do not see anything weird going on here. Hopefully in the future we can make the gains even better :P
Yes, it seems to be very GPU dependent, with the higher-end cards (more CUDA cores) getting much more of a speedup from NNPOps.
I think it would be useful to collate some performance benchmarks like the above on different hardware and system sizes, so people can know if their systems are running at the expected speed, i.e. similar to https://openmm.org/benchmarks
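As a starting point, here is a sketch of such a table, collated from the final speeds reported in this thread for the 15 Å water box (numbers copied from the comments above; treat them as anecdotal, since software versions differ between posters):

```python
# Final reported speeds (ns/day) for the 15 A water box, from this thread
results = {
    "RTX 3090":    {"nnpops": 9.9,  "torchani": 1.4},
    "RTX 2080 Ti": {"nnpops": 6.8,  "torchani": 2.27},
    "RTX 3070":    {"nnpops": 5.13, "torchani": 4.81},
    "GTX 1080 Ti": {"nnpops": 4.29, "torchani": 1.96},
    "RTX 2060":    {"nnpops": 3.8,  "torchani": 3.7},
}

print(f"{'GPU':<12} {'NNPOps':>7} {'TorchANI':>9} {'speedup':>8}")
for gpu, r in results.items():
    print(f"{gpu:<12} {r['nnpops']:>7.2f} {r['torchani']:>9.2f} "
          f"{r['nnpops'] / r['torchani']:>7.2f}x")
```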
Did we ever figure out what was happening here?
Hi,
when I simulate a 15 Angstrom water box with the `torchani` and `nnpops` implementations, the `torchani` implementation is slightly faster. Does `nnpops` only outperform `torchani` at small system sizes? I have attached a minimal example that reproduces the shown output: min.py.zip
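Regarding system size: a back-of-the-envelope estimate using the number density of liquid water (about 0.0334 molecules/Å³) shows that a 15 Å box holds only ~340 atoms, which is small enough that fixed per-step overheads, rather than raw GPU throughput, can dominate. The helper function below is a hypothetical sketch, not part of any of the libraries discussed here:

```python
WATER_NUMBER_DENSITY = 0.0334  # molecules per cubic angstrom (liquid water, ~300 K)


def waterbox_atoms(edge_angstrom: float) -> int:
    """Approximate atom count of a cubic water box with the given edge length."""
    n_waters = WATER_NUMBER_DENSITY * edge_angstrom ** 3
    return round(3 * n_waters)  # 3 atoms per water molecule


for edge in (15, 20, 30, 40):
    print(f"{edge:>3} A box: ~{waterbox_atoms(edge)} atoms")
```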