openmm / NNPOps

High-performance operations for neural network potentials

torchani faster than nnpops for larger systems? #85

Open wiederm opened 1 year ago

wiederm commented 1 year ago

Hi,

when I simulate a 15 Angstrom water box with both the torchani and nnpops implementations, the torchani implementation is slightly faster. Does nnpops only outperform torchani at small system sizes? I have attached a minimal example that reproduces the output shown below.

# NNPOPS
Implementation: nnpops

MD run: 1000 steps
#"Step" "Time (ps)"     "Potential Energy (kJ/mole)"    "Speed (ns/day)"
100     0.10000000000000007     -20461968.233400125     0
200     0.20000000000000015     -20462109.582584146     3.02
300     0.3000000000000002      -20462215.08696869      3.02
400     0.4000000000000003      -20462184.75506845      3.02
500     0.5000000000000003      -20462176.182438154     3.02
600     0.6000000000000004      -20462290.934872355     3.02
700     0.7000000000000005      -20462276.06124924      3.02
800     0.8000000000000006      -20462268.749944247     3.01
900     0.9000000000000007      -20462303.856101606     3.01
1000    1.0000000000000007      -20462353.939166784     3.01
# TorchANI
Implementation: torchani

MD run: 1000 steps
#"Step" "Time (ps)"     "Potential Energy (kJ/mole)"    "Speed (ns/day)"
100     0.10000000000000007     -20456827.93509699      0
200     0.20000000000000015     -20453552.138266437     3.36
300     0.3000000000000002      -20446930.31249438      3.39
400     0.4000000000000003      -20442156.674454395     3.39
500     0.5000000000000003      -20434295.0773298       2.97
600     0.6000000000000004      -20432329.317804128     3.03
700     0.7000000000000005      -20427635.139502555     3
800     0.8000000000000006      -20422604.906581655     3.04
900     0.9000000000000007      -20420074.77440338      3.07
1000    1.0000000000000007      -20414884.105911426     3.09

min.py.zip

RaulPPelaez commented 1 year ago

In your code, AFAIK, you are using the CPU implementation. Is this intended?

JohannesKarwou commented 1 year ago

It's not intended to use the CPU implementation... I thought implementation = nnpops uses the GPU by default (that's how nnpops is invoked in openmm-ml, https://github.com/openmm/openmm-ml/blob/c3d8c28eb92bf5c4b16efb81ad7a44b707fc5907/openmmml/models/anipotential.py#L89, when using createSystem). If I run the script on my machine, the GPU is used, but I get similar results to @wiederm (torchani and nnpops being equally fast).
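
For reference, here is a minimal sketch of how one can at least confirm that the OpenMM side of the calculation lands on the CUDA platform (this only checks the OpenMM platform; where the TorchScript module inside TorchForce lives is handled separately by openmm-torch):

from openmm import LangevinIntegrator, Platform, unit
from openmm.app import Simulation
from openmmml import MLPotential
from openmmtools.testsystems import WaterBox

# Build a small ANI-2x water box, requesting the nnpops implementation
waterbox = WaterBox(box_edge=15 * unit.angstrom)
system = MLPotential("ani2x").createSystem(waterbox.topology, implementation="nnpops")
simulation = Simulation(
    waterbox.topology,
    system,
    LangevinIntegrator(300 * unit.kelvin, 1 / unit.picosecond, 1 * unit.femtosecond),
    Platform.getPlatformByName("CUDA"),
)
# Prints "CUDA" if the context was created on the GPU platform
print(simulation.context.getPlatform().getName())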

RaulPPelaez commented 1 year ago

Take this minimal example, which is similar to the one provided by @wiederm:

import sys

from openmm import LangevinIntegrator, Platform, unit
from openmm.app import Simulation, StateDataReporter
from openmmml import MLPotential
from openmmtools.testsystems import WaterBox

# Constants that may be modified by the user
step = 1000
waterbox = WaterBox(box_edge=15 * unit.angstrom)
nnp = MLPotential("ani2x")
platform = Platform.getPlatformByName("CUDA")
prop = dict(CudaPrecision="mixed")

for implementation in ("nnpops", "torchani"):
    print(f"Implementation: {implementation}")
    # Build the ANI-2x system with the selected implementation
    ml_system = nnp.createSystem(waterbox.topology, implementation=implementation)
    simulation = Simulation(
        waterbox.topology,
        ml_system,
        LangevinIntegrator(300 * unit.kelvin, 1 / unit.picosecond, 1 * unit.femtosecond),
        platform,
        prop,
    )
    simulation.context.setPositions(waterbox.positions)
    # Production
    if step > 0:
        print("\nMD run: %s steps" % step)
        simulation.reporters.append(
            StateDataReporter(
                sys.stdout,
                reportInterval=100,
                step=True,
                time=True,
                potentialEnergy=True,
                speed=True,
                separator="\t",
            )
        )
        simulation.step(step)

On my GPU, an RTX 2080 Ti, I get this:

Implementation: nnpops

MD run: 1000 steps
#"Step" "Time (ps)" "Potential Energy (kJ/mole)"    "Speed (ns/day)"
100 0.10000000000000007 -20461978.001629103 0
200 0.20000000000000015 -20462133.848855495 6.8
300 0.3000000000000002  -20462153.789688706 6.79
400 0.4000000000000003  -20462202.823631693 6.79
500 0.5000000000000003  -20462257.760451913 6.79
600 0.6000000000000004  -20462329.421256337 6.79
700 0.7000000000000005  -20462362.9969222   6.8
800 0.8000000000000006  -20462488.402703974 6.8
900 0.9000000000000007  -20462532.231097963 6.8
1000    1.0000000000000007  -20462481.48763666  6.8
Implementation: torchani

MD run: 1000 steps
#"Step" "Time (ps)" "Potential Energy (kJ/mole)"    "Speed (ns/day)"
100 0.10000000000000007 -20456285.324413814 0
200 0.20000000000000015 -20451616.878087416 2.48
300 0.3000000000000002  -20445519.385244645 2.49
400 0.4000000000000003  -20438851.384950936 2.19
500 0.5000000000000003  -20431004.40918439  2.13
600 0.6000000000000004  -20426584.870540198 2.18
700 0.7000000000000005  -20415840.214279402 2.22
800 0.8000000000000006  -20411478.48251822  2.24
900 0.9000000000000007  -20409822.772401713 2.26
1000    1.0000000000000007  -20402172.29296462  2.27

Note, however, that the GPU utilization I am seeing for the torchani implementation is low (under 30%), whereas NNPOps is using 100%. Which GPU are you running on?
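
For anyone wanting to check the utilization on their own machine, one way is to poll nvidia-smi while the run is in progress; a rough sketch (assumes nvidia-smi is on the PATH and the simulation is running on device 0):

import subprocess
import time

def gpu_utilization(device: int = 0) -> int:
    """Return the instantaneous GPU utilization (percent) reported by nvidia-smi."""
    out = subprocess.check_output(
        [
            "nvidia-smi",
            f"--id={device}",
            "--query-gpu=utilization.gpu",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    return int(out.strip())

# Example: sample once a second while the MD run is in progress in another process
for _ in range(10):
    print(f"GPU utilization: {gpu_utilization()}%")
    time.sleep(1)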

JohannesKarwou commented 1 year ago

I'm using an RTX 2060. If I use your script, I get this output:

Warning on use of the timeseries module: If the inherent timescales of the system are long compared to those being analyzed, this statistical inefficiency may be an underestimate.  The estimate presumes the use of many statistically independent samples.  Tests should be performed to assess whether this condition is satisfied.   Be cautious in the interpretation of the data.
Implementation: nnpops
/scratch/data/johannes/miniconda3/envs/openmmml-test/lib/python3.10/site-packages/torchani/__init__.py:55: UserWarning: Dependency not satisfied, torchani.ase will not be available
  warnings.warn("Dependency not satisfied, torchani.ase will not be available")
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
/scratch/data/johannes/miniconda3/envs/openmmml-test/lib/python3.10/site-packages/torchani/resources/

MD run: 1000 steps
#"Step" "Time (ps)" "Potential Energy (kJ/mole)"    "Speed (ns/day)"
100 0.10000000000000007 -20461915.171353746 0
200 0.20000000000000015 -20462175.772429183 3.8
300 0.3000000000000002  -20462196.65597004  3.77
400 0.4000000000000003  -20462139.43374104  3.78
500 0.5000000000000003  -20462263.351597134 3.79
600 0.6000000000000004  -20462383.616304602 3.79
700 0.7000000000000005  -20462338.201707978 3.79
800 0.8000000000000006  -20462476.36408945  3.79
900 0.9000000000000007  -20462493.882426914 3.79
1000    1.0000000000000007  -20462619.096036546 3.8
Implementation: torchani
/scratch/data/johannes/miniconda3/envs/openmmml-test/lib/python3.10/site-packages/torchani/resources/

MD run: 1000 steps
[W BinaryOps.cpp:594] Warning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (function operator())
#"Step" "Time (ps)" "Potential Energy (kJ/mole)"    "Speed (ns/day)"
100 0.10000000000000007 -20461319.38262451  0
200 0.20000000000000015 -20461412.834623236 3.64
300 0.3000000000000002  -20461563.888187505 3.62
400 0.4000000000000003  -20461594.997226883 3.65
500 0.5000000000000003  -20461698.649372432 3.68
600 0.6000000000000004  -20461768.927100796 3.68
700 0.7000000000000005  -20461759.561995145 3.7
800 0.8000000000000006  -20462005.177399218 3.7
900 0.9000000000000007  -20462068.944122538 3.71
1000    1.0000000000000007  -20462172.986246087 3.7

For nnpops I see a GPU utilization of 100%, and for torchani around 30%. These are the packages in my environment:

cudatoolkit               11.4.2              h7a5bcfd_11    conda-forge
nnpops                    0.3             cuda112py310h8b99da5_1    conda-forge
openmm                    8.0.0           py310h5728c26_0    conda-forge
openmm-ml                 1.0                pyhd8ed1ab_0    conda-forge
openmm-torch              1.0             cuda112py310hb8f62fa_0    conda-forge
openmmtools               0.21.4             pyhd8ed1ab_0    conda-forge
pytorch                   1.12.1          cuda112py310he33e0d6_201    conda-forge
torchani                  2.2.2           cuda112py310h98dee98_6    conda-forge

RaulPPelaez commented 1 year ago

My GPU is much more powerful, and yet you are seeing higher torchani speeds than I am. I am clueless as to why; let's see if the others have some insights. @raimis @peastman @sef43 Any ideas?

sef43 commented 1 year ago

This is what I get on an RTX 3090. NNPOps is faster, as expected, but torchani is slower than in all of the runs above:

Implementation: nnpops

MD run: 1000 steps
#"Step" "Time (ps)" "Potential Energy (kJ/mole)"    "Speed (ns/day)"
100 0.10000000000000007 -20461951.441185422 0
200 0.20000000000000015 -20462188.18474654  9.89
300 0.3000000000000002  -20462210.048553117 9.92
400 0.4000000000000003  -20462229.56560606  9.91
500 0.5000000000000003  -20462397.756919313 9.92
600 0.6000000000000004  -20462263.190097418 9.91
700 0.7000000000000005  -20462424.72236422  9.91
800 0.8000000000000006  -20462420.394422207 9.91
900 0.9000000000000007  -20462443.257273544 9.9
1000    1.0000000000000007  -20462578.838789392 9.9

Implementation: torchani

MD run: 1000 steps
#"Step" "Time (ps)" "Potential Energy (kJ/mole)"    "Speed (ns/day)"
100 0.10000000000000007 -20456325.17729549  0
200 0.20000000000000015 -20449428.474614438 0.685
300 0.3000000000000002  -20439698.984288476 0.97
400 0.4000000000000003  -20431854.536180284 1.13
500 0.5000000000000003  -20424569.579427063 1.22
600 0.6000000000000004  -20421131.122765917 1.29
700 0.7000000000000005  -20415078.246097725 1.34
800 0.8000000000000006  -20411468.562179044 1.37
900 0.9000000000000007  -20405688.49984586  1.38
1000    1.0000000000000007  -20401371.05406119  1.4

raimis commented 1 year ago

On my ancient GTX 1080 Ti:

Implementation: nnpops

MD run: 1000 steps
#"Step" "Time (ps)" "Potential Energy (kJ/mole)"    "Speed (ns/day)"
100 0.10000000000000007 -20461986.05720992  0
200 0.20000000000000015 -20462011.271196663 4.32
300 0.3000000000000002  -20462185.580720104 4.29
400 0.4000000000000003  -20462218.965465758 4.29
500 0.5000000000000003  -20462187.70901094  4.28
600 0.6000000000000004  -20462401.649187673 4.29
700 0.7000000000000005  -20462252.494809993 4.29
800 0.8000000000000006  -20462291.34049955  4.29
900 0.9000000000000007  -20462193.367134728 4.29
1000    1.0000000000000007  -20462424.46822126  4.29
Implementation: torchani

MD run: 1000 steps
#"Step" "Time (ps)" "Potential Energy (kJ/mole)"    "Speed (ns/day)"
100 0.10000000000000007 -20461523.953313816 0
200 0.20000000000000015 -20461520.8654142   1.72
300 0.3000000000000002  -20461375.710345387 1.64
400 0.4000000000000003  -20461691.991264936 1.7
500 0.5000000000000003  -20461734.920769025 1.8
600 0.6000000000000004  -20461817.857133113 1.83
700 0.7000000000000005  -20461817.208004408 1.89
800 0.8000000000000006  -20462153.94805858  1.92
900 0.9000000000000007  -20462099.346131183 1.96
1000    1.0000000000000007  -20462146.46586435  1.96

wiederm commented 1 year ago

I have tested your script with two modifications (5,000 steps, write frequency set to 200 steps) on an RTX 3070 (not the same machine from which I posted my initial data), and I get the following:

Implementation: nnpops

MD run: 5000 steps
#"Step" "Time (ps)" "Potential Energy (kJ/mole)"    "Speed (ns/day)"
200 0.20000000000000015 -20462058.886696402 0
400 0.4000000000000003  -20462261.770402238 5.13
600 0.6000000000000004  -20462289.659775756 5.13
800 0.8000000000000006  -20462297.250888396 5.13
1000    1.0000000000000007  -20462514.469884958 5.13
1200    1.1999999999999786  -20462481.41815422  5.13
1400    1.3999999999999566  -20462505.46846665  5.13
1600    1.5999999999999346  -20462626.978850227 5.13
1800    1.7999999999999126  -20462635.409385815 5.13
2000    1.9999999999998905  -20462620.65219273  5.13
2200    2.1999999999998687  -20462562.552356746 5.13
2400    2.3999999999998467  -20462679.17831285  5.13
2600    2.5999999999998247  -20462662.771694366 5.13
2800    2.7999999999998026  -20462858.939390674 5.13
3000    2.9999999999997806  -20462769.659467954 5.13
3200    3.1999999999997586  -20462989.99891446  5.13
3400    3.3999999999997366  -20462942.58059461  5.13
3600    3.5999999999997145  -20463033.362214305 5.13
3800    3.7999999999996925  -20462924.417510215 5.13
4000    3.9999999999996705  -20463011.34567156  5.13
4200    4.199999999999737   -20462983.498863857 5.13
4400    4.399999999999804   -20463055.54401257  5.13
4600    4.599999999999871   -20463034.46516973  5.13
4800    4.799999999999938   -20463039.68574196  5.13
5000    5.000000000000004   -20463034.924004197 5.13
Implementation: torchani

MD run: 5000 steps
#"Step" "Time (ps)" "Potential Energy (kJ/mole)"    "Speed (ns/day)"
200 0.20000000000000015 -20461516.526204765 0
400 0.4000000000000003  -20461600.373352136 4.85
600 0.6000000000000004  -20461745.880840868 4.85
800 0.8000000000000006  -20461985.43875364  4.82
1000    1.0000000000000007  -20462241.704375446 4.76
1200    1.1999999999999786  -20462383.980617914 4.73
1400    1.3999999999999566  -20462447.03686768  4.74
1600    1.5999999999999346  -20462761.832365412 4.75
1800    1.7999999999999126  -20462865.32426318  4.77
2000    1.9999999999998905  -20462951.05244407  4.78
2200    2.1999999999998687  -20462853.98548076  4.79
2400    2.3999999999998467  -20463030.63675009  4.77
2600    2.5999999999998247  -20463056.779047217 4.75
2800    2.7999999999998026  -20463190.083291873 4.76
3000    2.9999999999997806  -20463250.281372238 4.77
3200    3.1999999999997586  -20463391.37891716  4.78
3400    3.3999999999997366  -20463425.523587838 4.78
3600    3.5999999999997145  -20463549.743160456 4.79
3800    3.7999999999996925  -20463630.800368927 4.79
4000    3.9999999999996705  -20463500.444433052 4.8
4200    4.199999999999737   -20463560.039706334 4.8
4400    4.399999999999804   -20463650.1258757   4.8
4600    4.599999999999871   -20463781.969737258 4.81
4800    4.799999999999938   -20463734.84937812  4.81
5000    5.000000000000004   -20463774.00930356  4.81

RaulPPelaez commented 1 year ago

Given torchani's low GPU utilization, maybe CPU performance is playing a role here. Perhaps your original 2060 machine has a particularly powerful CPU. The others can probably provide more insight, but reporting every 100 steps might also be hiding some of the potential gains on some systems. In essence, I do not see anything weird going on here. Hopefully we can make the gains even bigger in the future :P
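
One way to rule the reporter in or out would be to time the run directly, with no StateDataReporter attached; a sketch, assuming the simulation object from the script above and its 1 fs time step:

import time

n_steps = 1000
simulation.step(100)  # warm-up: JIT compilation, CUDA kernel caching, etc.

start = time.perf_counter()
simulation.step(n_steps)
elapsed = time.perf_counter() - start

ns_simulated = n_steps * 1e-6  # 1 fs per step = 1e-6 ns per step
print(f"{ns_simulated / (elapsed / 86400.0):.2f} ns/day")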

sef43 commented 1 year ago

Yes, it seems to be very GPU-dependent, with the higher-end cards (those with more CUDA cores) getting much more of a speedup from NNPOps.

sef43 commented 1 year ago

I think it would be useful to collate some performance benchmarks like the above across different hardware and system sizes, so people can tell whether their systems are running at the expected speed, similar to https://openmm.org/benchmarks. A sketch of what such a sweep could look like is below.
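
As a starting point, here is a rough sketch of a benchmark sweep over system sizes (the box edges and step counts are illustrative choices, not agreed-upon benchmark settings):

import time

from openmm import LangevinIntegrator, Platform, unit
from openmm.app import Simulation
from openmmml import MLPotential
from openmmtools.testsystems import WaterBox

nnp = MLPotential("ani2x")
platform = Platform.getPlatformByName("CUDA")

for box_edge in (10, 15, 20, 25):  # Angstrom
    waterbox = WaterBox(box_edge=box_edge * unit.angstrom)
    n_atoms = waterbox.system.getNumParticles()
    for implementation in ("nnpops", "torchani"):
        ml_system = nnp.createSystem(waterbox.topology, implementation=implementation)
        simulation = Simulation(
            waterbox.topology,
            ml_system,
            LangevinIntegrator(300 * unit.kelvin, 1 / unit.picosecond, 1 * unit.femtosecond),
            platform,
            dict(CudaPrecision="mixed"),
        )
        simulation.context.setPositions(waterbox.positions)
        simulation.step(100)  # warm-up before timing
        start = time.perf_counter()
        simulation.step(1000)
        elapsed = time.perf_counter() - start
        ns_day = (1000 * 1e-6) / (elapsed / 86400.0)  # 1 fs step = 1e-6 ns
        print(f"{n_atoms:6d} atoms  {implementation:9s}  {ns_day:6.2f} ns/day")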

jchodera commented 1 year ago

Did we ever figure out what was happening here?