openmm / NNPOps

High-performance operations for neural network potentials

NNPOps 0.6 #107

Closed: raimis closed this 1 year ago

raimis commented 1 year ago

Let's start the new release.

New features:

Testing improvements:

raimis commented 1 year ago

@RaulPPelaez any chance to finish #105?

raimis commented 1 year ago

Does anybody want to implement/finish something more?

Ping: @peastman

RaulPPelaez commented 1 year ago

@RaulPPelaez any chance to finish #105?

It defeated me tbh. I do not think it is worth stopping the release for it. I will give it another spin.

raimis commented 1 year ago

The tests on the GPU don't pass: https://github.com/openmm/NNPOps/actions/runs/5645655178

RaulPPelaez commented 1 year ago

I think both failures can be attributed to the inaccuracy introduced by the non-deterministic order of operations in the CUDA versions. The first test yields an error of 0.0343 (the tolerance is 0.025) after a sum operation. The second has a relative error tolerance of 1e-7, which is already quite demanding for single precision.
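As an illustration of why reduction order matters at single precision (a minimal sketch, not NNPOps code): float32 addition is not associative, so a non-deterministic summation order can shift the result by more than a tight tolerance allows.

```python
import numpy as np

# float32 addition is not associative: the same three numbers summed
# in different orders give different results.
x = np.float32(1e8)
one = np.float32(1.0)

s1 = (x + one) - x   # the 1.0 is lost: 1e8 + 1 rounds back to 1e8 in float32
s2 = (x - x) + one   # the 1.0 survives

print(s1, s2)  # 0.0 1.0
```

A CUDA reduction that changes the order in which partial sums are combined from run to run produces exactly this kind of drift.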

peastman commented 1 year ago

Sounds good to me.

mikemhenry commented 1 year ago

Do we want to update the tests to make them less strict?

RaulPPelaez commented 1 year ago

Maybe this is enough? https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html
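For reference, opting in looks roughly like this (a sketch; per the PyTorch docs, some CUDA ops additionally require the `CUBLAS_WORKSPACE_CONFIG` environment variable to be set, and ops without a deterministic implementation will raise an error):

```python
import torch

# Ask PyTorch to use deterministic implementations where available.
torch.use_deterministic_algorithms(True)
assert torch.are_deterministic_algorithms_enabled()

# With deterministic algorithms enabled, repeating the same reduction
# on the same tensor gives bitwise-identical results.
x = torch.randn(1000)
assert torch.equal(x.sum(), x.sum())
```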

RaulPPelaez commented 1 year ago

Both tests pass locally on a 4090 at the current master (5e2438d), so perhaps the CI failure was just a fluke? EDIT: I also tried on a GPU with arch 75 (the same as in the CI) just in case, and one of the BatchedNN tests also failed because of a tolerance issue.

        assert energy_error < 5e-7
        if molFile == '3o99':
            assert grad_error < 0.05 # Some numerical instability
        else:
>           assert grad_error < 5e-3
E           AssertionError: assert tensor(0.0071, device='cuda:0') < 0.005

I do not think this is due to a bug; I would blame whatever kernel torch uses for sm_75 being a tad less accurate. Using deterministic algorithms does indeed make the error go away. I opened #108 to try this in the CI.
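One way such an assertion could be loosened per case (a sketch reusing the thresholds from the snippet above; the helper name and set are hypothetical, not NNPOps code):

```python
# Hypothetical helper mirroring the failing assertion quoted above:
# a relaxed gradient tolerance for cases with known numerical instability.
UNSTABLE_MOLS = {'3o99'}

def check_errors(mol_file, energy_error, grad_error):
    assert energy_error < 5e-7
    # 0.05 for known-unstable molecules, 5e-3 otherwise
    grad_tol = 0.05 if mol_file in UNSTABLE_MOLS else 5e-3
    assert grad_error < grad_tol
```

Per-architecture tolerances (e.g. looser bounds on sm_75) could be handled the same way, though enabling deterministic algorithms as in #108 avoids the problem at the source.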

mikemhenry commented 1 year ago

Okay so now that #108 is merged are we ready for a new release? There have been some breaking but very good changes in conda-forge packaging for CUDA https://github.com/conda-forge/conda-forge.github.io/issues/1963 so it might take a bit to get that sorted. I will get the release on conda-forge started, but will wait to actually merge it in until we decide if the release is ready.

RaulPPelaez commented 1 year ago

The changes should not affect CUDA<12 builds, right? AFAIK the new packages are only for CUDA >=12. Anyhow, I'd say we are ready to release v0.6. cc @raimis, could you please?

raimis commented 1 year ago

OK! I'm making the release

raimis commented 1 year ago

We have a release: https://github.com/openmm/NNPOps/releases/tag/v0.6

raimis commented 1 year ago

@mikemhenry you can proceed with https://github.com/conda-forge/nnpops-feedstock/pull/24. If CUDA 12 is an issue, we can build just for CUDA 11. And then sort out CUDA 12.

raimis commented 1 year ago

NNPOps 0.6 is out!

@jchodera could you make a tweet (or whatever it is called now)?