Closed peastman closed 1 year ago
I added the CUDA implementation. It mostly works, but `test_neighbor_grads()` fails with an error I'm not sure of the best way to handle:

`RuntimeError: function torch::autograd::CppNode returned an incorrect number of gradients (expected 4, got 3)`

Since I added `box_vectors` as a fourth argument, autograd expects it to return the gradient with respect to that argument. But we don't calculate it, and I'm not sure it would even really make sense: any gradient with respect to the box vectors will be full of discontinuities.
Just return an empty tensor (`Tensor()`) to indicate that the argument is not differentiable.
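At the Python level the same idea is expressed by returning `None` from `backward()` for the non-differentiable input. The sketch below is a toy stand-in for the real neighbor op (the computation and shapes are illustrative, not the actual PR code):

```python
import torch

class ScaleSum(torch.autograd.Function):
    """Toy op: positions are differentiable, box_vectors are not."""

    @staticmethod
    def forward(ctx, positions, box_vectors):
        ctx.save_for_backward(positions)
        return (2.0 * positions).sum()

    @staticmethod
    def backward(ctx, grad_output):
        (positions,) = ctx.saved_tensors
        grad_positions = 2.0 * grad_output * torch.ones_like(positions)
        # None plays the role of the undefined Tensor() in the C++ API:
        # it tells autograd this input is not differentiable.
        return grad_positions, None

pos = torch.randn(4, 3, requires_grad=True)
box = torch.eye(3)
ScaleSum.apply(pos, box).backward()
```

With this, autograd receives the expected number of gradients (one per input) without any gradient actually being computed for the box vectors.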
Thanks! I made the change and the test now passes.
When I run the complete `TestNeighbors.py` suite, I still get errors in the CUDA version of `test_periodic_neighbors()`:

`RuntimeError: CUDA error: device-side assert triggered`
After a while of debugging, I figured out it doesn't really have anything to do with that test. It's actually caused by `test_too_many_neighbors()`, which intentionally triggers an assertion. The error condition somehow isn't getting cleared, so every CUDA test that runs after it fails. You can observe this by simply adding the line `pt.cuda.synchronize()` to the end of `test_too_many_neighbors()`; that will cause it to always fail.
Any suggestions about what to do with `test_too_many_neighbors()`? As far as I can tell `torch.cuda` doesn't provide any way to reset the device. Once an assert has been triggered, there's no way to clear it, and any further CUDA operation in that process will fail. The obvious solution is not to run that test on CUDA.
One option is to call `cudaDeviceReset` using `ctypes` (https://docs.python.org/3/library/ctypes.html).
If PyTorch doesn't provide a safe way to reset the device, going behind its back to call a CUDA function directly will likely cause errors as well: resetting the device invalidates all of PyTorch's existing handles to resources on the GPU, but PyTorch doesn't know they've been invalidated.
For the moment, I've limited that test to CPU. It's not ideal, but I don't have a better solution.
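For reference, restricting the test to CPU can be as simple as leaving `'cuda'` out of the device parametrization. The snippet below is an illustrative sketch, not the actual `TestNeighbors.py` code:

```python
import pytest

# 'cuda' is deliberately excluded here: the intentional device-side
# assert in this test would poison the CUDA context for every test run
# after it in the same process. (Illustrative sketch only.)
DEVICES = ["cpu"]

@pytest.mark.parametrize("device", DEVICES)
def test_too_many_neighbors(device):
    # Stand-in body for the real overflow test.
    assert device == "cpu"
```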
Each test is run in a separate process, so after the device is reset PyTorch will go through its normal initialization for the next test.
You're welcome to see if you can figure out a way to get it to work. But in the meantime, let's not hold up a useful feature over a broken unit test that isn't even related to the new feature.
OK! Let's disable the test for now. What else is missing to finish this PR?
It's all ready for review.
Great! I'll look at it.
@RaulPPelaez how do you handle kernel errors in your code? We need some way to handle this.
AFAIK there is no clean way to assert in CUDA. As you mentioned, a device-side assert leaves the CUDA context in an unusable state.

What I normally end up doing is to keep some `errorState` array/value in device (or managed) memory. A thread in a kernel that encounters an error atomically writes to this `errorState` and returns as fast as possible. Then you delay checking this value for errors as long as possible. If you at least have a record of the error state, the user can query it manually with some kind of `checkErrorState()` when they notice results are incorrect (unless the code just crashes, that is). I have never found a clean way to do this without requiring some kind of synchronization (like a device-to-host copy or a stream sync).

If you think about it, this is the way errors work in CUDA: you need to manually synchronize to query the current error state, e.g. `auto err = cudaDeviceSynchronize();`. So if they have not figured out a better way...
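The flag-and-deferred-check pattern can be sketched with a small host-side analogue, where worker threads stand in for CUDA threads and a shared one-element list stands in for an `errorState` value in device/managed memory (all names here are illustrative):

```python
import threading

# Hypothetical error codes; OK means no error has been recorded.
OK, TOO_MANY_NEIGHBORS = 0, 1
error_state = [OK]  # stands in for a flag in device/managed memory

def kernel_thread(num_found, max_neighbors):
    # On error, record the condition and return early instead of
    # asserting (a device-side assert would poison the CUDA context).
    if num_found > max_neighbors:
        error_state[0] = TOO_MANY_NEIGHBORS
        return
    # ...normal neighbor-list work would go here...

def check_error_state():
    # Deferred check, analogous to syncing the stream and reading the
    # flag back on the host.
    if error_state[0] == TOO_MANY_NEIGHBORS:
        raise RuntimeError("too many neighbors found")

threads = [threading.Thread(target=kernel_thread, args=(n, 32))
           for n in (10, 50, 20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the join (the analogue of a synchronization), calling `check_error_state()` raises, because one worker exceeded the capacity of 32.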
> Any suggestions about what to do with `test_too_many_neighbors()`? As far as I can tell `torch.cuda` doesn't provide any way to reset the device. Once an assert has been triggered, there's no way to clear it and any further CUDA operation in that process will fail. The obvious solution is not to run that test on CUDA.
What is the intended way of using this functionality? A priori one does not know the total number of pairs, right? I understand it is required, or at least useful, to control the maximum number of neighbors per particle from outside, but how does one use it in practice? In the past I have done things like: set a maximum of 32 neighbors; if building fails because that is too low, increase it by 32 until it no longer fails.

If something like that is the case here, an extra parameter could be passed to choose whether or not to synchronize and check for a `tooManyNeighbours` error flag, so the maximum number of neighbors can be found as a precomputation. When constructing the CUDA graph this check would be omitted.
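The grow-until-it-fits strategy described above can be sketched as follows; `build_fn` and its return convention are hypothetical stand-ins for the real build call plus a flag check after synchronization:

```python
def build_with_retries(build_fn, start=32, step=32, limit=1024):
    """Grow the neighbor capacity until the build no longer overflows.

    build_fn(max_neighbors) is assumed to return (result, overflowed),
    where overflowed reflects a tooManyNeighbours-style flag read back
    after a synchronization. Everything here is a hypothetical sketch.
    """
    max_neighbors = start
    while max_neighbors <= limit:
        result, overflowed = build_fn(max_neighbors)
        if not overflowed:
            return result, max_neighbors
        max_neighbors += step
    raise RuntimeError(f"no successful build up to max_neighbors={limit}")

# Example: a fake builder that overflows whenever capacity < 70.
pairs, capacity = build_with_retries(lambda m: ("pairs", m < 70))
# capacity == 96 (32 and 64 overflow, 96 succeeds)
```

Once a working capacity has been found this way, the synchronizing check can be skipped on subsequent builds, which is what makes the approach compatible with CUDA graph capture.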
> What I normally end up doing is to keep some `errorState` array/value in device (or managed) memory.
That sounds like a good approach.
Let's merge this now and add error checking along those lines in a separate PR. That's going to require significant design to figure out an efficient mechanism for the error reporting.
@peastman let's merge this!
@RaulPPelaez could you open a dedicated issue to discuss and design the error check?
I've implemented the CPU version but not the CUDA version so far. Please take a look and see if the API and implementation look OK.