Closed peastman closed 1 year ago
I added the CUDA implementation. It mostly works, but `test_neighbor_grads()` fails with an error I'm not sure of the best way to handle:

`RuntimeError: function torch::autograd::CppNode returned an incorrect number of gradients (expected 4, got 3)`

Since I added `box_vectors` as a fourth argument, autograd expects it to return the gradient with respect to that argument. But we don't calculate it, and I'm not sure it would even really make sense: any gradient with respect to the box vectors will be full of discontinuities.
Just return an empty tensor (`Tensor()`) to indicate that the argument is not differentiable.
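At the Python level the same idea is expressed by returning `None` from `backward()` for the non-differentiable input. The sketch below is a toy stand-in for the real neighbor op (the computation and shapes are illustrative, not the actual PR code):

```python
import torch

class ScaleSum(torch.autograd.Function):
    """Toy op: positions are differentiable, box_vectors are not."""

    @staticmethod
    def forward(ctx, positions, box_vectors):
        ctx.save_for_backward(positions)
        return (2.0 * positions).sum()

    @staticmethod
    def backward(ctx, grad_output):
        (positions,) = ctx.saved_tensors
        grad_positions = 2.0 * grad_output * torch.ones_like(positions)
        # None plays the role of the undefined Tensor() in the C++ API:
        # it tells autograd this input is not differentiable.
        return grad_positions, None

pos = torch.randn(4, 3, requires_grad=True)
box = torch.eye(3)
ScaleSum.apply(pos, box).backward()
```

With this, autograd receives the expected number of gradients (one per input) without any gradient actually being computed for the box vectors.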
Thanks! I made the change and the test now passes.
When I run the complete `TestNeighbors.py` suite, I still get errors in the CUDA version of `test_periodic_neighbors()`:

`RuntimeError: CUDA error: device-side assert triggered`
After a while of debugging, I figured out it doesn't really have anything to do with that test. It's actually caused by `test_too_many_neighbors()`, which intentionally triggers an assertion. The error condition somehow isn't getting cleared, so every CUDA test that runs after it fails. You can observe this by simply adding the line `pt.cuda.synchronize()` to the end of `test_too_many_neighbors()`; that will cause it to always fail.
Any suggestions about what to do with `test_too_many_neighbors()`? As far as I can tell `torch.cuda` doesn't provide any way to reset the device. Once an assert has been triggered, there's no way to clear it, and any further CUDA operation in that process will fail. The obvious solution is not to run that test on CUDA.
One option is to call `cudaDeviceReset` using `ctypes` (https://docs.python.org/3/library/ctypes.html).
If PyTorch doesn't provide a safe way to reset the device, going behind its back to call a CUDA function directly will likely cause errors as well: resetting the device invalidates all of PyTorch's existing handles to resources on the GPU, but PyTorch doesn't know they've been invalidated.
For the moment, I've limited that test to CPU. It's not ideal, but I don't have a better solution.
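For reference, restricting the test to CPU can be as simple as leaving `'cuda'` out of the device parametrization. The snippet below is an illustrative sketch, not the actual `TestNeighbors.py` code:

```python
import pytest

# 'cuda' is deliberately excluded here: the intentional device-side
# assert in this test would poison the CUDA context for every test run
# after it in the same process. (Illustrative sketch only.)
DEVICES = ["cpu"]

@pytest.mark.parametrize("device", DEVICES)
def test_too_many_neighbors(device):
    # Stand-in body for the real overflow test.
    assert device == "cpu"
```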
Each test is run in a separate process, so after the device is reset PyTorch will go through its normal initialization for the next test.
You're welcome to see if you can figure out a way to get it to work. But in the meantime, let's not hold up a useful feature over a broken unit test that isn't even related to the new feature.
OK! Let's disable the test for now. What else is missing to finish this PR?
It's all ready for review.
Great! I'll look at it.
@RaulPPelaez how do you handle kernel errors in your code? We need some way to handle this.
AFAIK there is no clean way to assert in CUDA. As you mentioned, a device-side assert leaves the CUDA context in an unusable state.

What I normally end up doing is to keep some `errorState` array/value in device (or managed) memory. A thread in a kernel that encounters an error atomically writes to this `errorState` and returns as fast as possible. Then you delay checking this value for errors as long as possible. If you at least have a record of the error state, the user can query it manually with some kind of `checkErrorState()` when they notice results are incorrect (unless the code just crashes, that is). I have never found a clean way to do this without requiring some kind of synchronization (like a device-to-host copy or a stream sync).

If you think about it, this is the way errors work in CUDA: you need to manually synchronize to query the current error state, e.g. `auto err = cudaDeviceSynchronize();`. So if they have not figured out a better way...
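The flag-and-deferred-check pattern can be sketched with a small host-side analogue, where worker threads stand in for CUDA threads and a shared one-element list stands in for an `errorState` value in device/managed memory (all names here are illustrative):

```python
import threading

# Hypothetical error codes; OK means no error has been recorded.
OK, TOO_MANY_NEIGHBORS = 0, 1
error_state = [OK]  # stands in for a flag in device/managed memory

def kernel_thread(num_found, max_neighbors):
    # On error, record the condition and return early instead of
    # asserting (a device-side assert would poison the CUDA context).
    if num_found > max_neighbors:
        error_state[0] = TOO_MANY_NEIGHBORS
        return
    # ...normal neighbor-list work would go here...

def check_error_state():
    # Deferred check, analogous to syncing the stream and reading the
    # flag back on the host.
    if error_state[0] == TOO_MANY_NEIGHBORS:
        raise RuntimeError("too many neighbors found")

threads = [threading.Thread(target=kernel_thread, args=(n, 32))
           for n in (10, 50, 20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the join (the analogue of a synchronization), calling `check_error_state()` raises, because one worker exceeded the capacity of 32.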
> Any suggestions about what to do with `test_too_many_neighbors()`? As far as I can tell `torch.cuda` doesn't provide any way to reset the device. Once an assert has been triggered, there's no way to clear it and any further CUDA operation in that process will fail. The obvious solution is not to run that test on CUDA.
What is the intended way of using this functionality? A priori one does not know the total number of pairs, right? I understand it is required, or at least useful, to control the maximum number of neighbors per particle from outside, but how does one use it in practice? In the past I have done things like: set a maximum of 32 neighbors; if building fails because that is too low, increase it by 32 until it no longer fails.

If something like that is the case here, an extra parameter could be passed to choose whether or not to synchronize and check for a `tooManyNeighbours` error flag, so the maximum number of neighbors can be found as a precomputation. When constructing the CUDA graph this check would be omitted.
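The grow-until-it-fits strategy described above can be sketched as follows; `build_fn` and its return convention are hypothetical stand-ins for the real build call plus a flag check after synchronization:

```python
def build_with_retries(build_fn, start=32, step=32, limit=1024):
    """Grow the neighbor capacity until the build no longer overflows.

    build_fn(max_neighbors) is assumed to return (result, overflowed),
    where overflowed reflects a tooManyNeighbours-style flag read back
    after a synchronization. Everything here is a hypothetical sketch.
    """
    max_neighbors = start
    while max_neighbors <= limit:
        result, overflowed = build_fn(max_neighbors)
        if not overflowed:
            return result, max_neighbors
        max_neighbors += step
    raise RuntimeError(f"no successful build up to max_neighbors={limit}")

# Example: a fake builder that overflows whenever capacity < 70.
pairs, capacity = build_with_retries(lambda m: ("pairs", m < 70))
# capacity == 96 (32 and 64 overflow, 96 succeeds)
```

Once a working capacity has been found this way, the synchronizing check can be skipped on subsequent builds, which is what makes the approach compatible with CUDA graph capture.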
> What I normally end up doing is to keep some `errorState` array/value in device (or managed) memory.
That sounds like a good approach.
Let's merge this now and add error checking along those lines in a separate PR. That's going to require significant design to figure out an efficient mechanism for the error reporting.
@peastman let's merge this!
@RaulPPelaez could you open a dedicated issue to discuss and design the error check?
I've implemented the CPU version but not the CUDA version so far. Please take a look and see if the API and implementation look OK.