openmm / NNPOps

High-performance operations for neural network potentials

Efficient error reporting in CUDA #79

Closed RaulPPelaez closed 1 year ago

RaulPPelaez commented 1 year ago

We need to design and integrate an efficient way to detect an error state in a CUDA kernel and capture it on the CPU; in particular, detecting when some particle has more neighbours than the maximum allowed. This would replace the assertion here: https://github.com/openmm/NNPOps/blob/8b2d427888e548dbb655b6233060e3e8038e92c2/src/pytorch/neighbors/getNeighborPairsCUDA.cu#L67, which currently leaves the CUDA context in an invalid state, requiring a full context reset.

Additionally, the strategy should be compatible with CUDA graphs. This is related to PR https://github.com/openmm/NNPOps/pull/70

The main difficulty here is that there is no way to communicate information between a kernel and the CPU that does not involve a synchronization barrier and a memory copy.

I think we should approach this similarly to native CUDA error reporting: build an error-checking function into the interface that is allowed to synchronize and memcpy.

The class building the list, here https://github.com/openmm/NNPOps/blob/8b2d427888e548dbb655b6233060e3e8038e92c2/src/pytorch/neighbors/getNeighborPairsCUDA.cu#L100, could own a device array/value storing an error state (maybe an enum, or a simple integer), and the function building the neighbour list would atomically set this error state instead of triggering the assertion above.

Then, checking this error state on the CPU should be delayed as much as possible. For instance, before constructing a CUDA graph, a series of error-checking calls to
https://github.com/openmm/NNPOps/blob/8b2d427888e548dbb655b6233060e3e8038e92c2/src/pytorch/neighbors/getNeighborPairsCUDA.cu#L102-L106 with increasing max_num_neighbours could be made to determine an upper bound for it. The graph could then be constructed in such a way that this error state is no longer automatically checked.

This of course has the downside that errors would go silent during a simulation, with the code crashing in an uncontrolled way.
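A minimal sketch of what this flag-based scheme could look like (the `ErrorCode` enum, kernel, and `checkError` function are illustrative, not the actual NNPOps code):

```cuda
// Possible error states owned by the class that builds the list.
enum class ErrorCode : int { NoError = 0, TooManyNeighbors = 1 };

__global__ void buildNeighborsKernel(/* ... */ const int* numNeighbors,
                                     int maxNumNeighbors, int* errorFlag) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    // ... neighbor counting for particle i ...
    if (numNeighbors[i] > maxNumNeighbors) {
        // Instead of assert(), which invalidates the CUDA context,
        // atomically record the error and bail out of this thread.
        atomicExch(errorFlag, static_cast<int>(ErrorCode::TooManyNeighbors));
        return;
    }
    // ... write pairs ...
}

// Host side: check lazily, e.g. only before capturing a CUDA graph.
ErrorCode checkError(const int* d_errorFlag, cudaStream_t stream) {
    int flag;
    // This copy forces a synchronization, so call it as rarely as possible.
    cudaMemcpyAsync(&flag, d_errorFlag, sizeof(int),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    return static_cast<ErrorCode>(flag);
}
```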

RaulPPelaez commented 1 year ago

What is the intended usage of this function https://github.com/openmm/NNPOps/blob/8b2d427888e548dbb655b6233060e3e8038e92c2/src/pytorch/neighbors/getNeighborPairs.py#L5 when the maximum number of neighbours is not known beforehand?

Is the user expected to find out this number outside of the interface? be conservative?

I would expect the function to autocompute it by default or provide some way to inform and try again quickly if the provided max_num_neighbors is not enough.

peastman commented 1 year ago

I believe this is the most efficient way of communicating an error state back to the host.

* Allocate the flag in host memory with `cudaHostAlloc()`.

* Clear the flag before launching the kernel.

* If an error occurs, the kernel sets the flag.

* On the host, record an event after launching the kernel.

* To check for an error on the host, wait on the event and then check the flag.

For efficiency, it's best to delay that last step as long as possible. If you can launch one or more other kernels before waiting on the event, that will allow the GPU to keep working.
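A sketch of this pinned-flag-plus-event pattern (all names are illustrative; the kernel launch is elided since the kernel is defined elsewhere):

```cuda
#include <stdexcept>

void launchWithErrorCheck(cudaStream_t stream, int blocks, int threads) {
    int* h_errorFlag;
    // Mapped pinned memory: with unified addressing, the same pointer is
    // valid on both host and device.
    cudaHostAlloc(&h_errorFlag, sizeof(int), cudaHostAllocMapped);
    cudaEvent_t errorEvent;
    cudaEventCreateWithFlags(&errorEvent, cudaEventDisableTiming);

    *h_errorFlag = 0;  // clear the flag before launching the kernel
    // buildNeighborsKernel<<<blocks, threads, 0, stream>>>(..., h_errorFlag);
    cudaEventRecord(errorEvent, stream);  // record an event after the launch

    // ... launch other independent kernels here so the GPU keeps working ...

    cudaEventSynchronize(errorEvent);  // delayed as long as possible
    if (*h_errorFlag != 0)
        throw std::runtime_error("some particle has too many neighbours");
}
```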

peastman commented 1 year ago

Is the user expected to find out this number outside of the interface? be conservative?

It expects you to just know it, which really isn't ideal. This is inherent in the neighbor list format it uses: a 2D tensor where one axis corresponds to the neighbors of each atom. It would be great if we could support a second neighbor list format that doesn't require the maximum neighbors to be fixed.

RaulPPelaez commented 1 year ago

Sounds good.

I believe this is the most efficient way of communicating an error state back to the host.

* Allocate the flag in host memory with `cudaHostAlloc()`.

* Clear the flag before launching the kernel.

* If an error occurs, the kernel sets the flag.

If the flag is in managed memory, setting it on the host before the kernel launch should not incur a memcpy or a sync if the kernel does not access the flag (no error), right?

* On the host, record an event after launching the kernel.

* To check for an error on the host, wait on the event and then check the flag.

For efficiency, it's best to delay that last step as long as possible. If you can launch one or more other kernels before waiting on the event, that will allow the GPU to keep working.

There is then another question: when is the event synced?

  • At some point before leaving the function?
  • As the first step the next time the function is called?
  • The user must explicitly request the error state?
RaulPPelaez commented 1 year ago

Is the user expected to find out this number outside of the interface? be conservative?

It expects you to just know it, which really isn't ideal. This is inherent in the neighbor list format it uses: a 2D tensor where one axis corresponds to the neighbors of each atom. It would be great if we could support a second neighbor list format that doesn't require the maximum neighbors to be fixed.

Is it an actual requirement to know the shape of the tensor before calling? If it is not, the getNeighborPairs function could just allocate whatever it needs. For instance, max_num_neighs=-2 could mean: figure out how many you need. In that case we could just do this internally:

```cpp
while (!tryToBuildList(..., max_num_neighs))
    max_num_neighs += 32;
```
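Tying this to the error flag discussed above, a hypothetical tryToBuildList could look roughly like this (all names illustrative):

```cuda
// Hypothetical helper: launches the build kernel and reports, via the
// device error flag, whether max_num_neighs was large enough.
bool tryToBuildList(const float3* pos, int N, float cutoff,
                    int max_num_neighs, int* d_errorFlag,
                    cudaStream_t stream) {
    cudaMemsetAsync(d_errorFlag, 0, sizeof(int), stream);  // clear the flag
    // buildNeighborsKernel<<<..., 0, stream>>>(pos, N, cutoff,
    //                                          max_num_neighs, d_errorFlag);
    int flag;
    cudaMemcpyAsync(&flag, d_errorFlag, sizeof(int),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);  // a sync is unavoidable on this path
    return flag == 0;               // false means: grow capacity and retry
}
```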
RaulPPelaez commented 1 year ago

On the other hand, if the format can change wildly we could try a cell list, which is really fast to construct. To traverse it, instead of having a "private" nlist for each particle, you get a list of particles in each cell and go over the 27 neighbouring cells to visit your neighbours. This eliminates the need for max_num_neighs. The cell list is stored in three arrays, one of size N and two of size ncells=int(L/rc)^3, all of which are known in advance. The code gets messier, though.
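A sketch of the traversal under that layout (particles are sorted by cell, `sortedIndex` is the size-N array, `cellStart`/`cellEnd` are the two size-ncells arrays; all names are illustrative and positions are assumed folded into the primary box):

```cuda
// Map (possibly out-of-range) cell coordinates to a flat cell index,
// wrapping periodically at the box boundaries.
__device__ int cellIndex(int cx, int cy, int cz, int3 ncells) {
    cx = (cx + ncells.x) % ncells.x;
    cy = (cy + ncells.y) % ncells.y;
    cz = (cz + ncells.z) % ncells.z;
    return cx + ncells.x * (cy + ncells.y * cz);
}

__global__ void forEachNeighbor(const float3* pos, const int* sortedIndex,
                                const int* cellStart, const int* cellEnd,
                                int3 ncells, float rc, int N) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    // Cell of particle i (cell side equals the cutoff rc).
    const int3 ci = make_int3(pos[i].x / rc, pos[i].y / rc, pos[i].z / rc);
    for (int dz = -1; dz <= 1; ++dz)
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {  // the 27 neighbouring cells
                const int c = cellIndex(ci.x + dx, ci.y + dy, ci.z + dz, ncells);
                // sortedIndex[cellStart[c]..cellEnd[c]) lists cell c's particles.
                for (int k = cellStart[c]; k < cellEnd[c]; ++k) {
                    const int j = sortedIndex[k];
                    if (j == i) continue;
                    const float3 d = make_float3(pos[i].x - pos[j].x,
                                                 pos[i].y - pos[j].y,
                                                 pos[i].z - pos[j].z);
                    if (d.x * d.x + d.y * d.y + d.z * d.z < rc * rc) {
                        // handle pair (i, j); no max_num_neighs needed
                    }
                }
            }
}
```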

peastman commented 1 year ago

The user must explicitly request the error state?

Users rarely check error states. If they have to take extra steps to see if there was an error, it's not a lot better than no error checking at all.

RaulPPelaez commented 1 year ago

#80

RaulPPelaez commented 1 year ago

Closed as completed