Closed: RaulPPelaez closed this pull request 1 year ago.
The Autograd class is not allowed to hold any state, right? As of now, the forward function must create and delete an event every time it is called. Any ideas on how to take event creation elsewhere?
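For illustration only, one way to hoist the per-call event creation would be a lazily created, process-lifetime event (a sketch, not what this PR ends up doing; note that a single shared event is not safe if several streams or threads call forward concurrently):

```cpp
#include <cuda_runtime.h>

// Illustrative sketch only: create the event lazily once and reuse it across
// forward() calls. Caveat: one shared event is not safe if several streams
// or threads call forward() concurrently.
static cudaEvent_t reusableEvent() {
    static cudaEvent_t event = [] {
        cudaEvent_t e;
        cudaEventCreateWithFlags(&e, cudaEventDisableTiming);
        return e;
    }();
    return event;
}
```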
Can't one add a host node for that?
On Mon, Jan 16, 2023 at 17:38, Raimondas Galvelis commented on this pull request:
In src/pytorch/neighbors/getNeighborPairsCUDA.cu https://github.com/openmm/NNPOps/pull/80#discussion_r1071426626:
```cpp
@@ -151,6 +166,13 @@ public:
        get_accessor<scalar_t, 2>(deltas),
        get_accessor<scalar_t, 1>(distances),
        get_accessor<scalar_t, 2>(box_vectors));
    cudaEventRecord(event, stream);
    cudaEventSynchronize(event);
    // Check the error flag
    TORCH_CHECK(tooManyNeighborsErrorFlag == 0, "Some particle has too many neighbours, found " +
```
This won't work with CUDA Graphs, because the CPU code will be executed only during a graph creation.
What about moving the host side checking of the flag into the backward pass? There will usually be a lot of other kernels launched between the two, so waiting on the event at that point won't hurt performance. The disadvantage is that the check will be missed if the backward pass is skipped, for example if someone computes only energy but not forces.
What if a user runs just the forward pass?
Right, that's the disadvantage. It would give people error checking in most of the common use cases without significantly hurting performance. But there exist use cases where error checking would be skipped.
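To make the suggestion concrete, here is a minimal standalone sketch of that deferred-check pattern in plain CUDA, with a stand-in kernel and hypothetical names; in the real code the event would be recorded in forward() and waited on in backward():

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__managed__ int tooManyNeighborsErrorFlag = 0;

// Stand-in for the real neighbor kernel: threads that would overflow the
// storage record the error atomically instead of aborting the launch.
__global__ void neighborKernel(int maxNeighbors) {
    if ((int)threadIdx.x >= maxNeighbors)
        atomicExch(&tooManyNeighborsErrorFlag, 1);
}

int main() {
    cudaEvent_t event;
    cudaEventCreateWithFlags(&event, cudaEventDisableTiming);

    // "forward": launch and record, but do not wait.
    neighborKernel<<<1, 64>>>(32);
    cudaEventRecord(event, 0);

    // ... in the real code, many other kernels are enqueued here ...

    // "backward": the wait is cheap because work was queued in between.
    cudaEventSynchronize(event);
    if (tooManyNeighborsErrorFlag != 0)
        fprintf(stderr, "Some particle has too many neighbours\n");
    cudaEventDestroy(event);
    return 0;
}
```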
Since this is going to be part of a graph the check should go after launching the graph, and the entity doing so should be the one checking for errors in its execution. What function is going to be building and launching the graph?
An exception from inside a CUDA graph is problematic. One solution is triggering a cudaError during graph execution, for instance setting cudaErrorInvalidValue by calling cudaMemcpy(NULL, NULL, 0, 0). This would surface the next time the CUDA error state is checked after launching the graph.
OTOH we could report the error in two ways: a direct call to forward can check the host flag after the launch and throw an exception; a cudaError can be raised if the call to the forward kernel is happening as part of a CUDA graph execution.
EDIT: One cannot call any CUDA API function from inside a callback, so I do not know how to raise a cudaError.
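Whatever mechanism ends up raising the error, the consumer side would look roughly like this sketch (graph construction omitted; launchAndCheck is a made-up name): the code launching the graph synchronizes and inspects the error state.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Sketch of the consumer side: whoever launches the instantiated graph is
// the one positioned to observe an error raised during its execution.
void launchAndCheck(cudaGraphExec_t graphExec, cudaStream_t stream) {
    cudaGraphLaunch(graphExec, stream);
    // Synchronizing surfaces any asynchronous error raised inside the graph.
    cudaError_t err = cudaStreamSynchronize(stream);
    if (err != cudaSuccess)
        fprintf(stderr, "Graph execution failed: %s\n",
                cudaGetErrorString(err));
}
```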
This commit introduces a new optional bool flag, check_errors, to getNeighborPairs. The default (False) will check for errors and throw asynchronously, printing a meaningful message but crashing. Passing True will instead synchronize and raise the error synchronously, such that the exception can be handled. The same message is printed in both cases.
In the case of a CUDA graph the False option is forced; the error is thrown asynchronously, crashing the code with a meaningful message.
If False is chosen, error checking is virtually free. There is no synchronization penalty, and since the error flag lives in managed memory there should be no memory-transfer footprint at all if the error did not happen.
> check_errors = False):

@RaulPPelaez's solution with cudaLaunchHostFunc is very good.
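For reference, a minimal sketch of what a cudaLaunchHostFunc-based check could look like (errorFlag and the function names are illustrative; note the restriction mentioned above that the host function must not call the CUDA API):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__managed__ int errorFlag = 0; // assumed to be set atomically by the kernel

// Host function enqueued on the stream. As noted above, CUDA API calls are
// not allowed in here, but reading a managed flag is fine.
static void checkFlag(void* /*userData*/) {
    if (errorFlag != 0)
        fprintf(stderr, "Some particle has too many neighbours\n");
}

void enqueueCheck(cudaStream_t stream) {
    // Runs after all previously enqueued work on the stream; under stream
    // capture this turns into a host node of the resulting graph.
    cudaLaunchHostFunc(stream, checkFlag, nullptr);
}
```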
> check_errors = True):

I think getNeighborPairs should just return the number of detected pairs, and it is up to the user to check if there was enough space for the pairs. This eliminates the cons of the regular mode by passing the burden of error checking to the users. The current approach with synchronization is only good for debugging, not production. The same effect can be achieved with the debug mode of PyTorch, where it runs the kernels synchronously.
> The same effect can be achieved with the debug mode of PyTorch, where it runs the kernels synchronously.

While this would make the error synchronous, it would be non-catchable AFAIK. Do you know if PyTorch defines something that can be detected on the C++ side when using debug mode? That way I could make the exception synchronous AND catchable in debug mode only.
Also, I believe the meaning of check_errors should be the opposite of what you wrote.
In both cases the kernel could write a special value (say -1 or NaN) to, for instance, neighbors[1][max_pairs-1] when some particle has too many neighbors. At least with the current format and strategy I think this can be implemented without overhead when there is no error.
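A sketch of that sentinel idea (names and indexing are illustrative, not the PR's actual kernel):

```cpp
// Illustrative only: on overflow the kernel stamps a sentinel into the last
// slot of the output instead of using a separate error flag.
__global__ void findPairsWithSentinel(int32_t* neighbors, int maxPairs,
                                      int* pairCounter) {
    // Stand-in for the real pair search: each thread claims one output slot.
    int pairIdx = atomicAdd(pairCounter, 1);
    if (pairIdx >= maxPairs) {
        neighbors[maxPairs - 1] = -1; // sentinel: "too many pairs"
        return;                       // the pair that does not fit is dropped
    }
    neighbors[pairIdx] = pairIdx;     // stand-in for the real pair data
}
```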
I followed @raimis's suggestion and added a bit of mine, ending up with the following. Three behaviors are possible:
- check_errors=False, sync_exceptions=False (the default): the error is thrown asynchronously; the code crashes with a meaningful message, and the call remains CUDA-graph compatible.
- check_errors=False, sync_exceptions=True: the call synchronizes and raises a catchable exception, at the expense of performance.
- check_errors=True: the call never throws due to exceeding neighbors; instead it returns the number of pairs found, and sync_exceptions is ignored.
Pros:
- Default behavior is the same as without this PR, but CUDA-graph compatible.
- The user can choose to make exceptions recoverable.
- Users can choose to manage errors themselves and keep things CUDA-graph compatible.

Cons:
- There are two parameters dealing with just error reporting.
- No unit test can be written for the default behavior, since it crashes the process in a way pytest cannot recover from.
The current unit test for this functionality might help understand how it works:
```python
def test_too_many_neighbors(device, dtype):
    if not pt.cuda.is_available() and device == 'cuda':
        pytest.skip('No GPU')
    # 4 points result in 6 pairs, but there is storage for just 4.
    positions = pt.zeros((4, 3), device=device, dtype=dtype)
    with pytest.raises(RuntimeError):
        # check_errors=False will throw due to exceeding neighbors.
        # sync_exceptions=True makes this exception catchable at the
        # expense of performance (even when no error occurred).
        getNeighborPairs(positions, cutoff=1, max_num_neighbors=1,
                         check_errors=False, sync_exceptions=True)
        pt.cuda.synchronize()
    # check_errors=True will never throw due to exceeding neighbors,
    # but will return the number of pairs found.
    # sync_exceptions is ignored in this case.
    neighbors, deltas, distances, number_found_pairs = getNeighborPairs(
        positions, cutoff=1, max_num_neighbors=1, check_errors=True)
    assert number_found_pairs == 6
```
I made some changes to make getNeighborPairs CUDA-graph compatible, now one can do something like:
```python
device = 'cuda'
dtype = pt.float32
num_atoms = 100
positions = 10 * pt.randn((num_atoms, 3), device=device, dtype=dtype)
cutoff = 5
graph = pt.cuda.CUDAGraph()
with pt.cuda.graph(graph):
    neighbors, deltas, distances = getNeighborPairs(
        positions, cutoff=cutoff, max_num_neighbors=num_atoms * num_atoms)
graph.replay()
pt.cuda.synchronize()
```
@peastman @raimis , this is ready for review.
BTW, performance is looking good for the CUDA graph version! This is the time it takes to call getNeighborPairs for a random cloud of particles with periodic boundary conditions, at a density such that the number of neighbors is around 128.
You may review again @raimis
@peastman do you have any comments?
Does it really make sense to have separate arguments for check_errors and sync_exceptions? Is there ever a situation where it makes sense to pass different values for them? Error checking doesn't work correctly if sync_exceptions is false, and it's completely ignored if error checking is disabled.
We decided to change the interface so that the number of pairs is always returned, meaning the user can now easily check if the maximum number of pairs is exceeded. This changes the restrictions of the original problem a bit.
We wanted the user to be informed (via an exception, for instance) in the case of the number of pairs found being larger than the maximum allowed.
Alas, informing the user in a recoverable way requires synchronizing (slow and incompatible with CUDA graphs), so I believe it is sensible that this functionality is guarded behind a flag.
Best we can do AFAIK is let the user choose between:
- a synchronous, catchable exception, at the cost of synchronizing (and losing CUDA-graph compatibility);
- no exception at all, leaving it to the user to check the returned num_pairs (fast and CUDA-graph compatible).

Right now we can do that with only the check_errors flag. If you guys are OK with it, then I will remove sync_exceptions.
The only option I see, if we really do not want the code to crash, is to let the results be silently wrong when check_errors is False, passing onto the user the responsibility to check the num_pairs return value.
In my opinion this function should not let the code progress further if num_pairs > max_num_neighbors. The user is probably not going to bother checking, and the danger of being silently wrong is not tolerable. If the user wants to be able to do something about the exception (e.g., increase the maximum number of neighbors) they can simply pass check_errors=True. Since recovering must be a conscious choice, I believe this is a good behavior.
For me, the ideal use case for this function would be something as follows:
```python
import torch as pt
from NNPOps.neighbors import getNeighborPairs

positions = pt.tensor(...)
max_num_neighbors = 1
# Find the maximum number of neighbors
while True:
    try:
        getNeighborPairs(positions, cutoff=3.0,
                         max_num_neighbors=max_num_neighbors, check_errors=True)
    except RuntimeError:
        max_num_neighbors += 32
        continue
    break
# Fast and CUDA-graph compatible calls that should never raise,
# but will crash if an error occurs
neigh, deltas, distances, num_pairs = getNeighborPairs(
    positions, cutoff=3.0, max_num_neighbors=max_num_neighbors)
```
But we can also make it be something like this:
```python
import torch as pt
from NNPOps.neighbors import getNeighborPairs

positions = pt.tensor(...)
max_num_neighbors = 1
# Find the maximum number of neighbors. This call will never raise,
# but can be silently wrong.
neigh, deltas, distances, num_pairs = getNeighborPairs(
    positions, cutoff=3.0, max_num_neighbors=max_num_neighbors)
if num_pairs > max_num_neighbors:
    max_num_neighbors = num_pairs + 32
# This will also never raise and can also be silently wrong. This call and
# the one above are fast and CUDA-graph compatible.
neigh, deltas, distances, num_pairs = getNeighborPairs(
    positions, cutoff=3.0, max_num_neighbors=max_num_neighbors)
# This will raise if necessary, but is not CUDA-graph compatible.
neigh, deltas, distances, num_pairs = getNeighborPairs(
    positions, cutoff=3.0, max_num_neighbors=max_num_neighbors,
    check_errors=True)
```
Let me know what you think!
I don't think a crash is ever a good way of reporting an error. I would vote for combining the flags so you have a single option.
- check_errors=True: slower, but you get an exception if it exceeds the maximum number of neighbors.
- check_errors=False: faster, but it's up to you to check the returned value to see if some neighbors were missed.

Then check_errors=True would be the default.

In your proposal only check_errors=False would be compatible with CUDA graphs, which is OK with me. OTOH it would be really inconvenient for the user to both put getNeighborPairs into a larger CUDA graph and ensure correctness. It is hard to check the number of pairs without synchronizing, which was part of the functionality added by this PR.
If you agree that the responsibility to check errors in CUDA graph mode should fall onto the user I will go ahead and implement @peastman 's proposal. cc @raimis
> Then check_errors=True would be the default.

check_errors=False would be the default.

> In your proposal only check_errors=False would be compatible with CUDA graphs

Correct.
This simplifies the logic greatly, no kernel-side atomic error flag is required and the graph can be constructed without requiring a host node. I removed the sync_exceptions flag and followed @peastman 's proposal. The function never crashes the program. It is compatible with CUDA graphs by default, but can be silently wrong. The user must check that the returned num_pairs is lower than the provided max_num_neighbors. If check_errors=True, the function is not CUDA graph compatible because it requires copying the number of neighbors to host, but will raise if necessary.
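To make the final behavior concrete, a sketch of what the check_errors path boils down to (the helper and tensor names are made up, not the actual source):

```cpp
#include <torch/extension.h>

// Sketch with made-up names: the kernel accumulates the pair count in a
// one-element device tensor, and this helper runs after the launch.
static void maybeCheckErrors(const torch::Tensor& num_pairs_tensor,
                             int64_t max_num_pairs, bool check_errors) {
    if (!check_errors)
        return; // nothing synchronizes; the caller inspects num_pairs itself
    // item() copies the counter to the host and synchronizes, which is why
    // this path is not CUDA-graph compatible.
    const int64_t num_pairs = num_pairs_tensor.item<int64_t>();
    TORCH_CHECK(num_pairs <= max_num_pairs, "Too many neighbor pairs found (",
                num_pairs, " > ", max_num_pairs, ")");
}
```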
@peastman @raimis this is ready for review again.
This is ready for review again.
Ping @peastman
This addresses all the issues I raised. Looks good to me now!
@RaulPPelaez I suppose, this is ready to merge?
This is ready for merge.
I have added a flag in managed memory. It will be written atomically if too many neighbors are found for some particle. It is checked using an event just after execution of the kernel.
All tests are passing (even the too_many_neighbors one on the GPU).
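For reference, a sketch of that managed-memory flag pattern (illustrative names; the real member names may differ):

```cpp
#include <cuda_runtime.h>
#include <stdexcept>

static int32_t* tooManyNeighborsErrorFlag; // allocated in managed memory

void allocateErrorFlag() {
    cudaMallocManaged(&tooManyNeighborsErrorFlag, sizeof(int32_t));
    *tooManyNeighborsErrorFlag = 0;
}

// The kernel reports overflow with an atomic write, e.g.
//   atomicExch(tooManyNeighborsErrorFlag, 1);

void checkAfterKernel(cudaStream_t stream, cudaEvent_t event) {
    // The neighbor kernel has just been launched on `stream`.
    cudaEventRecord(event, stream);
    cudaEventSynchronize(event); // wait only for the work recorded so far
    // If the kernel never wrote the flag, the managed page can stay
    // host-resident, so this read costs essentially nothing.
    if (*tooManyNeighborsErrorFlag != 0)
        throw std::runtime_error("Some particle has too many neighbours");
}
```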