AndyUB opened this issue 1 day ago
Hi @AndyUB, Thanks for reporting!
This does sound like a bug in the overlap functionality. However, when I reran the test (python/ray/dag/tests/experimental/test_torch_tensor_dag.py::test_torch_tensor_exceptions[True-True-True-ray_start_regular0]) 100 times on a node with an L4 GPU, I wasn't able to reproduce it even once.
Our CI is also a bit flaky right now (fix in progress), so it's not easy to check how often this has been failing.
In your experience, how often does this fail? Is there any way to make it more reproducible? You mentioned "An interesting finding is that when the code executes faster, this test always fails" — how did you make the code run faster? On a better GPU?
btw, I think the issue is likely that in _compute() we only sync on the GPU recv stream, but not on the CPU:
def _compute(
    self,
    overlap_gpu_communication: bool,
    class_handle,
) -> bool:
    input_data = self.reset_and_wait_intermediate_future()
    ...

def reset_and_wait_intermediate_future(self) -> Any:
    future = self._intermediate_future
    self._intermediate_future = None
    return future.wait()
class GPUFuture(DAGOperationFuture[Any]):
    def wait(self) -> Any:
        """
        Wait for the future on the current CUDA stream and return the result from
        the GPU operation. This operation does not block CPU.
        """
        import cupy as cp

        current_stream = cp.cuda.get_current_stream()
        current_stream.wait_event(self._event)
        return self._buf
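For intuition, here is a minimal CuPy sketch (not Ray code; the stream and event names are illustrative) of the difference between stream-side and CPU-side synchronization. wait_event only orders GPU work that is queued on the current stream afterwards, while the host thread keeps running, so a CPU-side read of the buffer is not protected unless the event is also synchronized on the host:

```python
import cupy as cp

recv_stream = cp.cuda.Stream(non_blocking=True)
done_recv = cp.cuda.Event()

with recv_stream:
    # ... the NCCL recv would be queued here ...
    done_recv.record()  # recorded on recv_stream

compute_stream = cp.cuda.get_current_stream()

# Stream-side sync (what GPUFuture.wait() does): only GPU work queued on
# compute_stream after this point waits for the recv; the CPU returns
# immediately and may read the buffer too early.
compute_stream.wait_event(done_recv)

# CPU-side sync: blocks the host thread until the recv event has completed.
done_recv.synchronize()
```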
And the receiver's _compute() operation runs the following method (TorchTensorWorker.recv), which directly retrieves the item, shape, and dtype from the GPU tensor without waiting:
class TorchTensorWorker:
    def recv(self, tensor):
        # Check that tensor got loaded to the correct device.
        assert tensor.device == self.device
        return (tensor[0].item(), tensor.shape, tensor.dtype)
To fix this issue, we will probably need to make the CPU synchronize on the recv stream in _compute().
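A rough sketch of what that could look like (purely illustrative; the class name is hypothetical, and _buf/_event mirror the attribute names in the GPUFuture snippet above):

```python
from typing import Any

import cupy as cp


class SynchronizingGPUFuture:
    """Hypothetical variant of GPUFuture whose wait() also blocks the CPU."""

    def __init__(self, buf: Any, event: cp.cuda.Event):
        self._buf = buf
        self._event = event  # recorded on the recv stream

    def wait(self) -> Any:
        current_stream = cp.cuda.get_current_stream()
        # GPU-side ordering: later kernels on this stream wait for the recv.
        current_stream.wait_event(self._event)
        # CPU-side ordering: block the host until the recv has completed, so
        # host reads such as tensor[0].item() cannot see the stale buffer.
        self._event.synchronize()
        return self._buf
```

Whether the CPU-side synchronize lives in the future or in _compute() is a design choice; keeping it in _compute() would leave GPUFuture non-blocking for consumers that stay on the GPU.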
cc: @stephanie-wang @rkooo567
Re: the test repro, you could try inserting a sleep on the recv stream before queuing the recv.
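Something along these lines might work for that (a hypothetical helper, not part of the test; torch.cuda._sleep is a private PyTorch utility that enqueues a busy-wait kernel on the current stream):

```python
import torch


def delay_recv_stream(recv_stream: torch.cuda.Stream, cycles: int = 10**9) -> None:
    """Queue a long spin kernel on the recv stream so the NCCL recv completes
    late, widening the window in which a premature CPU-side read can observe
    the uninitialized receive buffer."""
    with torch.cuda.stream(recv_stream):
        torch.cuda._sleep(cycles)
        # ... the NCCL recv would then be queued on recv_stream ...
```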
> And the receiver's _compute() operation runs the following method (TorchTensorWorker.recv), which directly retrieves the item, shape, and dtype from the GPU tensor without waiting. To fix this issue, we will probably need to make the CPU synchronize on the recv stream in _compute().
Not sure that's the whole story: the read of the item requires GPU->CPU movement and is supposed to get queued on the compute stream after syncing on the recv stream. It would be good to check that the read of the item is actually happening on the expected stream.
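One way to check that (a debugging sketch only, not test code; the class name is hypothetical) is to log the current stream inside TorchTensorWorker.recv and compare the value read before and after a full device synchronize. If the two values differ, the first read likely raced ahead of the recv:

```python
import torch


class DebugTorchTensorWorker:
    """TorchTensorWorker.recv with extra logging to diagnose stream ordering."""

    def __init__(self, device: torch.device):
        self.device = device

    def recv(self, tensor):
        print("recv running on stream:", torch.cuda.current_stream(tensor.device))
        before_sync = tensor[0].item()   # may read stale data
        torch.cuda.synchronize(tensor.device)
        after_sync = tensor[0].item()    # the recv has definitely landed by now
        print("item before/after device sync:", before_sync, after_sync)

        assert tensor.device == self.device
        return (tensor[0].item(), tensor.shape, tensor.dtype)
```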
What happened + What you expected to happen
The test
ray/python/ray/dag/tests/experimental/test_torch_tensor_dag.py::test_torch_tensor_exceptions[static_shape=True-direct_return=True-overlap_gpu_communication=True]
fails locally. The bug is probably that the buffer allocated on the receiver side for the NCCL P2P send/recv is read before the actual data is sent. Currently the receiver's buffer is all zeros, so the output is all zeros. When I changed the allocation function to allocate torch.ones(...) * 100 instead, the actual output becomes [100, ..., 100].

An interesting finding is that when the code executes faster, this test always fails; but when I added a ton of print statements for debugging, it runs more slowly and the test sometimes passes.
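For reference, the diagnostic amounts to poisoning the receiver-side buffer with a sentinel instead of zeros (a sketch only; the real allocation function lives in Ray's NCCL channel code and its signature will differ):

```python
import torch


def poisoned_alloc(shape, dtype, device):
    # Fill the receive buffer with a sentinel value so that a read racing
    # ahead of the NCCL recv shows up as 100s instead of silently matching
    # an all-zero buffer.
    return torch.ones(shape, dtype=dtype, device=device) * 100
```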
Since this test has overlap_gpu_communication=True, it is likely related to overlapping GPU communication with computation. My guess is that the actor reading the tensor did not properly wait for the recv event to finish.

I checked out the commit that most recently modified the test (#47586), as well as the current HEAD of the ray-project:master branch, and the test failed in both cases. Below is an example error message:
Versions / Dependencies
Newest version of Ray. Python: 3.9.
Reproduction script
https://github.com/ray-project/ray/blob/master/python/ray/dag/tests/experimental/test_torch_tensor_dag.py#L813
Issue Severity
High: It blocks me from completing my task.