alphaDev23 opened this issue 1 month ago · Status: Open
Hi @alphaDev23,
Based on the repro script, the owner of the futures is the driver process itself, which shouldn't exit. How often does this happen? Happy to do a live pair-debugging session on this.
It happens consistently. Out of ~39,000 task executions on my last run, it happened 130 times. I would not expect this to occur at all.
Please advise.
If this happens consistently, do you mind providing a repro that I can run and debug on my laptop?
Or you can upload the logs and I can take a look.
As mentioned in the OP, it's not practical to provide a repro. Which logs do you need?
Please advise.
Hi @alphaDev23, sorry for missing the previous comments. Could you upload the entire /tmp/ray/session_latest/logs/ directory?
What happened + What you expected to happen
See the RuntimeError below. It occurred while using a remote function declared with @ray.remote(num_gpus=0.4).
```
Plasma memory usage 0 MiB, 2 objects, 0.0% full, 0.0% needed
Objects consumed by Ray tasks: 82023 MiB.

======== Autoscaler status: 2024-07-28 12:01:05.073955 ========
Node status
---------------------------------------------------------------
Active:
 1 node_3015b0c729b65fdb137d57bc4ab935f0d615d4cce0e5b97dd3e8f590
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/96.0 CPU
 0.0/2.0 GPU
 0B/423.53GiB memory
 0B/185.51GiB object_store_memory

Demands:
 (no resource demands)
```
I would expect the object's owner not to have exited and the code not to have crashed. I hope attention can be given to this and similar issues; there appear to be numerous issues around failing to retrieve objects, which causes long-running processes (the reason users use Ray in the first place) to crash.
```
RuntimeError: Failed to retrieve object 81596b9f0eb02e80ffffffffffffffffffffffff0100000001000000.
To see information about where this ObjectRef was created in Python, set the
environment variable RAY_record_ref_creation_sites=1 during `ray start` and
`ray.init()`.

The object's owner has exited. This is the Python worker that first created the
ObjectRef via `.remote()` or `ray.put()`. Check cluster logs
(/tmp/ray/session_latest/logs/*01000000ffffffffffffffffffffffffffffffffffffffffffffffff*
at IP address 192.168.3.29) for more information about the Python worker failure.

  at get_objects (worker.py, line 873):
    raise value
  at get (worker.py, line 2656):
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  at wrapper (client_mode_hook.py, line 103):
    return func(*args, **kwargs)
  at auto_init_wrapper:
    return fn(*args, **kwargs)
```

```
IO Service Stats:

Global stats: 2341 total (1 active)
Queueing time: mean = 34.166 us, max = 608.662 us, min = -0.000 s, total = 79.982 ms
Execution time: mean = 503.312 us, total = 1.178 s
Event stats:
  ray::rpc::TaskInfoGcsService.grpc_client.AddTaskEventData - 780 total (0 active), Execution time: mean = 1.046 ms, total = 815.523 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
  CoreWorker.deadline_timer.flush_task_events - 780 total (1 active), Execution time: mean = 420.280 us, total = 327.818 ms, Queueing time: mean = 61.249 us, max = 608.662 us, min = -0.000 s, total = 47.774 ms
  ray::rpc::TaskInfoGcsService.grpc_client.AddTaskEventData.OnReplyReceived - 780 total (0 active), Execution time: mean = 43.884 us, total = 34.230 ms, Queueing time: mean = 41.272 us, max = 576.914 us, min = 12.741 us, total = 32.192 ms
  PeriodicalRunner.RunFnPeriodically - 1 total (0 active), Execution time: mean = 682.537 us, total = 682.537 us, Queueing time: mean = 15.881 us, max = 15.881 us, min = 15.881 us, total = 15.881 us
Other Stats:
  grpc_in_progress: 0
  current number of task status events in buffer: 0
  current number of profile events in buffer: 0
  current number of dropped task attempts tracked: 0
  total task events sent: 0.207569 MiB
  total number of task attempts sent: 970
  total number of task attempts dropped reported: 0
  total number of sent failure: 0
  num status task events dropped: 0
  num profile task events dropped: 0
...
```

```
./logs/python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_1.log:948:[2024-07-28 06:05:35,385 W 1 165] plasma_store_provider.cc:452: Objects f827e954f38e789affffffffffffffffffffffff0100000001000000, 81596b9f0eb02e80ffffffffffffffffffffffff0100000001000000 are still not local after 124s. If this message continues to print, ray.get() is likely hung. Please file an issue at https://github.com/ray-project/ray/issues/.
./logs/python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_1.log:949:[2024-07-28 06:05:36,395 W 1 165] plasma_store_provider.cc:452: Objects f827e954f38e789affffffffffffffffffffffff0100000001000000, 81596b9f0eb02e80ffffffffffffffffffffffff0100000001000000 are still not local after 125s. If this message continues to print, ray.get() is likely hung. Please file an issue at https://github.com/ray-project/ray/issues/.
./logs/python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_1.log:950:[2024-07-28 06:05:37,406 W 1 165] plasma_store_provider.cc:452: Objects f827e954f38e789affffffffffffffffffffffff0100000001000000, 81596b9f0eb02e80ffffffffffffffffffffffff0100000001000000 are still not local after 126s. If this message continues to print, ray.get() is likely hung. Please file an issue at https://github.com/ray-project/ray/issues/.
```
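The error message above suggests setting `RAY_record_ref_creation_sites=1` so that future errors include the Python call site where the ObjectRef was created. A minimal sketch of doing this from the driver; note the variable must be set before Ray starts, so on a multi-node cluster it should instead be exported in the environment before running `ray start` on each node:

```python
import os

# Must be set before ray.init() (and before `ray start` on cluster nodes)
# so that every Ray process records ObjectRef creation sites.
os.environ["RAY_record_ref_creation_sites"] = "1"

# import ray
# ray.init()  # start Ray only after the variable is set
```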
Versions / Dependencies
Ubuntu 22.04
ray, version 2.32.0
Python 3.11.7
Reproduction script
It is not practical to provide a reproduction script, as the issue is intermittent. The object owners should not be exiting. However, here is pseudocode:
```python
import ray

@ray.remote(num_gpus=0.4)
def remote_train_step():
    ...  # training code elided in the original report

futures = []
future = remote_train_step.remote()
futures.append(future)
result = ray.get(futures)
```
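As a possible workaround until the root cause is found, the driver could catch the retrieval failure and resubmit the task. This is only a sketch, not the reporter's code: it assumes the task is idempotent (safe to re-run), and it wraps the submit/get calls generically so the retry logic itself carries no Ray dependency. With Ray, `ray.exceptions.OwnerDiedError` (a subclass of `ObjectLostError`) is the exception that corresponds to "The object's owner has exited."

```python
def get_with_resubmit(submit, get, retries=3, exc_types=(Exception,)):
    """Submit a task and retrieve its result, resubmitting on failure.

    submit:    zero-argument callable that launches the task and returns a ref
    get:       callable that blocks on the ref and returns the result
    retries:   maximum number of submission attempts
    exc_types: exception types that trigger a resubmit (e.g. owner death)
    """
    last_err = None
    for _ in range(retries):
        ref = submit()
        try:
            return get(ref)
        except exc_types as err:
            last_err = err  # retrieval failed; resubmit the task
    raise last_err
```

Hypothetical usage with the task from the report, assuming `remote_train_step` is idempotent:

```python
# result = get_with_resubmit(
#     submit=remote_train_step.remote,
#     get=ray.get,
#     exc_types=(ray.exceptions.OwnerDiedError,),
# )
```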
Issue Severity
High: It blocks me from completing my task.