alphaDev23 opened this issue 1 month ago · Status: Open
Hi @alphaDev23,
Based on the repro script, the owner of the futures is the driver process itself, which shouldn't exit. How often does this happen? Happy to do a live pair-debugging session on this.
It happens consistently. Out of ~39,000 task executions on my last run, it happened 130 times. I would not expect this to occur at all.
Please advise.
If this happens consistently, do you mind providing a repro that I can run and debug on my laptop?
Or you can upload the logs and I can take a look.
As mentioned in the OP, it's not practical to provide a repro. Which logs do you need?
Please advise.
Hi @alphaDev23, sorry for missing the previous comments. Could you upload the entire /tmp/ray/session_latest/logs/ directory?
What happened + What you expected to happen
See the RuntimeError below. It occurred while using a remote function declared with @ray.remote(num_gpus=0.4).
```
Plasma memory usage 0 MiB, 2 objects, 0.0% full, 0.0% needed
Objects consumed by Ray tasks: 82023 MiB.

======== Autoscaler status: 2024-07-28 12:01:05.073955 ========
Node status
---------------------------------------------------------------
Active:
 1 node_3015b0c729b65fdb137d57bc4ab935f0d615d4cce0e5b97dd3e8f590
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/96.0 CPU
 0.0/2.0 GPU
 0B/423.53GiB memory
 0B/185.51GiB object_store_memory

Demands:
 (no resource demands)
```
I would expect the object's owner not to have exited and the code not to have crashed. I hope attention can be given to this and similar issues; there appear to be numerous issues around failing to retrieve objects, which causes long-running processes (the reason users use Ray in the first place) to crash.
```
RuntimeError: Failed to retrieve object 81596b9f0eb02e80ffffffffffffffffffffffff0100000001000000.
To see information about where this ObjectRef was created in Python, set the
environment variable RAY_record_ref_creation_sites=1 during `ray start` and
`ray.init()`.

The object's owner has exited. This is the Python worker that first created the
ObjectRef via `.remote()` or `ray.put()`. Check cluster logs
(/tmp/ray/session_latest/logs/*01000000ffffffffffffffffffffffffffffffffffffffffffffffff*
at IP address 192.168.3.29) for more information about the Python worker failure.

  at get_objects (worker.py, line 873):
    raise value
  at get (worker.py, line 2656):
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  at wrapper (client_mode_hook.py, line 103):
    return func(*args, **kwargs)
  at auto_init_wrapper:
    return fn(*args, **kwargs)
```

```
IO Service Stats:

Global stats: 2341 total (1 active)
Queueing time: mean = 34.166 us, max = 608.662 us, min = -0.000 s, total = 79.982 ms
Execution time: mean = 503.312 us, total = 1.178 s
Event stats:
  ray::rpc::TaskInfoGcsService.grpc_client.AddTaskEventData - 780 total (0 active), Execution time: mean = 1.046 ms, total = 815.523 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
  CoreWorker.deadline_timer.flush_task_events - 780 total (1 active), Execution time: mean = 420.280 us, total = 327.818 ms, Queueing time: mean = 61.249 us, max = 608.662 us, min = -0.000 s, total = 47.774 ms
  ray::rpc::TaskInfoGcsService.grpc_client.AddTaskEventData.OnReplyReceived - 780 total (0 active), Execution time: mean = 43.884 us, total = 34.230 ms, Queueing time: mean = 41.272 us, max = 576.914 us, min = 12.741 us, total = 32.192 ms
  PeriodicalRunner.RunFnPeriodically - 1 total (0 active), Execution time: mean = 682.537 us, total = 682.537 us, Queueing time: mean = 15.881 us, max = 15.881 us, min = 15.881 us, total = 15.881 us
Other Stats:
  grpc_in_progress: 0
  current number of task status events in buffer: 0
  current number of profile events in buffer: 0
  current number of dropped task attempts tracked: 0
  total task events sent: 0.207569 MiB
  total number of task attempts sent: 970
  total number of task attempts dropped reported: 0
  total number of sent failure: 0
  num status task events dropped: 0
  num profile task events dropped: 0
...
```

```
./logs/python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_1.log:948:[2024-07-28 06:05:35,385 W 1 165] plasma_store_provider.cc:452: Objects f827e954f38e789affffffffffffffffffffffff0100000001000000, 81596b9f0eb02e80ffffffffffffffffffffffff0100000001000000 are still not local after 124s. If this message continues to print, ray.get() is likely hung. Please file an issue at https://github.com/ray-project/ray/issues/.
./logs/python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_1.log:949:[2024-07-28 06:05:36,395 W 1 165] plasma_store_provider.cc:452: Objects f827e954f38e789affffffffffffffffffffffff0100000001000000, 81596b9f0eb02e80ffffffffffffffffffffffff0100000001000000 are still not local after 125s. If this message continues to print, ray.get() is likely hung. Please file an issue at https://github.com/ray-project/ray/issues/.
./logs/python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_1.log:950:[2024-07-28 06:05:37,406 W 1 165] plasma_store_provider.cc:452: Objects f827e954f38e789affffffffffffffffffffffff0100000001000000, 81596b9f0eb02e80ffffffffffffffffffffffff0100000001000000 are still not local after 126s. If this message continues to print, ray.get() is likely hung. Please file an issue at https://github.com/ray-project/ray/issues/.
```
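The error message above suggests setting `RAY_record_ref_creation_sites=1` so that future errors include the Python call site where the ObjectRef was created. A minimal sketch of doing this from the driver; note the variable must be set before Ray starts, so on a multi-node cluster it should instead be exported in the environment before running `ray start` on each node:

```python
import os

# Must be set before ray.init() (and before `ray start` on cluster nodes)
# so that every Ray process records ObjectRef creation sites.
os.environ["RAY_record_ref_creation_sites"] = "1"

# import ray
# ray.init()  # start Ray only after the variable is set
```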
Versions / Dependencies
Ubuntu 22.04
ray, version 2.32.0
Python 3.11.7
Reproduction script
It is not practical to provide a reproduction script, as the issue is intermittent. The object owners should not be exiting. However, here is pseudocode:
```python
import ray

@ray.remote(num_gpus=0.4)
def remote_train_step():
    ...  # training code elided in the original report

futures = []
future = remote_train_step.remote()
futures.append(future)
result = ray.get(futures)
```
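As a possible workaround until the root cause is found, the driver could catch the retrieval failure and resubmit the task. This is only a sketch, not the reporter's code: it assumes the task is idempotent (safe to re-run), and it wraps the submit/get calls generically so the retry logic itself carries no Ray dependency. With Ray, `ray.exceptions.OwnerDiedError` (a subclass of `ObjectLostError`) is the exception that corresponds to "The object's owner has exited."

```python
def get_with_resubmit(submit, get, retries=3, exc_types=(Exception,)):
    """Submit a task and retrieve its result, resubmitting on failure.

    submit:    zero-argument callable that launches the task and returns a ref
    get:       callable that blocks on the ref and returns the result
    retries:   maximum number of submission attempts
    exc_types: exception types that trigger a resubmit (e.g. owner death)
    """
    last_err = None
    for _ in range(retries):
        ref = submit()
        try:
            return get(ref)
        except exc_types as err:
            last_err = err  # retrieval failed; resubmit the task
    raise last_err
```

Hypothetical usage with the task from the report, assuming `remote_train_step` is idempotent:

```python
# result = get_with_resubmit(
#     submit=remote_train_step.remote,
#     get=ray.get,
#     exc_types=(ray.exceptions.OwnerDiedError,),
# )
```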
Issue Severity
High: It blocks me from completing my task.