ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] some workloads were not able to be distributed on cluster. #30200

Open jungwoowoo opened 2 years ago

jungwoowoo commented 2 years ago

What happened + What you expected to happen

Some workloads were not distributed across the cluster.

Logs:

(f pid=18152) mean ? 0.4999771519540568
(f pid=18183) mean ? 0.4999771519540568
(f pid=18121) mean ? 0.4999771519540568
(f pid=18152) mean ? 0.4999771519540568
(f pid=18096) mean ? 0.4999771519540568
(f pid=18154) mean ? 0.4999771519540568
(f pid=18157) mean ? 0.4999771519540568
(f pid=18156) mean ? 0.4999771519540568
Traceback (most recent call last):
  File "ray_sample7.py", line 28, in <module>
    results_from_ray = ray.get(object_ids)
  File "/home/medirita/anaconda2/envs/py36/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/medirita/anaconda2/envs/py36/lib/python3.6/site-packages/ray/_private/worker.py", line 2275, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::f() (pid=1008834, ip=192.168.0.50)
At least one of the input arguments for this task could not be computed:
ray.exceptions.OwnerDiedError: Failed to retrieve object 00ffffffffffffffffffffffffffffffffffffff3202000001000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during ray start and ray.init().

The object's owner has exited. This is the Python worker that first created the ObjectRef via .remote() or ray.put(). Check cluster logs (/tmp/ray/session_latest/logs/*32020000ffffffffffffffffffffffffffffffffffffffffffffffff* at IP address 192.168.0.48) for more information about the Python worker failure.

(f pid=18183) mean ? 0.4999771519540568
(f pid=18183) mean ? 0.4999771519540568
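The OwnerDiedError means the process that created the ObjectRef (the driver that called ray.put() / f.remote()) exited before the task on 192.168.0.50 could fetch its argument. As the message suggests, setting RAY_record_ref_creation_sites makes the error report where the lost ObjectRef was created. A minimal sketch of enabling it on the driver side, assuming the variable is also exported before ray start on the head and worker nodes, might look like this (not part of the original report):

import os
import ray

# Debugging aid suggested by the error message above: record where each
# ObjectRef is created so that OwnerDiedError can point at the creation site.
# The same variable would also need to be set in the environment of
# `ray start` on the head and worker nodes for cluster-side workers.
os.environ["RAY_record_ref_creation_sites"] = "1"

ray.init(address="auto")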

Versions / Dependencies

ray, version 2.0.0

Reproduction script

  1. sample code
import numpy as np
from collections import namedtuple
import ray
ray.init(
    address='auto',
    _node_ip_address="192.168.0.48"
)

callback_entry = namedtuple("callback_entry", ['index', 'value'])
result_list = []
def gathering_results_callback(entry):
    result_list[entry.index] = entry.value

@ray.remote
def f(G, temp_index):
    #G = np.random.randint(2, size=(N,N))
    temp = np.mean(G)
    print('mean ? ', temp)
    return callback_entry(index=temp_index, value=temp)

result_list = np.zeros(400)

values = np.random.randint(2, size=(4848,4848))
ray_param1 = ray.put(values)

object_ids = [f.remote(ray_param1, temp_index) for temp_index in range(400)]

results_from_ray = ray.get(object_ids)

for entry in results_from_ray:
    gathering_results_callback(entry)

ray.shutdown()
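Since the report is that work stayed on one machine and the worker-side logs further down show a burst of dead-node notifications, a quick diagnostic (a hedged sketch, not part of the original script) is to confirm from the driver that both 192.168.0.48 and 192.168.0.50 are registered and alive before submitting the 400 tasks:

import ray

ray.init(address="auto")

# List every node Ray knows about and whether it is currently alive; for the
# two-node local cluster defined in the yaml below, both 192.168.0.48 and
# 192.168.0.50 should report Alive = True.
for node in ray.nodes():
    print(node["NodeManagerAddress"], "Alive =", node["Alive"])

# Total resources across alive nodes; if only one machine's CPUs appear here,
# tasks cannot be spread onto the second machine.
print(ray.cluster_resources())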
  2. cluster yaml:

    # A unique identifier for the head node and workers of this cluster.
    cluster_name: ray_cluster_local

    provider:
        type: local
        head_ip: 192.168.0.48
        worker_ips: ['192.168.0.50']

    # How Ray will authenticate with newly launched nodes.
    auth:
        ssh_user: cluser_user

    min_workers: 1
    max_workers: 1
    upscaling_speed: 1.0
    idle_timeout_minutes: 1

    file_mounts: {
    #    "/path1/on/remote/machine": "/path1/on/local/machine",
    #    "/path2/on/remote/machine": "/path2/on/local/machine",
    }

    cluster_synced_files: []

    # Whether changes to directories in file_mounts or cluster_synced_files in the head node
    # should sync to the worker node continuously
    file_mounts_sync_continuously: False

    # Patterns for files to exclude when running rsync up or rsync down
    rsync_exclude:

    rsync_filter:

    # List of commands that will be run before setup_commands. If docker is
    # enabled, these commands will run outside the container and before docker
    # is setup.
    initialization_commands: []

    # List of shell commands to run to set up each node.
    setup_commands: []

    # Custom commands that will be run on the head node after common setup.
    head_setup_commands: []

    # Custom commands that will be run on worker nodes after common setup.
    worker_setup_commands: []

    # Command to start ray on the head node. You don't need to change this.
    head_start_ray_commands:

Global stats: 11 total (7 active)
Queueing time: mean = 10.983 us, max = 75.020 us, min = 12.041 us, total = 120.813 us
Execution time: mean = 19.145 us, total = 210.595 us
Event stats:
    PeriodicalRunner.RunFnPeriodically - 5 total (3 active, 1 running), CPU time: mean = 3.391 us, total = 16.956 us
    InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), CPU time: mean = 173.999 us, total = 173.999 us
    UNKNOWN - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 19.640 us, total = 19.640 us
    CoreWorker.deadline_timer.flush_profiling_events - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s

[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = 107cc5cbb54e68913101c222cdae0aa5c3bf5cd801a0507ea89cc71b, IsAlive = 1
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = c9f66e4d351677dc27579cc654abe15fa96b3464bf946043eebb6517, IsAlive = 1
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = 8b4a8b4ffcbc375e8fb82bdd7c42d9f70338657a52e05ed5a9ac72b4, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from 8b4a8b4ffcbc375e8fb82bdd7c42d9f70338657a52e05ed5a9ac72b4. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = 1076ac16c7e6ed757b97f6056abb06856190b98d62e5d94c8d47264f, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from 1076ac16c7e6ed757b97f6056abb06856190b98d62e5d94c8d47264f. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = b0d940d24be636850dbe9406620e3d493b523ae388690c580eb74ba7, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from b0d940d24be636850dbe9406620e3d493b523ae388690c580eb74ba7. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = 404152238181c553b1ece96343953036f233028b2af56ea45b242d53, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from 404152238181c553b1ece96343953036f233028b2af56ea45b242d53. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = d069e5aadb3360ea42be31c9adf139873b271c7d7c260b2494d40012, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from d069e5aadb3360ea42be31c9adf139873b271c7d7c260b2494d40012. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = 1101ec680b1ccd9300d011ad95cc4a69988141c757f0027c1a8b162d, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from 1101ec680b1ccd9300d011ad95cc4a69988141c757f0027c1a8b162d. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = 316f8bb690d125cdb866b87a4d47fd77e971e111a6a52e6ff8e8feb0, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from 316f8bb690d125cdb866b87a4d47fd77e971e111a6a52e6ff8e8feb0. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = 168aedaa2153bf0d66d75918e8c6ef9197e051602afcfe61ac63b269, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from 168aedaa2153bf0d66d75918e8c6ef9197e051602afcfe61ac63b269. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = 98f95e720be69976203a2fd4403fb8a0f444b6d487e80aa2e8650c9e, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from 98f95e720be69976203a2fd4403fb8a0f444b6d487e80aa2e8650c9e. All objects pinned on that node will be lost if object reconstruction is not enabled.
...
[2022-11-11 11:42:23,487 I 17751 17841] accessor.cc:608: Received notification for node id = 75f7580e76928762d087929f88c123e9038557e6339217124dc32934, IsAlive = 0
[2022-11-11 11:42:23,487 I 17751 17841] core_worker.cc:698: Node failure from 75f7580e76928762d087929f88c123e9038557e6339217124dc32934. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,487 I 17751 17841] accessor.cc:608: Received notification for node id = 8bf9bbdae2fe9e4fc3b4711125b27040cd927ecca331c231e17e76cb, IsAlive = 0
[2022-11-11 11:42:23,487 I 17751 17841] core_worker.cc:698: Node failure from 8bf9bbdae2fe9e4fc3b4711125b27040cd927ecca331c231e17e76cb. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,487 I 17751 17841] accessor.cc:608: Received notification for node id = 797764d50df3b76df3f4508bd604b0b480b9c49c48d2a1a603bb8181, IsAlive = 0
[2022-11-11 11:42:23,487 I 17751 17841] core_worker.cc:698: Node failure from 797764d50df3b76df3f4508bd604b0b480b9c49c48d2a1a603bb8181. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:25,114 I 17751 17841] direct_task_transport.cc:264: Connecting to raylet c9f66e4d351677dc27579cc654abe15fa96b3464bf946043eebb6517
[2022-11-11 11:42:34,601 I 17751 17751] core_worker.cc:593: Disconnecting to the raylet.
[2022-11-11 11:42:34,602 I 17751 17751] raylet_client.cc:163: RayletClient::Disconnect, exit_type=INTENDED_USER_EXIT, exit_detail=Shutdown by ray.shutdown()., has creation_task_exception_pb_bytes=0
[2022-11-11 11:42:34,602 I 17751 17751] core_worker.cc:540: Shutting down a core worker.
[2022-11-11 11:42:34,602 I 17751 17751] core_worker.cc:564: Disconnecting a GCS client.
[2022-11-11 11:42:34,602 I 17751 17751] core_worker.cc:568: Waiting for joining a core worker io thread. If it hangs here, there might be deadlock or a high load in the core worker io service.
[2022-11-11 11:42:34,602 I 17751 17841] core_worker.cc:691: Core worker main io service stopped.
[2022-11-11 11:42:34,602 I 17751 17751] core_worker.cc:577: Core worker ready to be deallocated.
[2022-11-11 11:42:34,602 I 17751 17751] core_worker.cc:531: Core worker is destructed
[2022-11-11 11:42:34,786 I 17751 17751] core_worker_process.cc:144: Destructing CoreWorkerProcessImpl. pid: 17751
[2022-11-11 11:42:34,786 I 17751 17751] io_service_pool.cc:47: IOServicePool is stopped.



Issue Severity

High: It blocks me from completing my task.

ayl0407 commented 1 year ago

I am running into the same problem. Is there any workaround for this? Thanks!

scv119 commented 1 year ago

@cadedaniel (oncall), could you let me know if you have time to take a look?

cadedaniel commented 1 year ago

I will have time tomorrow to take a look.

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.