ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] some workloads were not able to be distributed on cluster. #30200

Open jungwoowoo opened 2 years ago

jungwoowoo commented 2 years ago

What happened + What you expected to happen

Some workloads were not distributed across the cluster.

Logs:

(f pid=18152) mean ? 0.4999771519540568
(f pid=18183) mean ? 0.4999771519540568
(f pid=18121) mean ? 0.4999771519540568
(f pid=18152) mean ? 0.4999771519540568
(f pid=18096) mean ? 0.4999771519540568
(f pid=18154) mean ? 0.4999771519540568
(f pid=18157) mean ? 0.4999771519540568
(f pid=18156) mean ? 0.4999771519540568
Traceback (most recent call last):
  File "ray_sample7.py", line 28, in <module>
    results_from_ray = ray.get(object_ids)
  File "/home/medirita/anaconda2/envs/py36/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/medirita/anaconda2/envs/py36/lib/python3.6/site-packages/ray/_private/worker.py", line 2275, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::f() (pid=1008834, ip=192.168.0.50)
At least one of the input arguments for this task could not be computed:
ray.exceptions.OwnerDiedError: Failed to retrieve object 00ffffffffffffffffffffffffffffffffffffff3202000001000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during ray start and ray.init().

The object's owner has exited. This is the Python worker that first created the ObjectRef via .remote() or ray.put(). Check cluster logs (/tmp/ray/session_latest/logs/*32020000ffffffffffffffffffffffffffffffffffffffffffffffff* at IP address 192.168.0.48) for more information about the Python worker failure.

(f pid=18183) mean ? 0.4999771519540568
(f pid=18183) mean ? 0.4999771519540568
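The OwnerDiedError means the process that created the ObjectRef (the driver that called ray.put() / f.remote()) exited before the task on 192.168.0.50 could fetch its argument. As the message suggests, setting RAY_record_ref_creation_sites makes the error report where the lost ObjectRef was created. A minimal sketch of enabling it on the driver side, assuming the variable is also exported before ray start on the head and worker nodes, might look like this (not part of the original report):

import os
import ray

# Debugging aid suggested by the error message above: record where each
# ObjectRef is created so that OwnerDiedError can point at the creation site.
# The same variable would also need to be set in the environment of
# `ray start` on the head and worker nodes for cluster-side workers.
os.environ["RAY_record_ref_creation_sites"] = "1"

ray.init(address="auto")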

Versions / Dependencies

ray, version 2.0.0

Reproduction script

  1. sample code
import numpy as np
from collections import namedtuple
import ray
ray.init(
    address='auto',
    _node_ip_address="192.168.0.48"
)

callback_entry = namedtuple("callback_entry", ['index', 'value'])
result_list = []
def gathering_results_callback(entry):
    result_list[entry.index] = entry.value

@ray.remote
def f(G, temp_index):
    #G = np.random.randint(2, size=(N,N))
    temp = np.mean(G)
    print('mean ? ', temp)
    return callback_entry(index=temp_index, value=temp)

result_list = np.zeros(400)

values = np.random.randint(2, size=(4848,4848))
ray_param1 = ray.put(values)

object_ids = [f.remote(ray_param1, temp_index) for temp_index in range(400)]

results_from_ray = ray.get(object_ids)

for entry in results_from_ray:
    gathering_results_callback(entry)

ray.shutdown()
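Since the report is that work stayed on one machine and the worker-side logs further down show a burst of dead-node notifications, a quick diagnostic (a hedged sketch, not part of the original script) is to confirm from the driver that both 192.168.0.48 and 192.168.0.50 are registered and alive before submitting the 400 tasks:

import ray

ray.init(address="auto")

# List every node Ray knows about and whether it is currently alive; for the
# two-node local cluster defined in the yaml below, both 192.168.0.48 and
# 192.168.0.50 should report Alive = True.
for node in ray.nodes():
    print(node["NodeManagerAddress"], "Alive =", node["Alive"])

# Total resources across alive nodes; if only one machine's CPUs appear here,
# tasks cannot be spread onto the second machine.
print(ray.cluster_resources())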
  2. cluster yaml:

    # A unique identifier for the head node and workers of this cluster.
    cluster_name: ray_cluster_local

    provider:
        type: local
        head_ip: 192.168.0.48
        worker_ips: ['192.168.0.50']

    # How Ray will authenticate with newly launched nodes.
    auth:
        ssh_user: cluser_user

    min_workers: 1
    max_workers: 1
    upscaling_speed: 1.0
    idle_timeout_minutes: 1

    file_mounts: {
    #    "/path1/on/remote/machine": "/path1/on/local/machine",
    #    "/path2/on/remote/machine": "/path2/on/local/machine",
    }

    cluster_synced_files: []

    # Whether changes to directories in file_mounts or cluster_synced_files in the head node
    # should sync to the worker node continuously
    file_mounts_sync_continuously: False

    # Patterns for files to exclude when running rsync up or rsync down
    rsync_exclude:

    rsync_filter:

    # List of commands that will be run before setup_commands. If docker is
    # enabled, these commands will run outside the container and before docker
    # is setup.
    initialization_commands: []

    # List of shell commands to run to set up each node.
    setup_commands: []

    # Custom commands that will be run on the head node after common setup.
    head_setup_commands: []

    # Custom commands that will be run on worker nodes after common setup.
    worker_setup_commands: []

    # Command to start ray on the head node. You don't need to change this.
    head_start_ray_commands:

Global stats: 11 total (7 active)
Queueing time: mean = 10.983 us, max = 75.020 us, min = 12.041 us, total = 120.813 us
Execution time: mean = 19.145 us, total = 210.595 us
Event stats:
    PeriodicalRunner.RunFnPeriodically - 5 total (3 active, 1 running), CPU time: mean = 3.391 us, total = 16.956 us
    InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), CPU time: mean = 173.999 us, total = 173.999 us
    UNKNOWN - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 19.640 us, total = 19.640 us
    CoreWorker.deadline_timer.flush_profiling_events - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s

[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = 107cc5cbb54e68913101c222cdae0aa5c3bf5cd801a0507ea89cc71b, IsAlive = 1
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = c9f66e4d351677dc27579cc654abe15fa96b3464bf946043eebb6517, IsAlive = 1
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = 8b4a8b4ffcbc375e8fb82bdd7c42d9f70338657a52e05ed5a9ac72b4, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from 8b4a8b4ffcbc375e8fb82bdd7c42d9f70338657a52e05ed5a9ac72b4. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = 1076ac16c7e6ed757b97f6056abb06856190b98d62e5d94c8d47264f, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from 1076ac16c7e6ed757b97f6056abb06856190b98d62e5d94c8d47264f. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = b0d940d24be636850dbe9406620e3d493b523ae388690c580eb74ba7, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from b0d940d24be636850dbe9406620e3d493b523ae388690c580eb74ba7. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = 404152238181c553b1ece96343953036f233028b2af56ea45b242d53, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from 404152238181c553b1ece96343953036f233028b2af56ea45b242d53. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = d069e5aadb3360ea42be31c9adf139873b271c7d7c260b2494d40012, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from d069e5aadb3360ea42be31c9adf139873b271c7d7c260b2494d40012. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = 1101ec680b1ccd9300d011ad95cc4a69988141c757f0027c1a8b162d, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from 1101ec680b1ccd9300d011ad95cc4a69988141c757f0027c1a8b162d. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = 316f8bb690d125cdb866b87a4d47fd77e971e111a6a52e6ff8e8feb0, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from 316f8bb690d125cdb866b87a4d47fd77e971e111a6a52e6ff8e8feb0. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = 168aedaa2153bf0d66d75918e8c6ef9197e051602afcfe61ac63b269, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from 168aedaa2153bf0d66d75918e8c6ef9197e051602afcfe61ac63b269. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = 98f95e720be69976203a2fd4403fb8a0f444b6d487e80aa2e8650c9e, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from 98f95e720be69976203a2fd4403fb8a0f444b6d487e80aa2e8650c9e. All objects pinned on that node will be lost if object reconstruction is not enabled.
...
[2022-11-11 11:42:23,487 I 17751 17841] accessor.cc:608: Received notification for node id = 75f7580e76928762d087929f88c123e9038557e6339217124dc32934, IsAlive = 0
[2022-11-11 11:42:23,487 I 17751 17841] core_worker.cc:698: Node failure from 75f7580e76928762d087929f88c123e9038557e6339217124dc32934. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,487 I 17751 17841] accessor.cc:608: Received notification for node id = 8bf9bbdae2fe9e4fc3b4711125b27040cd927ecca331c231e17e76cb, IsAlive = 0
[2022-11-11 11:42:23,487 I 17751 17841] core_worker.cc:698: Node failure from 8bf9bbdae2fe9e4fc3b4711125b27040cd927ecca331c231e17e76cb. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,487 I 17751 17841] accessor.cc:608: Received notification for node id = 797764d50df3b76df3f4508bd604b0b480b9c49c48d2a1a603bb8181, IsAlive = 0
[2022-11-11 11:42:23,487 I 17751 17841] core_worker.cc:698: Node failure from 797764d50df3b76df3f4508bd604b0b480b9c49c48d2a1a603bb8181. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:25,114 I 17751 17841] direct_task_transport.cc:264: Connecting to raylet c9f66e4d351677dc27579cc654abe15fa96b3464bf946043eebb6517
[2022-11-11 11:42:34,601 I 17751 17751] core_worker.cc:593: Disconnecting to the raylet.
[2022-11-11 11:42:34,602 I 17751 17751] raylet_client.cc:163: RayletClient::Disconnect, exit_type=INTENDED_USER_EXIT, exit_detail=Shutdown by ray.shutdown()., has creation_task_exception_pb_bytes=0
[2022-11-11 11:42:34,602 I 17751 17751] core_worker.cc:540: Shutting down a core worker.
[2022-11-11 11:42:34,602 I 17751 17751] core_worker.cc:564: Disconnecting a GCS client.
[2022-11-11 11:42:34,602 I 17751 17751] core_worker.cc:568: Waiting for joining a core worker io thread. If it hangs here, there might be deadlock or a high load in the core worker io service.
[2022-11-11 11:42:34,602 I 17751 17841] core_worker.cc:691: Core worker main io service stopped.
[2022-11-11 11:42:34,602 I 17751 17751] core_worker.cc:577: Core worker ready to be deallocated.
[2022-11-11 11:42:34,602 I 17751 17751] core_worker.cc:531: Core worker is destructed
[2022-11-11 11:42:34,786 I 17751 17751] core_worker_process.cc:144: Destructing CoreWorkerProcessImpl. pid: 17751
[2022-11-11 11:42:34,786 I 17751 17751] io_service_pool.cc:47: IOServicePool is stopped.



Issue Severity

High: It blocks me from completing my task.

ayl0407 commented 1 year ago

I am running into the same problem. Is there any workaround for this? Thanks!

scv119 commented 1 year ago

@cadedaniel (oncall), could you let me know if you have time to take a look?

cadedaniel commented 1 year ago

I will have time tomorrow to take a look.

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.