jungwoowoo opened this issue 2 years ago
I am running into the same problem. Is there any workaround for this? Thanks!
@cadedaniel (oncall), could you take a look if you have time?
I will have time tomorrow to take a look.
Hi, I'm a bot from the Ray team :)
To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.
If there is no further activity within 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public Slack channel.
What happened + What you expected to happen
Some workloads were not distributed across the cluster.
logs:

```
(f pid=18152) mean ? 0.4999771519540568
(f pid=18183) mean ? 0.4999771519540568
(f pid=18121) mean ? 0.4999771519540568
(f pid=18152) mean ? 0.4999771519540568
(f pid=18096) mean ? 0.4999771519540568
(f pid=18154) mean ? 0.4999771519540568
(f pid=18157) mean ? 0.4999771519540568
(f pid=18156) mean ? 0.4999771519540568
Traceback (most recent call last):
  File "ray_sample7.py", line 28, in <module>
    results_from_ray = ray.get(object_ids)
  File "/home/medirita/anaconda2/envs/py36/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/medirita/anaconda2/envs/py36/lib/python3.6/site-packages/ray/_private/worker.py", line 2275, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::f() (pid=1008834, ip=192.168.0.50)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.OwnerDiedError: Failed to retrieve object 00ffffffffffffffffffffffffffffffffffffff3202000001000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during ray start and ray.init(). The object's owner has exited. This is the Python worker that first created the ObjectRef via .remote() or ray.put(). Check cluster logs (/tmp/ray/session_latest/logs/*32020000ffffffffffffffffffffffffffffffffffffffffffffffff* at IP address 192.168.0.48) for more information about the Python worker failure.
(f pid=18183) mean ? 0.4999771519540568
(f pid=18183) mean ? 0.4999771519540568
```
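As the error message says, RAY_record_ref_creation_sites=1 makes Ray record where each ObjectRef was created, which helps identify which process owned the lost object. It must be set before Ray starts on every node. A minimal sketch for the driver side (setting it on nodes launched with `ray start` requires exporting it in that shell instead):

```python
import os

# Must be set before Ray is initialized. On cluster nodes launched with
# `ray start`, export it in the shell environment instead, e.g.:
#   RAY_record_ref_creation_sites=1 ray start --head
os.environ["RAY_record_ref_creation_sites"] = "1"

import ray

ray.init(address="auto")
```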
Versions / Dependencies
ray, version 2.0.0
Reproduction script
```yaml
provider:
    type: local
    head_ip: 192.168.0.48
    worker_ips: ['192.168.0.50']

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: cluser_user

min_workers: 1
max_workers: 1
upscaling_speed: 1.0
idle_timeout_minutes: 1

file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
rsync_filter:

# List of commands that will be run before setup_commands. If docker is
# enabled, these commands will run outside the container and before docker
# is set up.
initialization_commands: []

# List of shell commands to run to set up each node.
setup_commands: []

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
```
```
[2022-11-11 11:42:23,454 I 17751 17751] core_worker_process.cc:107: Constructing CoreWorkerProcess. pid: 17751
[2022-11-11 11:42:23,459 I 17751 17751] grpc_server.cc:105: driver server started, listening on port 10061.
[2022-11-11 11:42:23,463 I 17751 17751] core_worker.cc:185: Initializing worker at address: 192.168.0.48:10061, worker ID 32020000ffffffffffffffffffffffffffffffffffffffffffffffff, raylet 107cc5cbb54e68913101c222cdae0aa5c3bf5cd801a0507ea89cc71b
[2022-11-11 11:42:23,465 I 17751 17751] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2022-11-11 11:42:23,465 I 17751 17841] core_worker.cc:476: Event stats:
Global stats: 11 total (7 active)
Queueing time: mean = 10.983 us, max = 75.020 us, min = 12.041 us, total = 120.813 us
Execution time: mean = 19.145 us, total = 210.595 us
Event stats:
        PeriodicalRunner.RunFnPeriodically - 5 total (3 active, 1 running), CPU time: mean = 3.391 us, total = 16.956 us
        InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
        InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), CPU time: mean = 173.999 us, total = 173.999 us
        UNKNOWN - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
        WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 19.640 us, total = 19.640 us
        CoreWorker.deadline_timer.flush_profiling_events - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
        NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = 107cc5cbb54e68913101c222cdae0aa5c3bf5cd801a0507ea89cc71b, IsAlive = 1
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = c9f66e4d351677dc27579cc654abe15fa96b3464bf946043eebb6517, IsAlive = 1
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = 8b4a8b4ffcbc375e8fb82bdd7c42d9f70338657a52e05ed5a9ac72b4, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from 8b4a8b4ffcbc375e8fb82bdd7c42d9f70338657a52e05ed5a9ac72b4. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = 1076ac16c7e6ed757b97f6056abb06856190b98d62e5d94c8d47264f, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from 1076ac16c7e6ed757b97f6056abb06856190b98d62e5d94c8d47264f. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = b0d940d24be636850dbe9406620e3d493b523ae388690c580eb74ba7, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from b0d940d24be636850dbe9406620e3d493b523ae388690c580eb74ba7. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = 404152238181c553b1ece96343953036f233028b2af56ea45b242d53, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from 404152238181c553b1ece96343953036f233028b2af56ea45b242d53. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = d069e5aadb3360ea42be31c9adf139873b271c7d7c260b2494d40012, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from d069e5aadb3360ea42be31c9adf139873b271c7d7c260b2494d40012. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = 1101ec680b1ccd9300d011ad95cc4a69988141c757f0027c1a8b162d, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from 1101ec680b1ccd9300d011ad95cc4a69988141c757f0027c1a8b162d. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = 316f8bb690d125cdb866b87a4d47fd77e971e111a6a52e6ff8e8feb0, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from 316f8bb690d125cdb866b87a4d47fd77e971e111a6a52e6ff8e8feb0. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = 168aedaa2153bf0d66d75918e8c6ef9197e051602afcfe61ac63b269, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from 168aedaa2153bf0d66d75918e8c6ef9197e051602afcfe61ac63b269. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,474 I 17751 17841] accessor.cc:608: Received notification for node id = 98f95e720be69976203a2fd4403fb8a0f444b6d487e80aa2e8650c9e, IsAlive = 0
[2022-11-11 11:42:23,474 I 17751 17841] core_worker.cc:698: Node failure from 98f95e720be69976203a2fd4403fb8a0f444b6d487e80aa2e8650c9e. All objects pinned on that node will be lost if object reconstruction is not enabled.
. . . .
[2022-11-11 11:42:23,487 I 17751 17841] accessor.cc:608: Received notification for node id = 75f7580e76928762d087929f88c123e9038557e6339217124dc32934, IsAlive = 0
[2022-11-11 11:42:23,487 I 17751 17841] core_worker.cc:698: Node failure from 75f7580e76928762d087929f88c123e9038557e6339217124dc32934. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,487 I 17751 17841] accessor.cc:608: Received notification for node id = 8bf9bbdae2fe9e4fc3b4711125b27040cd927ecca331c231e17e76cb, IsAlive = 0
[2022-11-11 11:42:23,487 I 17751 17841] core_worker.cc:698: Node failure from 8bf9bbdae2fe9e4fc3b4711125b27040cd927ecca331c231e17e76cb. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:23,487 I 17751 17841] accessor.cc:608: Received notification for node id = 797764d50df3b76df3f4508bd604b0b480b9c49c48d2a1a603bb8181, IsAlive = 0
[2022-11-11 11:42:23,487 I 17751 17841] core_worker.cc:698: Node failure from 797764d50df3b76df3f4508bd604b0b480b9c49c48d2a1a603bb8181. All objects pinned on that node will be lost if object reconstruction is not enabled.
[2022-11-11 11:42:25,114 I 17751 17841] direct_task_transport.cc:264: Connecting to raylet c9f66e4d351677dc27579cc654abe15fa96b3464bf946043eebb6517
[2022-11-11 11:42:34,601 I 17751 17751] core_worker.cc:593: Disconnecting to the raylet.
[2022-11-11 11:42:34,602 I 17751 17751] raylet_client.cc:163: RayletClient::Disconnect, exit_type=INTENDED_USER_EXIT, exit_detail=Shutdown by ray.shutdown()., has creation_task_exception_pb_bytes=0
[2022-11-11 11:42:34,602 I 17751 17751] core_worker.cc:540: Shutting down a core worker.
[2022-11-11 11:42:34,602 I 17751 17751] core_worker.cc:564: Disconnecting a GCS client.
[2022-11-11 11:42:34,602 I 17751 17751] core_worker.cc:568: Waiting for joining a core worker io thread. If it hangs here, there might be deadlock or a high load in the core worker io service.
[2022-11-11 11:42:34,602 I 17751 17841] core_worker.cc:691: Core worker main io service stopped.
[2022-11-11 11:42:34,602 I 17751 17751] core_worker.cc:577: Core worker ready to be deallocated.
[2022-11-11 11:42:34,602 I 17751 17751] core_worker.cc:531: Core worker is destructed
[2022-11-11 11:42:34,786 I 17751 17751] core_worker_process.cc:144: Destructing CoreWorkerProcessImpl. pid: 17751
[2022-11-11 11:42:34,786 I 17751 17751] io_service_pool.cc:47: IOServicePool is stopped.
```
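Regarding the workaround question above: the owner in this log appears to be the driver itself (its worker ID 32020000... matches the ID in the OwnerDiedError, and the log ends with exit_detail=Shutdown by ray.shutdown().). The usual mitigation is simply to keep the owner process alive until every consumer has fetched the object, i.e. do not call ray.shutdown() or exit the driver while tasks that reference driver-owned objects are still pending. A minimal sketch of that pattern (function and variable names are illustrative only):

```python
import ray

ray.init(address="auto")

@ray.remote
def f(arr):
    return sum(arr) / len(arr)

inputs_ref = ray.put([1.0, 2.0, 3.0])            # driver-owned object
pending = [f.remote(inputs_ref) for _ in range(8)]

try:
    # Block until every consumer has finished before the owner goes away;
    # shutting down the driver earlier is what makes dependent tasks fail
    # with OwnerDiedError.
    results = ray.get(pending)
finally:
    ray.shutdown()
```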