Open alphaDev23 opened 1 month ago
@alphaDev23 How often does this happen? Can you post the worker log (/tmp/ray/session_latest/logs/*01000000ffffffffffffffffffffffffffffffffffffffffffffffff*
at IP address 192.168.3.29) to provide more information, maybe there is a stack trace or error messages in the log?
We will need more information to investigate.
Here the log,
[2024-08-08 01:55:49,096 I 1 161] core_worker_process.cc:107: Constructing CoreWorkerProcess. pid: 1
[2024-08-08 01:55:49,100 I 1 161] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2024-08-08 01:55:50,661 I 1 161] grpc_server.cc:134: driver server started, listening on port 10002.
[2024-08-08 01:55:50,667 I 1 161] core_worker.cc:275: Initializing worker at address: 192.168.3.29:10002, worker ID 01000000ffffffffffffffffffffffffffffffffffffffffffffffff, raylet aa8d4dfcb325039dc1b336a9da374f502a2c3e16271aea10cc134e5a
[2024-08-08 01:55:50,668 I 1 161] task_event_buffer.cc:179: Reporting task events to GCS every 1000ms.
[2024-08-08 01:55:50,671 I 1 429] accessor.cc:714: Received notification for node id = aa8d4dfcb325039dc1b336a9da374f502a2c3e16271aea10cc134e5a, IsAlive = 1
[2024-08-08 01:55:50,671 I 1 429] core_worker.cc:4740: Number of alive nodes:1
[2024-08-08 01:55:50,671 I 1 429] accessor.cc:714: Received notification for node id = fcfe05c088cb6da849712b19055393c608e4b1271ae8f114195a88f0, IsAlive = 1
[2024-08-08 01:55:50,671 I 1 161] event.cc:234: Set ray event level to warning
[2024-08-08 01:55:50,671 I 1 161] event.cc:342: Ray Event initialized for CORE_WORKER
[2024-08-08 01:55:50,671 I 1 429] core_worker.cc:642: Event stats:
Global stats: 14 total (7 active)
Queueing time: mean = 11.257 us, max = 46.371 us, min = 26.940 us, total = 157.603 us
Execution time: mean = 262.685 us, total = 3.678 ms
Event stats:
PeriodicalRunner.RunFnPeriodically - 6 total (5 active, 1 running), Execution time: mean = 30.451 us, total = 182.704 us, Queueing time: mean = 7.729 us, max = 46.371 us, min = 46.371 us, total = 46.371 us
Publisher.CheckDeadSubscribers - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
ray::rpc::WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), Execution time: mean = 1.432 ms, total = 1.432 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
ray::rpc::NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), Execution time: mean = 708.186 us, total = 708.186 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
ray::rpc::WorkerInfoGcsService.grpc_client.AddWorkerInfo.OnReplyReceived - 1 total (0 active), Execution time: mean = 87.472 us, total = 87.472 us, Queueing time: mean = 26.940 us, max = 26.940 us, min = 26.940 us, total = 26.940 us
ray::rpc::NodeInfoGcsService.grpc_client.GetAllNodeInfo.OnReplyReceived - 1 total (0 active), Execution time: mean = 220.475 us, total = 220.475 us, Queueing time: mean = 42.451 us, max = 42.451 us, min = 42.451 us, total = 42.451 us
ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch.OnReplyReceived - 1 total (0 active), Execution time: mean = 385.639 us, total = 385.639 us, Queueing time: mean = 41.841 us, max = 41.841 us, min = 41.841 us, total = 41.841 us
ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), Execution time: mean = 661.555 us, total = 661.555 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
-----------------
Task Event stats:
IO Service Stats:
Global stats: 4 total (1 active)
Queueing time: mean = 8.393 us, max = 17.460 us, min = 16.110 us, total = 33.570 us
Execution time: mean = 393.499 us, total = 1.574 ms
Event stats:
ray::rpc::TaskInfoGcsService.grpc_client.AddTaskEventData.OnReplyReceived - 1 total (0 active), Execution time: mean = 69.032 us, total = 69.032 us, Queueing time: mean = 17.460 us, max = 17.460 us, min = 17.460 us, total = 17.460 us
PeriodicalRunner.RunFnPeriodically - 1 total (0 active), Execution time: mean = 653.685 us, total = 653.685 us, Queueing time: mean = 16.110 us, max = 16.110 us, min = 16.110 us, total = 16.110 us
ray::rpc::TaskInfoGcsService.grpc_client.AddTaskEventData - 1 total (0 active), Execution time: mean = 851.280 us, total = 851.280 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
CoreWorker.deadline_timer.flush_task_events - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
Other Stats:
grpc_in_progress:0
current number of task status events in buffer: 1
current number of profile events in buffer: 0
current number of dropped task attempts tracked: 0
total task events sent: 0 MiB
total number of task attempts sent: 0
total number of task attempts dropped reported: 0
total number of sent failure: 0
num status task events dropped: 0
num profile task events dropped: 0
[2024-08-08 01:55:59,111 W 1 424] metric_exporter.cc:105: [1] Export metrics to agent failed: RpcError: RPC Error message: failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:61092: Failed to connect to remote host: Connection refused; RPC Error details: . This won't affect Ray, but you can lose metrics from the cluster.
[2024-08-08 01:56:31,204 I 1 429] direct_task_transport.cc:296: Connecting to raylet fcfe05c088cb6da849712b19055393c608e4b1271ae8f114195a88f0
[2024-08-08 01:56:50,672 I 1 429] core_worker.cc:642: Event stats:
Global stats: 905 total (9 active)
Queueing time: mean = 44.477 us, max = 1.600 ms, min = 12.719 us, total = 40.252 ms
Execution time: mean = 86.850 ms, total = 78.599 s
Event stats:
CoreWorker.RecoverObjects - 600 total (1 active), Execution time: mean = 8.523 us, total = 5.114 ms, Queueing time: mean = 46.090 us, max = 454.988 us, min = 12.719 us, total = 27.654 ms
CoreWorker.InternalHeartbeat - 60 total (1 active), Execution time: mean = 277.915 us, total = 16.675 ms, Queueing time: mean = 42.072 us, max = 257.954 us, min = 21.935 us, total = 2.524 ms
NodeManagerService.grpc_client.ReportWorkerBacklog.OnReplyReceived - 60 total (0 active), Execution time: mean = 31.331 us, total = 1.880 ms, Queueing time: mean = 38.568 us, max = 100.992 us, min = 21.000 us, total = 2.314 ms
NodeManagerService.grpc_client.ReportWorkerBacklog - 60 total (0 active), Execution time: mean = 945.367 us, total = 56.722 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
NodeManagerService.grpc_client.RequestWorkerLease - 17 total (1 active), Execution time: mean = 3.495 s, total = 59.415 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
NodeManagerService.grpc_client.RequestWorkerLease.OnReplyReceived - 16 total (0 active), Execution time: mean = 797.742 us, total = 12.764 ms, Queueing time: mean = 41.601 us, max = 120.853 us, min = 17.900 us, total = 665.614 us
CoreWorker.RecordMetrics - 12 total (1 active), Execution time: mean = 53.260 us, total = 639.114 us, Queueing time: mean = 25.050 us, max = 37.374 us, min = 18.013 us, total = 300.601 us
CoreWorker.SubmitTask - 10 total (0 active), Execution time: mean = 555.652 us, total = 5.557 ms, Queueing time: mean = 137.713 us, max = 983.322 us, min = 28.980 us, total = 1.377 ms
CoreWorkerService.grpc_client.PushTask - 9 total (1 active), Execution time: mean = 2.117 s, total = 19.056 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
NodeManagerService.grpc_client.PinObjectIDs.OnReplyReceived - 8 total (0 active), Execution time: mean = 85.143 us, total = 681.146 us, Queueing time: mean = 47.826 us, max = 69.622 us, min = 24.550 us, total = 382.608 us
NodeManagerService.grpc_client.PinObjectIDs - 8 total (0 active), Execution time: mean = 1.261 ms, total = 10.087 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
NodeManagerService.grpc_client.ReturnWorker - 8 total (0 active), Execution time: mean = 1.033 ms, total = 8.265 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
CoreWorkerService.grpc_client.PushTask.OnReplyReceived - 8 total (0 active), Execution time: mean = 527.761 us, total = 4.222 ms, Queueing time: mean = 40.333 us, max = 55.761 us, min = 29.870 us, total = 322.666 us
NodeManagerService.grpc_client.ReturnWorker.OnReplyReceived - 8 total (0 active), Execution time: mean = 18.434 us, total = 147.474 us, Queueing time: mean = 32.866 us, max = 55.481 us, min = 23.531 us, total = 262.925 us
CoreWorker.GCStreamingGeneratorMetadata - 6 total (1 active), Execution time: mean = 4.083 us, total = 24.501 us, Queueing time: mean = 27.620 us, max = 35.962 us, min = 30.644 us, total = 165.722 us
PeriodicalRunner.RunFnPeriodically - 6 total (0 active), Execution time: mean = 302.184 us, total = 1.813 ms, Queueing time: mean = 695.153 us, max = 1.600 ms, min = 18.940 us, total = 4.171 ms
ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch.OnReplyReceived - 1 total (0 active), Execution time: mean = 385.639 us, total = 385.639 us, Queueing time: mean = 41.841 us, max = 41.841 us, min = 41.841 us, total = 41.841 us
ray::rpc::WorkerInfoGcsService.grpc_client.AddWorkerInfo.OnReplyReceived - 1 total (0 active), Execution time: mean = 87.472 us, total = 87.472 us, Queueing time: mean = 26.940 us, max = 26.940 us, min = 26.940 us, total = 26.940 us
ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), Execution time: mean = 661.555 us, total = 661.555 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
CoreWorker.PrintEventStats - 1 total (1 active, 1 running), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
ray::rpc::NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), Execution time: mean = 708.186 us, total = 708.186 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
Publisher.CheckDeadSubscribers - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
ray::rpc::WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), Execution time: mean = 1.432 ms, total = 1.432 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
ray::rpc::NodeInfoGcsService.grpc_client.GetAllNodeInfo.OnReplyReceived - 1 total (0 active), Execution time: mean = 220.475 us, total = 220.475 us, Queueing time: mean = 42.451 us, max = 42.451 us, min = 42.451 us, total = 42.451 us
-----------------
Task Event stats:
IO Service Stats:
Global stats: 181 total (1 active)
Queueing time: mean = 19.653 us, max = 47.535 us, min = 16.110 us, total = 3.557 ms
Execution time: mean = 414.400 us, total = 75.006 ms
Event stats:
ray::rpc::TaskInfoGcsService.grpc_client.AddTaskEventData.OnReplyReceived - 60 total (0 active), Execution time: mean = 39.095 us, total = 2.346 ms, Queueing time: mean = 27.441 us, max = 34.391 us, min = 17.460 us, total = 1.646 ms
ray::rpc::TaskInfoGcsService.grpc_client.AddTaskEventData - 60 total (0 active), Execution time: mean = 907.978 us, total = 54.479 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
CoreWorker.deadline_timer.flush_task_events - 60 total (1 active), Execution time: mean = 292.140 us, total = 17.528 ms, Queueing time: mean = 31.576 us, max = 47.535 us, min = 20.082 us, total = 1.895 ms
PeriodicalRunner.RunFnPeriodically - 1 total (0 active), Execution time: mean = 653.685 us, total = 653.685 us, Queueing time: mean = 16.110 us, max = 16.110 us, min = 16.110 us, total = 16.110 us
Other Stats:
grpc_in_progress:0
current number of task status events in buffer: 1
current number of profile events in buffer: 0
current number of dropped task attempts tracked: 0
total task events sent: 0.0121832 MiB
total number of task attempts sent: 31
total number of task attempts dropped reported: 0
total number of sent failure: 0
num status task events dropped: 0
num profile task events dropped: 0
[2024-08-08 01:57:50,673 I 1 429] core_worker.cc:642: Event stats:
Global stats: 1917 total (10 active)
Queueing time: mean = 44.478 us, max = 2.068 ms, min = -0.001 s, total = 85.265 ms
Execution time: mean = 136.243 ms, total = 261.178 s
Event stats:
CoreWorker.RecoverObjects - 1200 total (1 active), Execution time: mean = 8.847 us, total = 10.617 ms, Queueing time: mean = 51.162 us, max = 2.068 ms, min = -0.001 s, total = 61.394 ms
CoreWorker.InternalHeartbeat - 120 total (1 active), Execution time: mean = 321.092 us, total = 38.531 ms, Queueing time: mean = 47.180 us, max = 257.954 us, min = 21.935 us, total = 5.662 ms
NodeManagerService.grpc_client.ReportWorkerBacklog.OnReplyReceived - 120 total (0 active), Execution time: mean = 29.241 us, total = 3.509 ms, Queueing time: mean = 40.255 us, max = 249.176 us, min = 19.471 us, total = 4.831 ms
NodeManagerService.grpc_client.ReportWorkerBacklog - 120 total (0 active), Execution time: mean = 975.939 us, total = 117.113 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
NodeManagerService.grpc_client.RequestWorkerLease - 56 total (2 active), Execution time: mean = 3.674 s, total = 205.735 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
NodeManagerService.grpc_client.RequestWorkerLease.OnReplyReceived - 54 total (0 active), Execution time: mean = 748.419 us, total = 40.415 ms, Queueing time: mean = 39.787 us, max = 120.853 us, min = 17.900 us, total = 2.148 ms
CoreWorker.SubmitTask - 30 total (0 active), Execution time: mean = 465.793 us, total = 13.974 ms, Queueing time: mean = 87.100 us, max = 983.322 us, min = 28.980 us, total = 2.613 ms
NodeManagerService.grpc_client.PinObjectIDs.OnReplyReceived - 28 total (0 active), Execution time: mean = 81.911 us, total = 2.294 ms, Queueing time: mean = 42.575 us, max = 69.622 us, min = 24.121 us, total = 1.192 ms
NodeManagerService.grpc_client.PinObjectIDs - 28 total (0 active), Execution time: mean = 1.247 ms, total = 34.910 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
CoreWorkerService.grpc_client.PushTask - 28 total (1 active), Execution time: mean = 1.969 s, total = 55.132 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
NodeManagerService.grpc_client.ReturnWorker - 27 total (0 active), Execution time: mean = 1.063 ms, total = 28.703 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
CoreWorkerService.grpc_client.PushTask.OnReplyReceived - 27 total (0 active), Execution time: mean = 480.226 us, total = 12.966 ms, Queueing time: mean = 39.938 us, max = 61.812 us, min = 21.461 us, total = 1.078 ms
NodeManagerService.grpc_client.ReturnWorker.OnReplyReceived - 27 total (0 active), Execution time: mean = 22.061 us, total = 595.653 us, Queueing time: mean = 35.687 us, max = 55.481 us, min = 20.261 us, total = 963.562 us
CoreWorker.RecordMetrics - 24 total (1 active), Execution time: mean = 71.414 us, total = 1.714 ms, Queueing time: mean = 29.790 us, max = 55.356 us, min = 13.392 us, total = 714.959 us
CoreWorker.GCStreamingGeneratorMetadata - 12 total (1 active), Execution time: mean = 4.118 us, total = 49.411 us, Queueing time: mean = 29.552 us, max = 60.526 us, min = 19.380 us, total = 354.620 us
PeriodicalRunner.RunFnPeriodically - 6 total (0 active), Execution time: mean = 302.184 us, total = 1.813 ms, Queueing time: mean = 695.153 us, max = 1.600 ms, min = 18.940 us, total = 4.171 ms
CoreWorker.PrintEventStats - 2 total (1 active, 1 running), Execution time: mean = 328.132 us, total = 656.265 us, Queueing time: mean = 15.429 us, max = 30.858 us, min = 30.858 us, total = 30.858 us
ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), Execution time: mean = 661.555 us, total = 661.555 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch.OnReplyReceived - 1 total (0 active), Execution time: mean = 385.639 us, total = 385.639 us, Queueing time: mean = 41.841 us, max = 41.841 us, min = 41.841 us, total = 41.841 us
ray::rpc::WorkerInfoGcsService.grpc_client.AddWorkerInfo.OnReplyReceived - 1 total (0 active), Execution time: mean = 87.472 us, total = 87.472 us, Queueing time: mean = 26.940 us, max = 26.940 us, min = 26.940 us, total = 26.940 us
ray::rpc::NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), Execution time: mean = 708.186 us, total = 708.186 us, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
Publisher.CheckDeadSubscribers - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
ray::rpc::WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), Execution time: mean = 1.432 ms, total = 1.432 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
ray::rpc::NodeInfoGcsService.grpc_client.GetAllNodeInfo.OnReplyReceived - 1 total (0 active), Execution time: mean = 220.475 us, total = 220.475 us, Queueing time: mean = 42.451 us, max = 42.451 us, min = 42.451 us, total = 42.451 us
-----------------
Task Event stats:
IO Service Stats:
Global stats: 361 total (1 active)
Queueing time: mean = 23.327 us, max = 130.442 us, min = 16.110 us, total = 8.421 ms
Execution time: mean = 423.129 us, total = 152.750 ms
Event stats:
ray::rpc::TaskInfoGcsService.grpc_client.AddTaskEventData.OnReplyReceived - 120 total (0 active), Execution time: mean = 41.170 us, total = 4.940 ms, Queueing time: mean = 31.522 us, max = 53.871 us, min = 17.460 us, total = 3.783 ms
ray::rpc::TaskInfoGcsService.grpc_client.AddTaskEventData - 120 total (0 active), Execution time: mean = 914.645 us, total = 109.757 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
CoreWorker.deadline_timer.flush_task_events - 120 total (1 active), Execution time: mean = 311.651 us, total = 37.398 ms, Queueing time: mean = 38.520 us, max = 130.442 us, min = 20.082 us, total = 4.622 ms
PeriodicalRunner.RunFnPeriodically - 1 total (0 active), Execution time: mean = 653.685 us, total = 653.685 us, Queueing time: mean = 16.110 us, max = 16.110 us, min = 16.110 us, total = 16.110 us
Other Stats:
grpc_in_progress:0
current number of task status events in buffer: 0
current number of profile events in buffer: 0
current number of dropped task attempts tracked: 0
total task events sent: 0.0412445 MiB
total number of task attempts sent: 99
total number of task attempts dropped reported: 0
total number of sent failure: 0
num status task events dropped: 0
num profile task events dropped: 0
What happened + What you expected to happen
I received the error below when creating a remote actor task on a worker node. The actor is instantiated without issue, but when creating this remote task, inside of an actor, it fails before ray.get is called. The actor is decorated with
ray.remote
, the actor's method is not. When run on the head node, this issue does not occur.