Open robertnishihara opened 1 year ago
Posting this since I ran into it and in case we're doing something incorrectly here.
Output from running the gcs server directly
$ /home/runner/HightechUniqueProgrammingmacro/venv/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server
[2023-08-05 22:54:06,501 I 14610 14610] (gcs_server) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2023-08-05 22:54:06,503 I 14610 14610] (gcs_server) gcs_server.cc:74: GCS storage type is StorageType::IN_MEMORY
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:44: Loading job table data.
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:56: Loading node table data.
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:68: Loading cluster resources table data.
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:95: Loading actor table data.
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:108: Loading actor task spec table data.
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:81: Loading placement group table data.
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:48: Finished loading job table data, size = 0
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:60: Finished loading node table data, size = 0
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:72: Finished loading cluster resources table data, size = 0
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:99: Finished loading actor table data, size = 0
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:112: Finished loading actor task spec table data, size = 0
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_init_data.cc:86: Finished loading placement group table data, size = 0
[2023-08-05 22:54:06,504 I 14610 14610] (gcs_server) gcs_server.cc:164: No existing server cluster ID found. Generating new ID: 77040f672047927100a0eb82ccc1e2fd89eaeab9536593fcd89e37a2
[2023-08-05 22:54:06,505 I 14610 14610] (gcs_server) gcs_server.cc:658: Autoscaler V2 enabled: 0
[2023-08-05 22:54:06,507 I 14610 14610] (gcs_server) grpc_server.cc:129: GcsServer server started, listening on port 35893.
[2023-08-05 22:54:06,568 I 14610 14610] (gcs_server) gcs_server.cc:255: GcsNodeManager:
- RegisterNode request count: 0
- DrainNode request count: 0
- GetAllNodeInfo request count: 0
- GetInternalConfig request count: 0
GcsActorManager:
- RegisterActor request count: 0
- CreateActor request count: 0
- GetActorInfo request count: 0
- GetNamedActorInfo request count: 0
- GetAllActorInfo request count: 0
- KillActor request count: 0
- ListNamedActors request count: 0
- Registered actors count: 0
- Destroyed actors count: 0
- Named actors count: 0
- Unresolved actors count: 0
- Pending actors count: 0
- Created actors count: 0
- owners_: 0
- actor_to_register_callbacks_: 0
- actor_to_create_callbacks_: 0
- sorted_destroyed_actor_list_: 0
GcsResourceManager:
- GetResources request count: 0
- GetAllAvailableResources request count0
- ReportResourceUsage request count: 0
- GetAllResourceUsage request count: 0
GcsPlacementGroupManager:
- CreatePlacementGroup request count: 0
- RemovePlacementGroup request count: 0
- GetPlacementGroup request count: 0
- GetAllPlacementGroup request count: 0
- WaitPlacementGroupUntilReady request count: 0
- GetNamedPlacementGroup request count: 0
- Scheduling pending placement group count: 0
- Registered placement groups count: 0
- Named placement group count: 0
- Pending placement groups count: 0
- Infeasible placement groups count: 0
GcsPublisher {}
[runtime env manager] ID to URIs table:
[runtime env manager] URIs reference table:
GcsTaskManager:
-Total num task events reported: 0
-Total num status task events dropped: 0
-Total num profile events dropped: 0
-Total num bytes of task event stored: 0MiB
-Current num of task events stored: 0
-Total num of actor creation tasks: 0
-Total num of actor tasks: 0
-Total num of normal tasks: 0
-Total num of driver tasks: 0
[2023-08-05 22:54:06,568 I 14610 14610] (gcs_server) gcs_server.cc:872: Event stats:
Global stats: 30 total (16 active)
Queueing time: mean = 6.392 ms, max = 63.767 ms, min = 2.712 us, total = 191.748 ms
Execution time: mean = 2.130 ms, total = 63.912 ms
Event stats:
GcsInMemoryStore.GetAll - 6 total (0 active), CPU time: mean = 14.297 us, total = 85.780 us
InternalKVGcsService.grpc_client.InternalKVPut - 6 total (6 active), CPU time: mean = 0.000 s, total = 0.000 s
InternalKVGcsService.grpc_server.InternalKVPut - 6 total (4 active), CPU time: mean = 3.946 us, total = 23.673 us
GcsInMemoryStore.Put - 5 total (2 active), CPU time: mean = 12.756 ms, total = 63.779 ms
PeriodicalRunner.RunFnPeriodically - 4 total (2 active, 1 running), CPU time: mean = 1.605 us, total = 6.422 us
GcsInMemoryStore.Get - 1 total (0 active), CPU time: mean = 17.155 us, total = 17.155 us
UNKNOWN - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
RayletLoadPulled - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[2023-08-05 22:54:06,568 I 14610 14610] (gcs_server) gcs_server.cc:873: GcsTaskManager Event stats:
Global stats: 0 total (0 active)
Queueing time: mean = -nan s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
Execution time: mean = -nan s, total = 0.000 s
Event stats:
[2023-08-05 22:54:16,513 W 14610 14611] (gcs_server) metric_exporter.cc:212: [1] Export metrics to agent failed: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: . This won't affect Ray, but you can lose metrics from the cluster.
[2023-08-05 22:55:06,569 I 14610 14610] (gcs_server) gcs_server.cc:255: GcsNodeManager:
- RegisterNode request count: 0
- DrainNode request count: 0
- GetAllNodeInfo request count: 0
- GetInternalConfig request count: 0
GcsActorManager:
- RegisterActor request count: 0
- CreateActor request count: 0
- GetActorInfo request count: 0
- GetNamedActorInfo request count: 0
- GetAllActorInfo request count: 0
- KillActor request count: 0
- ListNamedActors request count: 0
- Registered actors count: 0
- Destroyed actors count: 0
- Named actors count: 0
- Unresolved actors count: 0
- Pending actors count: 0
- Created actors count: 0
- owners_: 0
- actor_to_register_callbacks_: 0
- actor_to_create_callbacks_: 0
- sorted_destroyed_actor_list_: 0
GcsResourceManager:
- GetResources request count: 0
- GetAllAvailableResources request count0
- ReportResourceUsage request count: 0
- GetAllResourceUsage request count: 0
GcsPlacementGroupManager:
- CreatePlacementGroup request count: 0
- RemovePlacementGroup request count: 0
- GetPlacementGroup request count: 0
- GetAllPlacementGroup request count: 0
- WaitPlacementGroupUntilReady request count: 0
- GetNamedPlacementGroup request count: 0
- Scheduling pending placement group count: 0
- Registered placement groups count: 0
- Named placement group count: 0
- Pending placement groups count: 0
- Infeasible placement groups count: 0
GcsPublisher {}
[runtime env manager] ID to URIs table:
[runtime env manager] URIs reference table:
GcsTaskManager:
-Total num task events reported: 0
-Total num status task events dropped: 0
-Total num profile events dropped: 0
-Total num bytes of task event stored: 0MiB
-Current num of task events stored: 0
-Total num of actor creation tasks: 0
-Total num of actor tasks: 0
-Total num of normal tasks: 0
-Total num of driver tasks: 0
[2023-08-05 22:55:06,569 I 14610 14610] (gcs_server) gcs_server.cc:872: Event stats:
Global stats: 317 total (4 active)
Queueing time: mean = 670.068 us, max = 63.767 ms, min = 2.712 us, total = 212.411 ms
Execution time: mean = 215.844 us, total = 68.422 ms
Event stats:
GcsInMemoryStore.Put - 75 total (0 active), CPU time: mean = 863.273 us, total = 64.745 ms
InternalKVGcsService.grpc_client.InternalKVPut - 72 total (0 active), CPU time: mean = 11.649 us, total = 838.705 us
InternalKVGcsService.grpc_server.InternalKVPut - 72 total (0 active), CPU time: mean = 10.209 us, total = 735.036 us
RayletLoadPulled - 60 total (1 active), CPU time: mean = 6.621 us, total = 397.239 us
UNKNOWN - 20 total (1 active), CPU time: mean = 7.575 us, total = 151.510 us
GCSServer.deadline_timer.debug_state_dump - 6 total (1 active), CPU time: mean = 174.848 us, total = 1.049 ms
GcsInMemoryStore.GetAll - 6 total (0 active), CPU time: mean = 14.297 us, total = 85.780 us
PeriodicalRunner.RunFnPeriodically - 4 total (0 active), CPU time: mean = 100.618 us, total = 402.474 us
GcsInMemoryStore.Get - 1 total (0 active), CPU time: mean = 17.155 us, total = 17.155 us
GCSServer.deadline_timer.debug_state_event_stats_print - 1 total (1 active, 1 running), CPU time: mean = 0.000 s, total = 0.000 s
[2023-08-05 22:55:06,569 I 14610 14610] (gcs_server) gcs_server.cc:873: GcsTaskManager Event stats:
Global stats: 0 total (0 active)
Queueing time: mean = -nan s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
Execution time: mean = -nan s, total = 0.000 s
Event stats:
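For anyone sifting through output like the above, here is a minimal sketch of a parser for the bracketed log prefix. The layout (`[timestamp level pid tid] (component) file:line: message`) is inferred only from the lines in this issue, not from Ray's documentation, and the names `LogRecord`, `parse_log_line`, and `warnings_and_errors` are hypothetical helpers, not part of Ray. Only the `I` and `W` levels appear in this log; treating `E` and `F` as more severe levels is an assumption.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumed layout, inferred from the log lines in this issue:
# [YYYY-MM-DD HH:MM:SS,mmm LEVEL PID TID] (component) file.cc:LINE: message
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]) (?P<pid>\d+) (?P<tid>\d+)\] "
    r"\((?P<component>[^)]+)\) "
    r"(?P<source>[^:\s]+):(?P<line>\d+): "
    r"(?P<message>.*)$"
)


class LogRecord(NamedTuple):
    timestamp: str
    level: str
    pid: int
    tid: int
    component: str
    source: str
    line: int
    message: str


def parse_log_line(text: str) -> Optional[LogRecord]:
    """Return a LogRecord for a prefixed log line, or None for continuation lines."""
    match = LOG_LINE.match(text.rstrip("\n"))
    if match is None:
        # Lines such as "- RegisterNode request count: 0" carry no prefix.
        return None
    return LogRecord(
        timestamp=match["timestamp"],
        level=match["level"],
        pid=int(match["pid"]),
        tid=int(match["tid"]),
        component=match["component"],
        source=match["source"],
        line=int(match["line"]),
        message=match["message"],
    )


def warnings_and_errors(lines: Iterable[str]) -> List[LogRecord]:
    """Keep only records above info level (E and F are assumed level letters)."""
    records = (parse_log_line(line) for line in lines)
    return [r for r in records if r is not None and r.level in ("W", "E", "F")]
```

Run over the output above, this would surface only the `metric_exporter.cc:212` warning, which makes it easier to see that the directly started server logged nothing worse than a metrics export failure.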
@jjyao are you picking this up for Ray 2.9?
Reviewed - we weren't able to repro this. @jjyao to re-verify; if that's still true, then we can close.
it says this now:
/home/runner/raytest/.pythonlibs/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server: error while loading shared libraries: /home/runner/raytest/.pythonlibs/lib/python3.10/site-packages/ray/core/libjemalloc.so: cannot allocate memory in static TLS block
What happened + What you expected to happen
I tried running Ray inside of Replit.
Replit installed Ray 2.6.2.
Running
ray.init()
in a Python interpreter leads to the GCS failing to start.
Interestingly, starting the GCS directly seems to work, although starting it from Python doesn't.
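One way to narrow down the difference between the two launch paths is to start the same binary from Python with `subprocess` and capture whatever it prints before exiting. The following is a rough sketch, not how Ray itself launches the GCS: `run_and_capture` is a hypothetical helper, the binary path is supplied by the caller, and the real launch passes arguments and environment that are not reproduced here.

```python
import subprocess
import sys
from typing import List, Optional, Tuple


def run_and_capture(cmd: List[str], timeout_s: float = 5.0) -> Tuple[Optional[int], str]:
    """Run cmd and return (exit_code, combined stdout/stderr).

    exit_code is None if the process was still alive at the timeout and had
    to be killed, which for a server usually means it started successfully.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # loader errors go to stderr, so merge them in
        text=True,
    )
    try:
        output, _ = proc.communicate(timeout=timeout_s)
        return proc.returncode, output
    except subprocess.TimeoutExpired:
        proc.kill()
        output, _ = proc.communicate()
        return None, output


if __name__ == "__main__":
    # Pass the path to the gcs_server binary on the command line; nothing is
    # assumed about where it lives.
    code, out = run_and_capture(sys.argv[1:])
    print("exit code:", code)
    print(out)
```

If the binary exits immediately with a non-zero code and a loader message, the failure happens before the server starts; if the call times out, the process stayed up for the whole window.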
Versions / Dependencies
Ray 2.6.2, Python 3.10.8, Ubuntu 20.04.2
Reproduction script
Just
Issue Severity
None