What happened + What you expected to happen
GcsClient is not managed well in CoreWorker, which makes managing the GCS connection hard. Right now, we have GcsClient instances in Pubsub/CoreWorker/Py. Even after the migration to the cpp-based GCS client, the creation logic is still scattered across these components instead of living in one place.
A hotfix is to create a global singleton and reuse it: https://github.com/ray-project/ray/pull/35624
This works to some extent, but it doesn't give us the flexibility to shut down the GCS client channel.
A better approach is to create the channel in one central place and pass it around to the other endpoints.
Ideally, all GCS-based services should be initialized in CoreWorker and simply passed to Python through a Cython API.
When CoreWorker is shut down, we should just shut down everything.
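To make the proposed ownership model concrete, here is a minimal Python sketch. It is not Ray code: `FakeChannel`, `FakeGcsClient`, `FakeSubscriber`, and `FakeCoreWorker` are hypothetical stand-ins, and the address is a placeholder. It only illustrates the idea that one owner creates the channel, hands it to every GCS-based service, and closes it once on shutdown.

```python
class FakeChannel:
    """Hypothetical stand-in for the connection to the GCS (not a Ray class)."""

    def __init__(self, address):
        self.address = address
        self.closed = False

    def close(self):
        self.closed = True


class FakeGcsClient:
    """Hypothetical GCS client that borrows a channel instead of creating one."""

    def __init__(self, channel):
        self._channel = channel

    def ping(self):
        if self._channel.closed:
            raise RuntimeError("GCS channel is closed")
        return f"ok:{self._channel.address}"


class FakeSubscriber:
    """Hypothetical pubsub endpoint sharing the same channel."""

    def __init__(self, channel):
        self._channel = channel

    def is_connected(self):
        return not self._channel.closed


class FakeCoreWorker:
    """Hypothetical owner: creates the channel once and passes it around."""

    def __init__(self, gcs_address):
        self._channel = FakeChannel(gcs_address)
        # Every GCS-based service receives the same channel object.
        self.gcs_client = FakeGcsClient(self._channel)
        self.subscriber = FakeSubscriber(self._channel)

    def shutdown(self):
        # Closing the single shared channel disconnects every service at once.
        self._channel.close()


worker = FakeCoreWorker("127.0.0.1:6379")  # placeholder address
print(worker.gcs_client.ping())
worker.shutdown()
print(worker.subscriber.is_connected())
```

Compared with a global singleton, the owner here controls the channel's lifetime explicitly, so shutting down the worker also shuts down every endpoint that depends on the channel.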
Related issue: https://github.com/ray-project/ray/issues/35681
Versions / Dependencies
master
Reproduction script
In code.
Issue Severity
None