ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
34.12k stars 5.79k forks

[Core] Bad traceback on failure to reconnect to GCS server. #15235

Open clarkzinzow opened 3 years ago

clarkzinzow commented 3 years ago

What is the problem?

When a GCS client RPC fails, the client attempts to reconnect to the GCS server. If that reconnection also fails, the client logs a fatal error that includes the last attempted GCS address and port. However, multiple such crashes have been reported in which the logged GCS address is the empty string and the port is 0, suggesting a bug in this logic:

2021-03-22T21:35:51Z o F0322 14:35:51.321081     9   315 service_based_gcs_client.cc:207] Couldn't reconnect to GCS server. The last attempted GCS server address was :0

Reproduction (REQUIRED)

TODO: If not an easy fix, get a minimal reproduction.

clarkzinzow commented 3 years ago

I'm not sure this is an actual bug. It may simply be what happens when Redis is not reachable: https://github.com/ray-project/ray/blob/be62444bc5924c61d69bb6aec62f967e531e768c/src/ray/gcs/gcs_client/service_based_gcs_client.cc#L112-L149

That should result in the GCS address being unset, i.e. an empty string.
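
To illustrate how an unset address could end up rendered as `:0` in the fatal message, here is a minimal Python sketch of the suspected flow. It is not Ray's implementation: `reconnect`, `lookup_address`, `ping`, and `unreachable_store` are hypothetical names, and the assumption is that the client looks up the GCS address from a store (Redis in the real system) and that a failed lookup leaves the address at its default value of an empty host and port 0.

```python
from typing import Callable, Optional, Tuple

Address = Tuple[str, int]


def format_address(address: Address) -> str:
    """Render an address as host:port, as in the fatal log message."""
    host, port = address
    return f"{host}:{port}"


def reconnect(
    lookup_address: Callable[[], Optional[Address]],
    ping: Callable[[Address], bool],
    max_attempts: int = 3,
) -> str:
    """Hypothetical reconnect loop; returns "connected" or the fatal message."""
    last_attempted: Address = ("", 0)  # assumed default for an unset address
    for _ in range(max_attempts):
        found = lookup_address()
        if found is None:
            # The lookup itself failed (e.g. the store is unreachable), so
            # the last attempted address keeps its default value.
            continue
        last_attempted = found
        if ping(found):
            return "connected"
    return (
        "Couldn't reconnect to GCS server. "
        f"The last attempted GCS server address was {format_address(last_attempted)}"
    )


def unreachable_store() -> Optional[Address]:
    """Stand-in for an address lookup against an unreachable store."""
    return None


message = reconnect(unreachable_store, ping=lambda addr: False)
print(message)
```

Under these assumptions, the message ends in `:0` whenever every address lookup fails, which matches the log line above. If that is what is happening, the crash itself may be expected and only the message is misleading, since it reports an address that was never actually attempted.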