ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

False error while connecting to external Redis with TLS #46994

Open yashtc opened 3 months ago

yashtc commented 3 months ago

What happened + What you expected to happen

We are running a Ray cluster in Kubernetes (without the KubeRay operator). We have configured an external Redis instance for GCS fault tolerance. However, when we enable TLS on Redis, the Ray Head fails to initialize with the error Check failed: _s.ok() Bad status: RedisError: Success (full trace and logs below).

We believe this could be a false error raised by the code below. The reply from Redis seems to be Success, but the code seems to treat anything other than OK as an error.

const std::string status_str(redis_reply->str, redis_reply->len);
if (status_str == "OK") {
  status_reply_ = Status::OK();
} else {
  status_reply_ = Status::RedisError(status_str);
}
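To make the suspected behavior concrete, here is a minimal Python model of the branch above. It is only an illustration: parse_status_reply and the (ok, message) tuple it returns are hypothetical stand-ins for the C++ code, not Ray APIs. Under this logic, any status string other than the exact text OK, including Success, would be reported as a RedisError.

```python
from __future__ import annotations


def parse_status_reply(status_str: str) -> tuple[bool, str]:
    """Hypothetical Python model of the C++ branch quoted above.

    Returns (ok, message). Only the exact string "OK" counts as success.
    """
    if status_str == "OK":
        return True, "OK"
    # Anything else is reported as a Redis error carrying the raw reply text.
    return False, f"RedisError: {status_str}"


if __name__ == "__main__":
    # Print how a few candidate status strings would be classified.
    for reply in ("OK", "Success", "ok"):
        print(reply, "->", parse_status_reply(reply))
```

The comparison is exact and case-sensitive, so the model classifies both Success and lowercase ok as errors.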

Setup details:

We are running Ray Serve version 2.9.2

Ray Head is started using the following command:

$> ray start --head --disable-usage-stats --port=6379 --num-cpus=0 --object-store-memory=200000000 --resources="{\"accelerator_type_cpu\": 2}" --dashboard-host=0.0.0.0 --metrics-export-port=9080 --dashboard-port=8265 --dashboard-agent-listen-port=52365 --redis-password=<redis_password> --block

We have configured the following env variables to enable SSL:

RAY_REDIS_ADDRESS=rediss://service-redis-master:6379
RAY_REDIS_ENABLE_SSL=True
RAY_REDIS_CA_CERT=/redis/certs/ca.crt
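One way to check the TLS path to Redis independently of Ray is to open a TLS socket with the same CA certificate and send a PING by hand. The following is a rough sketch using only the Python standard library. The helper names encode_command and tls_ping are made up for this example, and it assumes the server certificate is valid for the hostname being dialed and that no client certificate is required.

```python
import socket
import ssl


def encode_command(*args: str) -> bytes:
    """Encode a Redis command as a RESP array of bulk strings."""
    parts = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        data = arg.encode()
        parts.append(b"$" + str(len(data)).encode() + b"\r\n" + data + b"\r\n")
    return b"".join(parts)


def tls_ping(host, port, ca_cert, password=None, timeout=5.0):
    """Connect over TLS, optionally AUTH, then PING. Returns the raw replies.

    Hypothetical diagnostic helper, not part of Ray or any Redis client.
    """
    # The default context verifies the certificate chain against ca_cert
    # and checks that the certificate matches `host`.
    context = ssl.create_default_context(cafile=ca_cert)
    replies = []
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            if password:
                tls.sendall(encode_command("AUTH", password))
                replies.append(tls.recv(4096))
            tls.sendall(encode_command("PING"))
            replies.append(tls.recv(4096))
    return replies


# Example call (needs network access to the Redis service), using the
# host, port, and CA path from the environment variables above:
# print(tls_ping("service-redis-master", 6379, "/redis/certs/ca.crt"))
```

If the handshake itself fails, wrap_socket raises an ssl.SSLError, which points to a certificate or TLS configuration problem and not to a problem with the Redis reply.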

Below are the logs from Ray Head:

NUM_CPU is: 0
RAY_CUSTOM_RESOURCE is: {"accelerator_type_cpu": 2}
[2024-08-06 17:33:40,161 C 7 7] (ray_init) redis_context.cc:487:  Check failed: _s.ok() Bad status: RedisError: Success
*** StackTrace Information ***
/pyenv/versions/3.9.17/lib/python3.9/site-packages/ray/_raylet.so(+0xff295a) [0x7f2b3209e95a] ray::operator<<()
/pyenv/versions/3.9.17/lib/python3.9/site-packages/ray/_raylet.so(+0xff4217) [0x7f2b320a0217] ray::SpdLogMessage::Flush()
/pyenv/versions/3.9.17/lib/python3.9/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x37) [0x7f2b320a06b7] ray::RayLog::~RayLog()
/pyenv/versions/3.9.17/lib/python3.9/site-packages/ray/_raylet.so(_ZN3ray3gcs12RedisContext7ConnectERKSsibS3_b+0x3d7) [0x7f2b3198fd77] ray::gcs::RedisContext::Connect()
/pyenv/versions/3.9.17/lib/python3.9/site-packages/ray/_raylet.so(_ZN3ray3gcs11RedisClient7ConnectESt6vectorIP23instrumented_io_contextSaIS4_EE+0x20f) [0x7f2b3198661f] ray::gcs::RedisClient::Connect()
/pyenv/versions/3.9.17/lib/python3.9/site-packages/ray/_raylet.so(_ZN3ray3gcs11RedisClient7ConnectER23instrumented_io_context+0xbc) [0x7f2b3198786c] ray::gcs::RedisClient::Connect()
/pyenv/versions/3.9.17/lib/python3.9/site-packages/ray/_raylet.so(_ZN3ray3gcs15RedisGetKeySyncERKSsiS2_bS2_S2_PSs+0x2ea) [0x7f2b3173909a] ray::gcs::RedisGetKeySync()
/pyenv/versions/3.9.17/lib/python3.9/site-packages/ray/_raylet.so(+0x68d9fe) [0x7f2b317399fe] __pyx_pw_3ray_7_raylet_31get_session_key_from_storage()
/pyenv/versions/3.9.17/lib/libpython3.9.so.1.0(+0x106653) [0x7f2b332a9653] cfunction_call
[2024-08-06 17:33:40,161 C 7 7] (ray_init) redis_context.cc:487:  Check failed: _s.ok() Bad status: RedisError: Success
*** StackTrace Information ***
/pyenv/versions/3.9.17/lib/python3.9/site-packages/ray/_raylet.so(+0xff295a) [0x7f2b3209e95a] ray::operator<<()
/pyenv/versions/3.9.17/lib/python3.9/site-packages/ray/_raylet.so(+0xff4217) [0x7f2b320a0217] ray::SpdLogMessage::Flush()
/pyenv/versions/3.9.17/lib/python3.9/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x37) [0x7f2b320a06b7] ray::RayLog::~RayLog()
/pyenv/versions/3.9.17/lib/python3.9/site-packages/ray/_raylet.so(_ZN3ray3gcs12RedisContext7ConnectERKSsibS3_b+0x3d7) [0x7f2b3198fd77] ray::gcs::RedisContext::Connect()
/pyenv/versions/3.9.17/lib/python3.9/site-packages/ray/_raylet.so(_ZN3ray3gcs11RedisClient7ConnectESt6vectorIP23instrumented_io_contextSaIS4_EE+0x20f) [0x7f2b3198661f] ray::gcs::RedisClient::Connect()
/pyenv/versions/3.9.17/lib/python3.9/site-packages/ray/_raylet.so(_ZN3ray3gcs11RedisClient7ConnectER23instrumented_io_context+0xbc) [0x7f2b3198786c] ray::gcs::RedisClient::Connect()
/pyenv/versions/3.9.17/lib/python3.9/site-packages/ray/_raylet.so(_ZN3ray3gcs15RedisGetKeySyncERKSsiS2_bS2_S2_PSs+0x2ea) [0x7f2b3173909a] ray::gcs::RedisGetKeySync()
/pyenv/versions/3.9.17/lib/python3.9/site-packages/ray/_raylet.so(+0x68d9fe) [0x7f2b317399fe] __pyx_pw_3ray_7_raylet_31get_session_key_from_storage()

Versions / Dependencies

Ray Serve version 2.9.2

Reproduction script

RAY_REDIS_ADDRESS=rediss://service-redis-master:6379 RAY_REDIS_ENABLE_SSL=True RAY_REDIS_CA_CERT=/redis/certs/ca.crt ray start --head --disable-usage-stats --port=6379 --num-cpus=1 --object-store-memory=200000000 --resources="{\"accelerator_type_cpu\": 2}" --dashboard-host=0.0.0.0 --metrics-export-port=9080 --dashboard-port=8265 --dashboard-agent-listen-port=52365 --redis-password=<redis_password> --block

Issue Severity

High: It blocks me from completing my task.

anyscalesam commented 3 months ago

When does Redis reply with Success?