ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
34.16k stars, 5.8k forks

[Bug] GCS fault tolerance unable to validate SSL certificates #41161

Open marco-aws opened 1 year ago

marco-aws commented 1 year ago

What happened + What you expected to happen

We are trying to configure KubeRay with an external Redis instance (AWS ElastiCache), but Ray fails with SSL errors when AUTH is enabled:

[2023-11-13 01:47:48,664 C 8 8] (ray_init) redis_context.cc:484:  Check failed: redisInitiateSSLWithContext(context_, ssl_context_) == REDIS_OK Failed to setup encrypted redis: SSL_connect failed: CERTIFICATE_VERIFY_FAILED

Versions / Dependencies

Helm values:

head:
[...]
  rayStartParams:
    redis-password: 'edited'
    dashboard-host: '0.0.0.0'
  containerEnv:
    - name: RAY_REDIS_ADDRESS
      value: "rediss://master.testray.ymfvee.euw1.cache.amazonaws.com:6379"
    - name: REDIS_PASSWORD
      value: 'edited'
[...]
  annotations:
    ray.io/ft-enabled: "true"

Ray versions tested: 2.7.0, 2.8.0
Redis engine: 7.0.7

Reproduction script

Testing connection from the ray pod:

root@raycluster-kuberay-head-6k54l:/home/ray/redis-stable# redis-cli -h master.testray.ymfvee.euw1.cache.amazonaws.com --tls -a 'edited' -p 6379
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
master.testray.ymfvee.euw1.cache.amazonaws.com:6379>
master.testray.ymfvee.euw1.cache.amazonaws.com:6379>
master.testray.ymfvee.euw1.cache.amazonaws.com:6379> hello
 1) "server"
 2) "redis"
 3) "version"
 4) "7.0.7"
 5) "proto"
 6) (integer) 2
 7) "id"
 8) (integer) 21947
 9) "mode"
10) "standalone"
11) "role"
12) "master"
13) "modules"
14) (empty array)
master.testray.ymfvee.euw1.cache.amazonaws.com:6379> ping
PONG
master.testray.ymfvee.euw1.cache.amazonaws.com:6379>

Anything else

  1. Ray starts correctly if I disable AUTH on Redis (ElastiCache), i.e. with no password set
  2. I can connect to the Redis instance from the ray pod using redis-cli
  3. DNS is resolving correctly

stacktrace info:

[2023-11-13 01:47:48,664 C 8 8] (ray_init) redis_context.cc:484:  Check failed: redisInitiateSSLWithContext(context_, ssl_context_) == REDIS_OK Failed to setup encrypted redis: SSL_connect failed: CERTIFICATE_VERIFY_FAILED
*** StackTrace Information ***
/home/ray/anaconda3/lib/python3.8/site-packages/ray/_raylet.so(+0xf341ba) [0x7f7fa7ff11ba] ray::operator<<()
/home/ray/anaconda3/lib/python3.8/site-packages/ray/_raylet.so(+0xf35ca2) [0x7f7fa7ff2ca2] ray::SpdLogMessage::Flush()
/home/ray/anaconda3/lib/python3.8/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x37) [0x7f7fa7ff2fb7] ray::RayLog::~RayLog()
/home/ray/anaconda3/lib/python3.8/site-packages/ray/_raylet.so(_ZN3ray3gcs12RedisContext7ConnectERKSsibS3_b+0xc17) [0x7f7fa7983a77] ray::gcs::RedisContext::Connect()
/home/ray/anaconda3/lib/python3.8/site-packages/ray/_raylet.so(_ZN3ray3gcs11RedisClient7ConnectESt6vectorIP23instrumented_io_contextSaIS4_EE+0x20f) [0x7f7fa7979aef] ray::gcs::RedisClient::Connect()
/home/ray/anaconda3/lib/python3.8/site-packages/ray/_raylet.so(_ZN3ray3gcs11RedisClient7ConnectER23instrumented_io_context+0xbc) [0x7f7fa797ad3c] ray::gcs::RedisClient::Connect()
/home/ray/anaconda3/lib/python3.8/site-packages/ray/_raylet.so(_ZN3ray3gcs15RedisGetKeySyncERKSsiS2_bS2_S2_PSs+0x311) [0x7f7fa771fec1] ray::gcs::RedisGetKeySync()
/home/ray/anaconda3/lib/python3.8/site-packages/ray/_raylet.so(+0x6638de) [0x7f7fa77208de] __pyx_pw_3ray_7_raylet_29get_session_key_from_storage()
/home/ray/anaconda3/bin/python(PyCFunction_Call+0x52) [0x4dfd82] PyCFunction_Call
/home/ray/anaconda3/bin/python(_PyObject_MakeTpCall+0x3eb) [0x4d0c5b] _PyObject_MakeTpCall
/home/ray/anaconda3/bin/python(_PyEval_EvalFrameDefault+0x4f58) [0x4cbcf8] _PyEval_EvalFrameDefault
/home/ray/anaconda3/bin/python(_PyFunction_Vectorcall+0x106) [0x4d9d16] _PyFunction_Vectorcall
/home/ray/anaconda3/bin/python(_PyEval_EvalFrameDefault+0xade) [0x4c787e] _PyEval_EvalFrameDefault
/home/ray/anaconda3/bin/python(_PyEval_EvalCodeWithName+0x1f5) [0x4c5c45] _PyEval_EvalCodeWithName
/home/ray/anaconda3/bin/python(_PyFunction_Vectorcall+0x19c) [0x4d9dac] _PyFunction_Vectorcall
/home/ray/anaconda3/bin/python(_PyObject_FastCallDict+0x25f) [0x4d028f] _PyObject_FastCallDict
/home/ray/anaconda3/bin/python() [0x4e401a] slot_tp_init
/home/ray/anaconda3/bin/python(_PyObject_MakeTpCall+0x34b) [0x4d0bbb] _PyObject_MakeTpCall
/home/ray/anaconda3/bin/python(_PyEval_EvalFrameDefault+0x55fd) [0x4cc39d] _PyEval_EvalFrameDefault
/home/ray/anaconda3/bin/python(_PyEval_EvalCodeWithName+0x1f5) [0x4c5c45] _PyEval_EvalCodeWithName
/home/ray/anaconda3/bin/python(_PyFunction_Vectorcall+0x19c) [0x4d9dac] _PyFunction_Vectorcall
/home/ray/anaconda3/bin/python(PyObject_Call+0x5e) [0x4ec16e] PyObject_Call
/home/ray/anaconda3/bin/python(_PyEval_EvalFrameDefault+0x2051) [0x4c8df1] _PyEval_EvalFrameDefault
/home/ray/anaconda3/bin/python(_PyEval_EvalCodeWithName+0x1f5) [0x4c5c45] _PyEval_EvalCodeWithName
/home/ray/anaconda3/bin/python(_PyFunction_Vectorcall+0x19c) [0x4d9dac] _PyFunction_Vectorcall
/home/ray/anaconda3/bin/python(PyObject_Call+0x5e) [0x4ec16e] PyObject_Call
/home/ray/anaconda3/bin/python(_PyEval_EvalFrameDefault+0x2051) [0x4c8df1] _PyEval_EvalFrameDefault
/home/ray/anaconda3/bin/python(_PyEval_EvalCodeWithName+0x1f5) [0x4c5c45] _PyEval_EvalCodeWithName
/home/ray/anaconda3/bin/python(_PyFunction_Vectorcall+0x19c) [0x4d9dac] _PyFunction_Vectorcall
/home/ray/anaconda3/bin/python() [0x4e8197] method_vectorcall
/home/ray/anaconda3/bin/python(PyObject_Call+0x5e) [0x4ec16e] PyObject_Call
/home/ray/anaconda3/bin/python(_PyEval_EvalFrameDefault+0x2051) [0x4c8df1] _PyEval_EvalFrameDefault
/home/ray/anaconda3/bin/python(_PyFunction_Vectorcall+0x106) [0x4d9d16] _PyFunction_Vectorcall
/home/ray/anaconda3/bin/python(_PyEval_EvalFrameDefault+0xade) [0x4c787e] _PyEval_EvalFrameDefault
/home/ray/anaconda3/bin/python(_PyEval_EvalCodeWithName+0x1f5) [0x4c5c45] _PyEval_EvalCodeWithName
/home/ray/anaconda3/bin/python(_PyFunction_Vectorcall+0x19c) [0x4d9dac] _PyFunction_Vectorcall
[2023-11-13 01:47:48,664 C 8 8] (ray_init) redis_context.cc:484:  Check failed: redisInitiateSSLWithContext(context_, ssl_context_) == REDIS_OK Failed to setup encrypted redis: SSL_connect failed: CERTIFICATE_VERIFY_FAILED
*** StackTrace Information ***

Issue Severity

Medium: It is a significant difficulty but I can work around it.

jjyao commented 1 year ago

@marco-aws I think you need to set the right certificate. Try setting RAY_REDIS_CA_CERT=/etc/ssl/certs/ca-certificates.crt and see if it works.

marco-aws commented 1 year ago

It works, thanks @jjyao. Should I raise a PR to add this to the documentation?
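
For reference, here is a rough sketch of how the suggested setting might look in the Helm values from the original report. The thread does not say where the variable was actually set, so placing RAY_REDIS_CA_CERT under containerEnv is an assumption. The path is the one suggested above and assumes the Ray image ships a system CA bundle at that location.

```yaml
head:
  containerEnv:
    - name: RAY_REDIS_ADDRESS
      value: "rediss://master.testray.ymfvee.euw1.cache.amazonaws.com:6379"
    - name: REDIS_PASSWORD
      value: 'edited'
    # CA bundle used to verify the Redis server certificate.
    # Assumption: set alongside the other Redis variables in containerEnv;
    # the path must exist inside the Ray container image.
    - name: RAY_REDIS_CA_CERT
      value: "/etc/ssl/certs/ca-certificates.crt"
```

The other keys from the original values (rayStartParams, annotations) would stay unchanged.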

jjyao commented 12 months ago

@marco-aws That would be great!

letaoj commented 1 month ago

@marco-aws Hi Marco, I'm trying to do the same thing: use Ray with a TLS-enabled ElastiCache instance. I'm stuck on the cert generation part. Where did you generate the certs, or is something already available on an EC2 host?