ray-project / kuberay

A toolkit to run Ray applications on Kubernetes

[RayService] [GCS FT] Worker nodes don't serve traffic while head node is down #615

Open · shrekris-anyscale opened this issue 2 years ago

shrekris-anyscale commented 2 years ago

KubeRay Component

Others

What happened + What you expected to happen

I had a Kubernetes cluster on GKE with 2 nodes that was running a RayService. It had 2 worker pods and 1 head pod. It also had a 1-node Redis cluster configured to support GCS Fault Tolerance:

$ kubectl get pods -o wide

NAME                                                      READY   STATUS    RESTARTS      AGE     IP           NODE                                        NOMINATED NODE   READINESS GATES
ervice-sample-raycluster-thwmr-worker-small-group-6f2pk   1/1     Running   0             6m59s   10.68.2.64   gke-serve-demo-default-pool-ed597cce-nvm2   <none>           <none>
ervice-sample-raycluster-thwmr-worker-small-group-bdv6q   1/1     Running   0             79m     10.68.2.62   gke-serve-demo-default-pool-ed597cce-nvm2   <none>           <none>
rayservice-sample-raycluster-thwmr-head-28mdh             1/1     Running   1 (79m ago)   79m     10.68.0.45   gke-serve-demo-default-pool-ed597cce-pu2q   <none>           <none>
redis-75c8b8b65d-4qgfz                                    1/1     Running   0             79m     10.68.2.60   gke-serve-demo-default-pool-ed597cce-nvm2   <none>           <none>
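
To sanity-check that the Redis backing store is reachable (a minimal sketch, not part of the original report; the password comes from the redis.conf in the reproduction config below, and redis-cli may warn about passing -a on the command line):

$ kubectl exec deploy/redis -- redis-cli -a 5241590000000000 ping
PONG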

I started a port-forward to a worker pod and successfully got responses from my deployments:

$ kubectl port-forward ervice-sample-raycluster-thwmr-worker-small-group-bdv6q 8000:8000
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
$ curl localhost:8000
418

I then killed the head pod:

$ kubectl delete pod rayservice-sample-raycluster-thwmr-head-28mdh
pod "rayservice-sample-raycluster-thwmr-head-28mdh" deleted

Once the head pod was deleted, a replacement head pod began starting up:

$ kubectl get pods

NAME                                                      READY   STATUS              RESTARTS   AGE
ervice-sample-raycluster-thwmr-worker-small-group-6f2pk   1/1     Running             0          24m
ervice-sample-raycluster-thwmr-worker-small-group-bdv6q   1/1     Running             0          96m
rayservice-sample-raycluster-thwmr-head-8xjpx             0/1     ContainerCreating   0          5s
redis-75c8b8b65d-4qgfz                                    1/1     Running             0          96m

My port-forward did not immediately die, and the worker pod was not immediately restarted, which makes me think that GCS fault tolerance was configured correctly. However, while the head pod was recovering, all my curl requests hung. Note: my port-forward was eventually terminated and the worker pods were restarted after the head pod came back up.

$ curl localhost:8000
# hangs with no response until the head pod finishes recovering

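One way to quantify that downtime window (a minimal sketch, not from the original report; it assumes the port-forward above is still running) is to poll the endpoint with a per-request timeout:

while true; do
  ts=$(date +%T)
  # -m 2 caps each request at 2 seconds, so a hung request shows up as "no response"
  out=$(curl -s -m 2 localhost:8000) && echo "$ts $out" || echo "$ts no response"
  sleep 1
done
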
Eventually, the head pod came back up, and the worker pods were restarted. After that, I could reconnect to the cluster and get successful responses from my deployments.

I can't tell if I simply misconfigured GCS fault tolerance, or if this is how GCS fault tolerance is meant to behave.
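
For anyone debugging the same question, two quick checks that the GCS FT wiring is in place (a sketch; the resource names are taken from the pod listing above):

# The RayService should carry the FT annotation
$ kubectl get rayservice rayservice-sample -o jsonpath='{.metadata.annotations.ray\.io/ft-enabled}'
true

# The head pod should see the external Redis address
$ kubectl exec rayservice-sample-raycluster-thwmr-head-28mdh -- printenv RAY_REDIS_ADDRESS
redis:6379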

Reproduction script

Serve application: https://github.com/ray-project/serve_config_examples/blob/42d10bab77741b40d11304ad66d39a4ec2345247/sleepy_pid.py

Kubernetes config file:

kind: ConfigMap
apiVersion: v1
metadata:
  name: redis-config
  labels:
    app: redis
data:
  redis.conf: |-
    port 6379
    bind 0.0.0.0
    protected-mode no
    requirepass 5241590000000000
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  labels:
    app: redis
spec:
  type: ClusterIP
  ports:
    - name: redis
      port: 6379
  selector:
    app: redis
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:5.0.8
          command:
            - "sh"
            - "-c"
            - "redis-server /usr/local/etc/redis/redis.conf"
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: config
              mountPath: /usr/local/etc/redis/redis.conf
              subPath: redis.conf
      volumes:
        - name: config
          configMap:
            name: redis-config
---
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample
  annotations:
    ray.io/ft-enabled: "true"
spec:
  serveConfig:
    importPath: "sleepy_pid:app"
    runtimeEnv: |
      working_dir: "https://github.com/ray-project/serve_config_examples/archive/42d10bab77741b40d11304ad66d39a4ec2345247.zip"
    deployments:
      - name: SleepyPid
        numReplicas: 6
        rayActorOptions:
          numCpus: 0
  rayClusterConfig:
    rayVersion: '2.0.0'
    headGroupSpec:
      serviceType: ClusterIP
      replicas: 1
      rayStartParams:
        block: 'true'
        num-cpus: '2'
        object-store-memory: '100000000'
        dashboard-host: '0.0.0.0'
        node-ip-address: $MY_POD_IP # Auto-completed as the head pod IP
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.0.0
              imagePullPolicy: Always
              env:
                - name: MY_POD_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
                - name: RAY_REDIS_ADDRESS
                  value: redis:6379
              resources:
                limits:
                  cpu: 2
                  memory: 2Gi
                requests:
                  cpu: 2
                  memory: 2Gi
              ports:
                - containerPort: 6379
                  name: redis
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      - replicas: 2
        minReplicas: 2
        maxReplicas: 2
        groupName: small-group
        rayStartParams:
          block: 'true'
          node-ip-address: $MY_POD_IP
        template:
          spec:
            initContainers:
              - name: init-myservice
                image: busybox:1.28
                command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
            containers:
              - name: machine-learning
                image: rayproject/ray:2.0.0
                imagePullPolicy: Always
                env:
                  - name:  RAY_DISABLE_DOCKER_CPU_WARNING
                    value: "1"
                  - name: TYPE
                    value: "worker"
                  - name: CPU_REQUEST
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: requests.cpu
                  - name: CPU_LIMITS
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: limits.cpu
                  - name: MEMORY_LIMITS
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: limits.memory
                  - name: MEMORY_REQUESTS
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: requests.memory
                  - name: MY_POD_NAME
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.name
                  - name: MY_POD_IP
                    valueFrom:
                      fieldRef:
                        fieldPath: status.podIP
                ports:
                  - containerPort: 80
                    name: client
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh","-c","ray stop"]
                resources:
                  limits:
                    cpu: "1"
                    memory: "2Gi"
                  requests:
                    cpu: "500m"
                    memory: "2Gi"

Anything else

No response

Are you willing to submit a PR?

DmitriGekhtman commented 2 years ago

cc @brucez-anyscale @wilsonwang371 @iycheng

wilsonwang371 commented 2 years ago

@brucez-anyscale Bruce, I remember we've seen something similar to this before with port forwarding, right?

brucez-anyscale commented 2 years ago

I think @simon-mo and @iycheng have fixed this.

kevin85421 commented 1 year ago

Can we close this issue? Thanks! @shrekris-anyscale @brucez-anyscale

shrekris-anyscale commented 1 year ago

Hi @kevin85421, this is still an issue, but I'm not sure if it's caused by Ray Serve itself or by KubeRay. It's somewhat mitigated by this Ray change, but I think we should leave this issue open for tracking. I've classified it as a P2.

akshay-anyscale commented 1 year ago

@shrekris-anyscale what's the priority and impact of this issue now?

shrekris-anyscale commented 1 year ago

We've made more progress on this issue. #33384 will further reduce downtime while the head node is down. That change should ensure minimal downtime when this issue occurs.

After merging that change, I'd be comfortable marking this issue as a P3, or closing it altogether.