ray-project / kuberay

A toolkit to run Ray applications on Kubernetes

[RayService] HTTPProxies don't update with new replicas under GCS FT #627

Closed shrekris-anyscale closed 2 years ago

shrekris-anyscale commented 2 years ago

Search before asking

KubeRay Component

Others

What happened + What you expected to happen

Note: It turns out that this is not an issue. The bug in #616 was likely caused by a misconfiguration. The HTTPProxies do update with new replicas. I'm filing (and subsequently closing) this issue for tracking purposes, since #616 didn't have a clean repro.

When KubeRay is deployed with GCS Fault Tolerance (FT), Serve's HTTPProxies don't update when a replica crashes and recovers. Instead, they seem to route requests only to the replicas that never crashed.

Reproduction script

Python Code:

# File name: sleepy_pid.py

from ray import serve

@serve.deployment
class SleepyPid:

    def __init__(self):
        import time
        time.sleep(10)  # slow startup, so a recovering replica takes ~10 s to come back

    def __call__(self) -> int:
        import os
        return os.getpid()  # each replica responds with its own process ID

app = SleepyPid.bind()
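
Before touching Kubernetes, the script can be smoke-tested locally. The snippet below is only a hypothetical check (it assumes Ray 2.0 with Serve plus the requests package installed, and that sleepy_pid.py is in the working directory):

# Hypothetical local smoke test for sleepy_pid.py
# (assumes `pip install "ray[serve]==2.0.0" requests`).
import requests
from ray import serve

from sleepy_pid import app

serve.run(app)  # deploys SleepyPid locally; Serve listens on http://127.0.0.1:8000 by default
pids = {requests.get("http://localhost:8000/").text for _ in range(20)}
print(pids)     # a single local replica, so expect exactly one PID
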
Vanilla Kubernetes Config
# Make sure to increase resource requests and limits before using this example in production.
# For examples with more realistic resource configuration, see
# ray-cluster.complete.large.yaml and
# ray-cluster.autoscaler.large.yaml.
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample
spec:
  serviceUnhealthySecondThreshold: 300 # Config for the health check threshold for service. Default value is 60.
  deploymentUnhealthySecondThreshold: 300 # Config for the health check threshold for deployments. Default value is 60.
  serveConfig:
    importPath: "sleepy_pid:app"
    runtimeEnv: |
      working_dir: "https://github.com/ray-project/serve_config_examples/archive/42d10bab77741b40d11304ad66d39a4ec2345247.zip"
    deployments:
      - name: SleepyPid
        numReplicas: 6
        rayActorOptions:
          numCpus: 0
  rayClusterConfig:
    rayVersion: '2.0.0' # should match the Ray version in the image of the containers
    ######################headGroupSpecs#################################
    # head group template and specs, (perhaps 'group' is not needed in the name)
    headGroupSpec:
      # Kubernetes Service Type, valid values are 'ClusterIP', 'NodePort' and 'LoadBalancer'
      serviceType: ClusterIP
      # the pod replicas in this group typed head (assuming there could be more than 1 in the future)
      replicas: 1
      # logical group name, for this called head-group, also can be functional
      # pod type head or worker
      # rayNodeType: head # Not needed since it is under the headgroup
      # the following params are used to complete the ray start: ray start --head --block --redis-port=6379 ...
      rayStartParams:
        port: '6379' # should match container port named gcs-server
        #include_webui: 'true'
        object-store-memory: '100000000'
        # webui_host: "10.1.2.60"
        dashboard-host: '0.0.0.0'
        num-cpus: '2' # can be auto-completed from the limits
        node-ip-address: $MY_POD_IP # auto-completed as the head pod IP
        block: 'true'
      #pod template
      template:
        metadata:
          labels:
            # custom labels. NOTE: do not define custom labels start with `raycluster.`, they may be used in controller.
            # Refer to https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
            rayCluster: raycluster-sample # will be injected if missing
            rayNodeType: head # will be injected if missing, must be head or worker
            groupName: headgroup # will be injected if missing
          # annotations for pod
          annotations:
            key: value
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.0.0
              imagePullPolicy: Always
              #image: bonsaidev.azurecr.io/bonsai/lazer-0-9-0-cpu:dev
              env:
                - name: MY_POD_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
              resources:
                limits:
                  cpu: 2
                  memory: 2Gi
                requests:
                  cpu: 2
                  memory: 2Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        # logical group name, for this called small-group, also can be functional
        groupName: small-group
        # if worker pods need to be added, we can simply increment the replicas
        # if worker pods need to be removed, we decrement the replicas, and populate the podsToDelete list
        # the operator will remove pods from the list until the number of replicas is satisfied
        # when a pod is confirmed to be deleted, its name will be removed from the list below
        #scaleStrategy:
        #  workersToDelete:
        #  - raycluster-complete-worker-small-group-bdtwh
        #  - raycluster-complete-worker-small-group-hv457
        #  - raycluster-complete-worker-small-group-k8tj7
        # the following params are used to complete the ray start: ray start --block --node-ip-address= ...
        rayStartParams:
          node-ip-address: $MY_POD_IP
          block: 'true'
        #pod template
        template:
          metadata:
            labels:
              key: value
            # annotations for pod
            annotations:
              key: value
          spec:
            initContainers:
              # the env var $RAY_IP is set by the operator if missing, with the value of the head service name
              - name: init-myservice
                image: busybox:1.28
                command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
            containers:
              - name: machine-learning # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: rayproject/ray:2.0.0
                imagePullPolicy: Always
                # environment variables to set in the container. Optional.
                # Refer to https://kubernetes.io/docs/tasks/inject-data-application/define-environment-variable-container/
                env:
                  - name:  RAY_DISABLE_DOCKER_CPU_WARNING
                    value: "1"
                  - name: TYPE
                    value: "worker"
                  - name: CPU_REQUEST
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: requests.cpu
                  - name: CPU_LIMITS
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: limits.cpu
                  - name: MEMORY_LIMITS
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: limits.memory
                  - name: MEMORY_REQUESTS
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: requests.memory
                  - name: MY_POD_NAME
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.name
                  - name: MY_POD_IP
                    valueFrom:
                      fieldRef:
                        fieldPath: status.podIP
                ports:
                  - containerPort: 80
                    name: client
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh","-c","ray stop"]
                resources:
                  limits:
                    cpu: "1"
                    memory: "2Gi"
                  requests:
                    cpu: "500m"
                    memory: "2Gi"

Fault Tolerant (FT) Kubernetes Config
# File name: ft_config.yaml

kind: ConfigMap
apiVersion: v1
metadata:
  name: redis-config
  labels:
    app: redis
data:
  redis.conf: |-
    port 6379
    bind 0.0.0.0
    protected-mode no
    requirepass 5241590000000000
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  labels:
    app: redis
spec:
  type: ClusterIP
  ports:
    - name: redis
      port: 6379
  selector:
    app: redis
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:5.0.8
          command:
            - "sh"
            - "-c"
            - "redis-server /usr/local/etc/redis/redis.conf"
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: config
              mountPath: /usr/local/etc/redis/redis.conf
              subPath: redis.conf
      volumes:
        - name: config
          configMap:
            name: redis-config
---
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample
  annotations:
    ray.io/ft-enabled: "true"
spec:
  serviceUnhealthySecondThreshold: 300
  deploymentUnhealthySecondThreshold: 300
  serveConfig:
    importPath: "sleepy_pid:app"
    runtimeEnv: |
      working_dir: "https://github.com/ray-project/serve_config_examples/archive/42d10bab77741b40d11304ad66d39a4ec2345247.zip"
    deployments:
      - name: SleepyPid
        numReplicas: 6
        rayActorOptions:
          numCpus: 0
  rayClusterConfig:
    rayVersion: '2.0.0'
    headGroupSpec:
      serviceType: ClusterIP
      replicas: 1
      rayStartParams:
        block: 'true'
        num-cpus: '2'
        object-store-memory: '100000000'
        dashboard-host: '0.0.0.0'
        node-ip-address: $MY_POD_IP # Auto-completed as the head pod IP
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.0.0
              imagePullPolicy: Always
              env:
                - name: MY_POD_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
                - name: RAY_REDIS_ADDRESS
                  value: redis:6379
              resources:
                limits:
                  cpu: 2
                  memory: 2Gi
                requests:
                  cpu: 2
                  memory: 2Gi
              ports:
                - containerPort: 6379
                  name: redis
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      - replicas: 1
        minReplicas: 1
        maxReplicas: 1
        groupName: small-group
        rayStartParams:
          block: 'true'
          node-ip-address: $MY_POD_IP
        template:
          spec:
            initContainers:
              - name: init-myservice
                image: busybox:1.28
                command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
            containers:
              - name: machine-learning
                image: rayproject/ray:2.0.0
                imagePullPolicy: Always
                env:
                  - name:  RAY_DISABLE_DOCKER_CPU_WARNING
                    value: "1"
                  - name: TYPE
                    value: "worker"
                  - name: CPU_REQUEST
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: requests.cpu
                  - name: CPU_LIMITS
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: limits.cpu
                  - name: MEMORY_LIMITS
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: limits.memory
                  - name: MEMORY_REQUESTS
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: requests.memory
                  - name: MY_POD_NAME
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.name
                  - name: MY_POD_IP
                    valueFrom:
                      fieldRef:
                        fieldPath: status.podIP
                ports:
                  - containerPort: 80
                    name: client
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh","-c","ray stop"]
                resources:
                  limits:
                    cpu: "1"
                    memory: "2Gi"
                  requests:
                    cpu: "500m"
                    memory: "2Gi"
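
As an optional sanity check of the GCS FT wiring (the head container's RAY_REDIS_ADDRESS points at the redis Service above), a hypothetical snippet like the one below, run from any pod that has the redis-py package installed, verifies that Redis is reachable with the configured password:

# Hypothetical connectivity check; assumes `pip install redis` inside the pod.
import redis

r = redis.Redis(host="redis", port=6379, password="5241590000000000")
print(r.ping())  # True means the Redis instance backing GCS FT is reachable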

Reproduction Steps

  1. Install the KubeRay operator from master (to include #540, which fixes #539):
    kubectl create -k "github.com/ray-project/kuberay/ray-operator/config/default?ref=master&timeout=90s"
    $ kubectl get deployments -n ray-system
    NAME                READY   UP-TO-DATE   AVAILABLE   AGE
    kuberay-operator    1/1     1            1           13s
  2. Apply the vanilla Kubernetes config:
    kubectl apply -f vanilla_config.yaml
    $ kubectl get pods
    NAME                                                          READY   STATUS    RESTARTS        AGE
    rayservice-sample-raycluster-r6fhh-worker-small-group-88t5h   1/1     Running   0               7m7s
    rayservice-sample-raycluster-r6fhh-head-k4wl5                 1/1     Running   1 (6m59s ago)   7m7s
  3. Open a port-forward to the RayService
    kubectl port-forward service/rayservice-sample-serve-svc 8000
  4. Send many requests to the service. You should notice 6 different PIDs (a scripted helper for this is sketched after this list):
    shrekris@Shreyas-MacBook-Pro http_proxy_error % curl localhost:8000
    148%                                                                            
    shrekris@Shreyas-MacBook-Pro http_proxy_error % curl localhost:8000
    379%                                                                            
    shrekris@Shreyas-MacBook-Pro http_proxy_error % curl localhost:8000
    416%                                                                            
    shrekris@Shreyas-MacBook-Pro http_proxy_error % curl localhost:8000
    121%                                                                             
    shrekris@Shreyas-MacBook-Pro http_proxy_error % curl localhost:8000
    352%                                                                            
    shrekris@Shreyas-MacBook-Pro http_proxy_error % curl localhost:8000
    184%  
  5. Kill one of the replicas by exec'ing into one of the pods and using the Python interpreter (see the in-pod sketch after this list). See this guide for an explanation.
  6. Send curl requests again. If you deployed the vanilla config, you should still see 6 different PIDs, but one of them will be new: it belongs to the replica that died and recovered. This is the expected behavior.
  7. Repeat this process with the fault tolerant config. This time, you'll see only 5 different PIDs in step (6). This is the bug: the replica that recovered is not accessible through HTTP. However, you can still reach the new replica through a Serve handle (exec into a worker pod, use the Python interpreter to call serve.get_deployment("SleepyPid").get_handle(), and make requests through the handle, as in the in-pod sketch after this list). You can also confirm that the new replica is alive by running ray list actors --filter "class_name=ServeReplica:SleepyPid".
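
For steps 4 and 6, the repeated curl calls can be replaced with a small hypothetical helper run on the local machine (assuming the port-forward from step 3 is still active and the requests package is installed):

# Hypothetical helper: count the distinct replica PIDs seen through the HTTP proxy.
import requests

pids = {requests.get("http://localhost:8000/").text for _ in range(200)}
print(len(pids), sorted(pids))  # expect 6 distinct PIDs while all replicas are healthy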
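
For steps 5 and 7, the in-pod Python session might look roughly like the sketch below. The replica actor name is a placeholder; copy the real name from the output of ray list actors --filter "class_name=ServeReplica:SleepyPid", since naming details vary across Ray versions:

# Rough sketch of the in-pod session (Ray 2.0 API assumed);
# run inside `kubectl exec -it <pod> -- python`.
import ray
from ray import serve

ray.init(address="auto", namespace="serve")  # attach to the running cluster in Serve's namespace

# Step 5: kill one replica. "SleepyPid#xxxxx" is a placeholder; use the actor
# name reported by `ray list actors`.
replica = ray.get_actor("SleepyPid#xxxxx")
ray.kill(replica, no_restart=False)          # Serve should restart it after the 10 s __init__ sleep

# Step 7: query the deployment through a Serve handle, bypassing the HTTP proxy.
handle = serve.get_deployment("SleepyPid").get_handle()
pids = {ray.get(handle.remote()) for _ in range(50)}
print(pids)                                  # the recovered replica's new PID should appear here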

Anything else

This issue is not reliably reproducible, so it's unlikely to be an actual bug. However, #616 raised the same issue without as clean a repro, so I'm filing this issue (and subsequently closing it) to keep a clean repro on record in case it's useful in the future.

Are you willing to submit a PR?

shrekris-anyscale commented 2 years ago

This issue is not reliably reproducible, so it's unlikely to be an actual bug. However, https://github.com/ray-project/kuberay/issues/616 raised the same issue without as clean a repro, so I'm filing this issue (and subsequently closing it) to keep a clean repro on record in case it's useful in the future.