ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Cluster,Serve] Cluster fault tolerance breaks down when configuring silent head node updates. Applications are not recoverable. #42390

Open psydok opened 10 months ago

psydok commented 10 months ago

What happened + What you expected to happen

  1. I brought up the head node in Kubernetes (without using KubeRay).
  2. Enabled GCS fault tolerance (connected an external Redis).
  3. Then I ran serve start --http-host 0.0.0.0 --http-port=8000 --grpc-port=9000 .... on the head node and started my application with ray job submit (a minimal stand-in for the application is sketched after this list). Everything worked.
  4. Then I update the head node (e.g. change a variable, to simulate a problem with the head node).
  5. The head node comes back up on different hardware, and I get the following error on the Overview page under "Resource Status". On the Actors page I also see SERVE_CONTROLLER_ACTOR in status RESTARTING (inside its page: SERVE_CONTROLLER_ACTOR -> PENDING_NODE_ASSIGNMENT), even though the remote (worker) node has not changed; only the head has changed:
    # http://k8s_ip:8265/#/overview
    Demands:
    {'node:__internal_head__': 0.001}: 2+ pending tasks/actors
    The autoscaler failed with the following error:
    Terminated with signal 15
    File "/usr/local/lib/python3.11/site-packages/ray/autoscaler/_private/monitor.py", line 711, in <module>
    monitor.run()
    File "/usr/local/lib/python3.11/site-packages/ray/autoscaler/_private/monitor.py", line 586, in run
    self._run()
    File "/usr/local/lib/python3.11/site-packages/ray/autoscaler/_private/monitor.py", line 368, in _run
    self.update_load_metrics()
    File "/usr/local/lib/python3.11/site-packages/ray/autoscaler/_private/monitor.py", line 244, in update_load_metrics
    response = self.gcs_client.get_all_resource_usage(timeout=60)
    File "/usr/local/lib/python3.11/site-packages/google/protobuf/internal/python_message.py", line 495, in __init__
    def __init__(self, **kwargs):
  6. The page at "http://k8s_ip:8265/#/serve" stops loading. Cluster fault tolerance breaks down when the head node is silently updated, and the applications are not recoverable. I've gone through all the log files but can't find any clue about how to fix this.
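The application code itself is not included below; as a minimal hypothetical stand-in for the job submitted in step 3 (the deployment name and handler here are invented, and the real application additionally registers a gRPC servicer from test_pb2_grpc), a Python script like the following, submitted with ray job submit, should exercise the same path:

# job.py - hypothetical stand-in for the application submitted in step 3;
# the real application is not part of this report.
from ray import serve


@serve.deployment(num_replicas=1)
class Echo:
    def __call__(self, request) -> str:
        # Trivial handler so the Serve proxy has something to route to.
        return "ok"


# Deploys onto the Serve instance started with `serve start` in step 3
# and returns once the deployment is healthy.
serve.run(Echo.bind(), name="echo", route_prefix="/echo")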

I found this error, but I have no idea how to pass soft=True through serve start ....:

ActorUnschedulableError: The actor is not schedulable: The node specified via NodeAffinitySchedulingStrategy doesn't exist any more or is infeasible, and soft=False was specified
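For context, soft is not a serve start option; it is a parameter of Ray's NodeAffinitySchedulingStrategy, and the error indicates that some actor was pinned with soft=False to a node that no longer exists. For a user-defined actor the public API looks like the sketch below (MyActor and the choice of node are purely illustrative); for the Serve-internal actors involved here there does not appear to be a serve start flag that changes it:

import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init(address="auto")


@ray.remote
class MyActor:
    def ping(self) -> str:
        return "pong"


# soft=True lets Ray reschedule the actor elsewhere if the target node
# disappears; with soft=False the actor stays unschedulable, as in the
# error above.
actor = MyActor.options(
    scheduling_strategy=NodeAffinitySchedulingStrategy(
        node_id=ray.get_runtime_context().get_node_id(),  # pin to this node
        soft=True,
    )
).remote()
print(ray.get(actor.ping.remote()))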

Versions / Dependencies

python==3.11.5 ray==2.8.1 kubectl==v0.23.1 argocd==v2.4.12+41f54aa

Reproduction script

# Dockerfile for nodes
FROM python:3.11.5-slim
RUN apt-get update && apt-get install -y g++ gcc libsndfile1 git ffmpeg podman runc curl \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean
RUN curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
  && apt-get update
RUN apt-get install -y nvidia-container-toolkit
ENTRYPOINT ["/root/ray/entrypoint.sh"]
# entrypoint.sh for head node
# if [ -n "$CURRENT_NODE_IP" ]; then
#    CURRENT_NODE_IP="--node-ip-address=${CURRENT_NODE_IP}"
# fi
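# External Redis from step 2 (GCS fault tolerance). Ray reads this from the
# RAY_REDIS_ADDRESS environment variable of the `ray start` process, so the
# real entrypoint presumably exports it; a plain shell assignment alone is
# not inherited by `ray start`.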
RAY_REDIS_ADDRESS="external_ip:6370"
RAY_CONFIG_CREATING_NODE="--head --dashboard-host=0.0.0.0 --num-cpus=0 --num-gpus=0"
ulimit -n 65536; ray start \
    --min-worker-port=$MIN_WORKER_PORT \
    --max-worker-port=$MAX_WORKER_PORT \
    --node-manager-port=$NODE_MANAGER_PORT \
    --object-manager-port=$OBJECT_MANAGER_PORT \
    --ray-client-server-port=$RAY_CLIENT_PORT \
    --dashboard-grpc-port=$RAY_DASHBOARD_GRPC_PORT \
    --dashboard-agent-listen-port=$DASHBOARD_AGENT_LISTEN_PORT \
    --dashboard-agent-grpc-port=$DASHBOARD_AGENT_GRPC_PORT \
    --runtime-env-agent-port=$RUNTIME_ENV_AGENT_PORT \
    --metrics-export-port=$METRICS_EXPORT_PORT \
    $CURRENT_NODE_IP $RAY_CONFIG_CREATING_NODE \
    --block
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: {{ .Release.Name }}
    chart: "{{ .Release.Name }}-{{ .Chart.Version }}"
    release: stable
    heritage: {{ .Release.Service }}
    component: {{ .Release.Name }}
  name: {{ .Release.Name }}
spec:
  replicas: 1
  selector:
    matchLabels:
      app: {{ .Release.Name }}
      release: stable
      component: {{ .Release.Name }}
  strategy:
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
    type: RollingUpdate
  revisionHistoryLimit: 0

  template:
    metadata:
      labels:
        app: {{ .Release.Name }}
        release: stable
        component: {{ .Release.Name }}

    spec:
      imagePullSecrets:
        - name: registry-secret-ml
      containers:
        - name: {{ .Release.Name }}
          image: {{ .Values.deployment.image }}
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - mountPath: /tmp/ray
              name: log-volume
            # - mountPath: /dev/shm
            #   name: dshm
          env:
          {{- range $key, $val := .Values.deployment.env }}
          - name: {{ $key | quote }}
            value: {{ $val | quote }}
          {{- end }}
          - name: MY_CPU_REQUEST
            valueFrom:
              resourceFieldRef:
                resource: requests.cpu
          lifecycle:
            postStart:
              exec:
                command: ["/bin/bash", "-c", "for i in {1..5}; do serve start --proxy-location EveryNode \
                --http-host 0.0.0.0 --http-port 8000 --grpc-port 9000 \
                --grpc-servicer-functions test_pb2_grpc.add_TestServicer_to_server && break || sleep 15; done"]
            preStop:
              exec:
                command: ["/bin/bash", "-c", "ray stop"]
          {{- if .Values.deployment.useProbes }}
          livenessProbe:
            initialDelaySeconds: 60
            failureThreshold: 3
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 30
            httpGet:
              path: /api/gcs_healthz
              port: 8265
              scheme: HTTP
          readinessProbe: 
            initialDelaySeconds: 60
            failureThreshold: 3
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 15
            httpGet:
              path: /api/gcs_healthz
              port: 8265
              scheme: HTTP
          {{- end }}
          ports:
          - containerPort: 6379
          - containerPort: 8265
          - containerPort: 10001
          - containerPort: 9099
          - containerPort: 8099
          - containerPort: 52365
          - containerPort: 45001
          - containerPort: 35369
          - containerPort: 19124
          - containerPort: 59671
          - containerPort: 54223
          - containerPort: 9265
          resources:
{{ toYaml .Values.deployment.resources | trim | indent 12 }}
      dnsPolicy: Default
      volumes:
        - name: log-volume
          emptyDir: {}
          # persistentVolumeClaim:
          #   claimName: ray-head-pvc
        # - name: dshm
        #   emptyDir:
        #     medium: Memory
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: release
                    operator: In
                    values:
                      - "{{ .Release.Name }}"
              topologyKey: kubernetes.io/hostname
      restartPolicy: {{ .Values.deployment.restartPolicy }}
      priorityClassName: {{ .Values.deployment.priority }}
      terminationGracePeriodSeconds: 20
# service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: {{ .Release.Name }}
    chart: "{{ .Release.Name }}-{{ .Chart.Version }}"
    release: stable
    heritage: {{ .Release.Service }}
    component: {{ .Release.Name }}
  name: {{ .Release.Name }}
spec:
  externalTrafficPolicy: Cluster
  type: LoadBalancer
  ports:
    - name: rayheadport
      protocol: TCP
      port: 6379
      targetPort: 6379
    - name: rayclientport
      protocol: TCP
      port: 10001
      targetPort: 10001
    - name: raydashboardport
      protocol: TCP
      port: 8265
      targetPort: 8265
    - name: httpport
      protocol: TCP
      port: 8000
      targetPort: 8000
    - name: grpcport
      protocol: TCP
      port: 9000
      targetPort: 9000
    - name: objectmanagerport
      protocol: TCP
      port: 45001
      targetPort: 45001
    - name: nodemanagerport
      protocol: TCP
      port: 35369
      targetPort: 35369
    - name: dashboardagentlistenport
      protocol: TCP
      port: 52365
      targetPort: 52365
    - name: runtimeenvagentport
      protocol: TCP
      port: 19124
      targetPort: 19124
    - name: metricsexportport
      protocol: TCP
      port: 59671
      targetPort: 59671
    - name: dashboardagentgrpcport
      protocol: TCP
      port: 54223
      targetPort: 54223
    - name: raydashboardgrpcport
      protocol: TCP
      port: 9265
      targetPort: 9265
  selector:
    app: {{ .Release.Name }}
    release: stable
    component: {{ .Release.Name }}
  sessionAffinity: None

Issue Severity

High: It blocks me from completing my task.

psydok commented 10 months ago

I also get this error when I try to update the app (the cluster shows that the application itself is running, but the controller and proxy on the head node are not running):

WARNING worker.py:2052 -- The autoscaler failed with the following error:
Terminated with signal 15
  File "/usr/local/lib/python3.11/site-packages/ray/autoscaler/_private/monitor.py", line 711, in <module>
    monitor.run()
  File "/usr/local/lib/python3.11/site-packages/ray/autoscaler/_private/monitor.py", line 586, in run
    self._run()
  File "/usr/local/lib/python3.11/site-packages/ray/autoscaler/_private/monitor.py", line 440, in _run
    time.sleep(AUTOSCALER_UPDATE_INTERVAL_S)
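The controller and proxy state described above can also be inspected from a driver attached to the cluster with the Serve status API; a rough sketch (it may itself fail if the controller is unrecoverable):

import ray
from ray import serve

ray.init(address="auto")  # attach to the running cluster

# Snapshot of Serve's view: proxy health per node and per-application
# deployment statuses.
print(serve.status())
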
anyscalesam commented 9 months ago

@sihanwang41 please triage