rh-aiservices-bu / llm-on-openshift

Resources, demos, recipes,... to work with LLMs on OpenShift with OpenShift AI or Open Data Hub.

Multi-GPU setup for vLLM on OpenShift does not work #64

Closed. jayteaftw closed this issue 2 months ago.

jayteaftw commented 3 months ago

Hi, I tried using your deployment.yaml; the single-GPU instance works, but the multi-GPU setup stalls. Here is the log output:

/opt/app-root/lib64/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
WARNING 06-10 20:49:01 config.py:1155] Casting torch.bfloat16 to torch.float16.
2024-06-10 20:49:04,319 INFO worker.py:1749 -- Started a local Ray instance.
INFO 06-10 20:49:04 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='mistralai/Mistral-7B-Instruct-v0.2', speculative_config=None, tokenizer='mistralai/Mistral-7B-Instruct-v0.2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=6144, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=mistralai/Mistral-7B-Instruct-v0.2)

And here is the modified deployment file:

kind: Deployment
apiVersion: apps/v1
metadata:
  name: vllm
  labels:
    app: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: vllm
    spec:
      imagePullSecrets:
      - name: regcred
      restartPolicy: Always
      schedulerName: default-scheduler
      affinity: {}
      terminationGracePeriodSeconds: 120
      securityContext: {}
      containers:
        - resources:
            limits:
              cpu: '8'
              memory: 24Gi
              nvidia.com/gpu: '2'
            requests:
              cpu: '6'
          readinessProbe:
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            timeoutSeconds: 5
            periodSeconds: 30
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          name: server
          livenessProbe:
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            timeoutSeconds: 8
            periodSeconds: 100
            successThreshold: 1
            failureThreshold: 3
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              value: xxxxxx
          args: [
            "--model",
            "mistralai/Mistral-7B-Instruct-v0.2",
            "--dtype", "float16",
            "--max-model-len", "6144",
            "--tensor-parallel-size", "2"]
          securityContext:
            capabilities:
              drop:
                - ALL
            runAsNonRoot: false
            allowPrivilegeEscalation: true
            seccompProfile:
              type: RuntimeDefault
          ports:
            - name: http
              containerPort: 8000
              protocol: TCP
          imagePullPolicy: IfNotPresent
          startupProbe:
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            timeoutSeconds: 1
            periodSeconds: 30
            successThreshold: 1
            failureThreshold: 24
          volumeMounts:
            - mountPath: /opt/app-root/src/.cache/huggingface/hub
              name: model
            - name: shm
              mountPath: /dev/shm
          terminationMessagePolicy: File
          image: 'quay.io/rh-aiservices-bu/vllm-openai-ubi9:0.4.2'
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: hub-pv-filesystem
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 10Gi
      dnsPolicy: ClusterFirst
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
  strategy:
    type: Recreate
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
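
In case it helps narrow down where the tensor-parallel initialization hangs, here is a minimal sketch of extra debug environment variables for the server container (an assumption on my side: that this image passes environment variables through to vLLM and NCCL, and that this vLLM build reads VLLM_LOGGING_LEVEL; support for that variable depends on the vLLM version):

          env:
            - name: HUGGING_FACE_HUB_TOKEN
              value: xxxxxx
            # NCCL prints detailed initialization/communication logs, useful for multi-GPU hangs
            - name: NCCL_DEBUG
              value: INFO
            # Assumption: this vLLM build honors VLLM_LOGGING_LEVEL for more verbose engine logs
            - name: VLLM_LOGGING_LEVEL
              value: DEBUG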
guimou commented 3 months ago

Hmm, that's odd... And no more logs, it just stays like that? How long did you wait? (To see whether more errors show up after an eventual timeout.)

bbrowning commented 3 months ago

My team is successfully running the same model (Mistral-7B-Instruct-v0.2) with the same 0.4.2 version of this image on multi-GPU setups without issue. I'm not sure why it appears to stall in this case, but multi-GPU does work in general.
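
One quick sanity check (a sketch, assuming the oc CLI is available and that nvidia-smi is present in the image) is to confirm that both GPUs are actually visible inside the running pod:

oc exec deploy/vllm -- nvidia-smi

If only one GPU shows up there, the problem is more likely in the GPU request, toleration, or node setup than in vLLM itself.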

guimou commented 2 months ago

No further comments, so closing, at least for now.