kserve / modelmesh-serving

Controller for ModelMesh

Documentation about GPU memory #407

Open · WaterKnight1998 opened this issue 1 year ago

WaterKnight1998 commented 1 year ago

Thank you very much for the incredible project!

First of all, it would be very helpful if you added documentation on how to manage GPU memory when using Triton.

I ran several tests but couldn't work out how the following env parameters work: CONTAINER_MEM_REQ_BYTES and MODELSIZE_MULTIPLIER. I read the explanation here: https://github.com/kserve/modelmesh/issues/82#issuecomment-1582028690
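
As far as I can tell from that comment (this is my reading, not confirmed by any official docs), the sizing math works roughly like this: ModelMesh treats CONTAINER_MEM_REQ_BYTES minus memBufferBytes as the capacity available for loaded models, and estimates each model's in-memory footprint as its on-disk size times MODELSIZE_MULTIPLIER. With the values below, that would be 12884901888 - 134217728 ≈ 11.9 GiB of capacity, and a 1 GiB model on disk would count as 2 GiB, so roughly five or six such models could stay loaded at once.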

I applied the following configuration for a T4 GPU:

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    maxLoadingConcurrency: "2"
  labels:
    app.kubernetes.io/instance: modelmesh-controller
    app.kubernetes.io/managed-by: modelmesh-controller
    app.kubernetes.io/name: modelmesh-controller
    name: modelmesh-serving-triton-2.x-SR
  name: triton-2.x
  # namespace: inference-server
spec:
  builtInAdapter:
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000
    runtimeManagementPort: 8001
    serverType: triton
  containers:
  - args:
    - -c
    - 'mkdir -p /models/_triton_models; chmod 777 /models/_triton_models; exec tritonserver
      "--model-repository=/models/_triton_models" "--model-control-mode=explicit"
      "--strict-model-config=false" "--strict-readiness=false" "--allow-http=true"
      "--allow-sagemaker=false" '
    command:
    - /bin/sh
    image: nvcr.io/nvidia/tritonserver:21.06.1-py3
    livenessProbe:
      exec:
        command:
        - curl
        - --fail
        - --silent
        - --show-error
        - --max-time
        - "9"
        - http://localhost:8000/v2/health/live
      initialDelaySeconds: 5
      periodSeconds: 30
      timeoutSeconds: 10
    name: triton
    env:
    - name: CONTAINER_MEM_REQ_BYTES
      value: "12884901888" # 12 GiB
    - name: MODELSIZE_MULTIPLIER
      value: "2"
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        cpu: 500m
        memory: 1Gi
        nvidia.com/gpu: 1

However, I am seeing models being unloaded and reloaded while GPU memory usage sits at 2522MiB / 15109MiB. I don't understand why I can't get higher GPU utilization.

WaterKnight1998 commented 1 year ago

I realized I was probably setting the configuration in the wrong place: https://github.com/kserve/modelmesh/issues/46#issuecomment-1192388786

WaterKnight1998 commented 1 year ago

I get much better GPU utilization using:

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    maxLoadingConcurrency: "2"
  labels:
    app.kubernetes.io/instance: modelmesh-controller
    app.kubernetes.io/managed-by: modelmesh-controller
    app.kubernetes.io/name: modelmesh-controller
    name: modelmesh-serving-triton-2.x-SR
  name: triton-2.x
  # namespace: inference-server
spec:
  builtInAdapter:
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000
    runtimeManagementPort: 8001
    serverType: triton
    env:
    - name: CONTAINER_MEM_REQ_BYTES
      value: "12884901888" # Works for T4
    - name: MODELSIZE_MULTIPLIER
      value: "2"
  containers:
  - args:
    - -c
    - 'mkdir -p /models/_triton_models; chmod 777 /models/_triton_models; exec tritonserver
      "--model-repository=/models/_triton_models" "--model-control-mode=explicit"
      "--strict-model-config=false" "--strict-readiness=false" "--allow-http=true"
      "--allow-sagemaker=false" '
    command:
    - /bin/sh
    image: nvcr.io/nvidia/tritonserver:21.06.1-py3
    livenessProbe:
      exec:
        command:
        - curl
        - --fail
        - --silent
        - --show-error
        - --max-time
        - "9"
        - http://localhost:8000/v2/health/live
      initialDelaySeconds: 5
      periodSeconds: 30
      timeoutSeconds: 10
    name: triton
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        cpu: 500m
        memory: 1Gi
        nvidia.com/gpu: 1
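
For anyone replicating this, here is a quick way I'd sanity-check that the env variables actually reach the runtime deployment (the file name is mine, and the modelmesh-serving-<runtime-name> deployment name is an assumption based on the controller's default naming convention):

# Save the ServingRuntime above as triton-servingruntime.yaml (name assumed)
kubectl apply -f triton-servingruntime.yaml

# The controller generates one deployment per runtime; grep its spec
# for the adapter env var to confirm the placement took effect.
kubectl get deployment modelmesh-serving-triton-2.x -o yaml | grep -B 1 -A 1 CONTAINER_MEM_REQ_BYTES
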
haiminh2001 commented 2 months ago

(haiminh2001 quoted WaterKnight1998's original question in full; see above.)

Hi, it has been almost a year since you asked, but I came across your question today. First of all, thank you for your question, which taught me how to do model sizing. I hope you have solved your problem by now, but if not, then based on this issue you should place your env variables inside builtInAdapter, not inside containers. :))
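
For future readers, the key difference is just this fragment of the working configuration above: the env block is nested under spec.builtInAdapter instead of under the Triton container. As I understand it, these variables are consumed by the ModelMesh built-in adapter, which is why they have no effect when set on the Triton container itself.

spec:
  builtInAdapter:
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000
    runtimeManagementPort: 8001
    serverType: triton
    env:
    - name: CONTAINER_MEM_REQ_BYTES
      value: "12884901888" # 12 GiB; works for a 16 GiB T4
    - name: MODELSIZE_MULTIPLIER
      value: "2"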