huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference

OutOfMemory error running Meta-Llama-3.1-405B-Instruct-fp8 on 8xH100 #2572

Open ad01bl opened 2 months ago

ad01bl commented 2 months ago

System Info

TGI version: 2.2.0 (I also tested 2.3.0)
Machine: 8x H100 (640 GB GPU RAM total)

2024-09-25T14:29:44.260160Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.79.0
Commit sha: db7e043ded45e14ed24188d5a963911c96049618
Docker label: sha-db7e043
nvidia-smi:
Wed Sep 25 14:29:43 2024
   +---------------------------------------------------------------------------------------+
   | NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
   |-----------------------------------------+----------------------+----------------------+
   | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
   |                                         |                      |               MIG M. |
   |=========================================+======================+======================|
   |   0  NVIDIA H100 80GB HBM3          On  | 00000000:0F:00.0 Off |                    0 |
   | N/A   30C    P0             114W / 700W |      0MiB / 81559MiB |      0%      Default |
   |                                         |                      |             Disabled |
   +-----------------------------------------+----------------------+----------------------+
   |   1  NVIDIA H100 80GB HBM3          On  | 00000000:2D:00.0 Off |                    0 |
   | N/A   35C    P0             120W / 700W |      0MiB / 81559MiB |      0%      Default |
   |                                         |                      |             Disabled |
   +-----------------------------------------+----------------------+----------------------+
   |   2  NVIDIA H100 80GB HBM3          On  | 00000000:44:00.0 Off |                    0 |
   | N/A   31C    P0             115W / 700W |      0MiB / 81559MiB |      0%      Default |
   |                                         |                      |             Disabled |
   +-----------------------------------------+----------------------+----------------------+
   |   3  NVIDIA H100 80GB HBM3          On  | 00000000:5B:00.0 Off |                    0 |
   | N/A   36C    P0             115W / 700W |      0MiB / 81559MiB |      0%      Default |
   |                                         |                      |             Disabled |
   +-----------------------------------------+----------------------+----------------------+
   |   4  NVIDIA H100 80GB HBM3          On  | 00000000:89:00.0 Off |                    0 |
   | N/A   31C    P0             114W / 700W |      0MiB / 81559MiB |      0%      Default |
   |                                         |                      |             Disabled |
   +-----------------------------------------+----------------------+----------------------+
   |   5  NVIDIA H100 80GB HBM3          On  | 00000000:A8:00.0 Off |                    0 |
   | N/A   35C    P0             118W / 700W |      0MiB / 81559MiB |      0%      Default |
   |                                         |                      |             Disabled |
   +-----------------------------------------+----------------------+----------------------+
   |   6  NVIDIA H100 80GB HBM3          On  | 00000000:C0:00.0 Off |                    0 |
   | N/A   36C    P0             116W / 700W |      0MiB / 81559MiB |      0%      Default |
   |                                         |                      |             Disabled |
   +-----------------------------------------+----------------------+----------------------+
   |   7  NVIDIA H100 80GB HBM3          On  | 00000000:D8:00.0 Off |                    0 |
   | N/A   32C    P0             116W / 700W |      0MiB / 81559MiB |      0%      Default |
   |                                         |                      |             Disabled |
   +-----------------------------------------+----------------------+----------------------+

   +---------------------------------------------------------------------------------------+
   | Processes:                                                                            |
   |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
   |        ID   ID                                                             Usage      |
   |=======================================================================================|
   |  No running processes found                                                           |
   +---------------------------------------------------------------------------------------+
xpu-smi:
N/A

Reproduction

  1. My deployment yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: tgi-llama
  name: tgi-llama
  namespace: llama-31
spec:
  selector:
    matchLabels:
      app: tgi-llama
  template:
    metadata:
      labels:
        app: tgi-llama
    spec:
      containers:
      - name: tgi-llama
        image: "ghcr.io/huggingface/text-generation-inference:2.2.0"
        args: ["--model-id", "meta-llama/Meta-Llama-3.1-405B-Instruct-fp8", "--sharded", "true", "--num-shard ", "8", "--env"]
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            cpu: 100
            memory: 1000G
            nvidia.com/gpu: 8
        ports:
        - containerPort: 80
        volumeMounts:
        - mountPath: /data
          name: tgi-llama-disk
        - mountPath: /dev/shm
          name: dshm
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          value: ""
        - name: MAX_TOTAL_TOKENS
          value: "13107"
        - name: MAX_INPUT_LENGTH
          value: "500"
        - name: MAX_BATCH_PREFILL_TOKENS
          value: "550"
        - name: HUGGINGFACE_HUB_CACHE
          value: "/data"
      restartPolicy: Always
      volumes:
      - name: tgi-llama-disk
        persistentVolumeClaim:
          claimName: tgi-llama-disk
      - name: dshm
        emptyDir:
           medium: Memory
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: NoSchedule
      - key: "model"
        operator: "Equal"
        effect: NoSchedule
        value: "llama31"
  2. Logs:
2024-09-25T14:29:44.260191Z  INFO text_generation_launcher: Args {
    model_id: "meta-llama/Meta-Llama-3.1-405B-Instruct-fp8",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: Some(
        8,
    ),
    quantize: None,
    speculate: None,
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: Some(
        500,
    ),
    max_total_tokens: Some(
        13107,
    ),
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: Some(
        550,
    ),
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "tgi-llama-6dfd4d944f-vmdkw",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/data",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: true,
    max_client_batch_size: 4,
    lora_adapters: None,
    disable_usage_stats: false,
    disable_crash_reports: false,
}
2024-09-25T14:29:44.260260Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-09-25T14:29:44.441323Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-09-25T14:29:44.441331Z  INFO text_generation_launcher: Sharding model on 8 processes
2024-09-25T14:29:44.441452Z  INFO download: text_generation_launcher: Starting check and download process for meta-llama/Meta-Llama-3.1-405B-Instruct-fp8
2024-09-25T15:00:51.799015Z  INFO download: text_generation_launcher: Successfully downloaded weights for meta-llama/Meta-Llama-3.1-405B-Instruct-fp8
2024-09-25T15:00:51.799235Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-09-25T15:00:51.799251Z  INFO shard-manager: text_generation_launcher: Starting shard rank=1
2024-09-25T15:00:51.799601Z  INFO shard-manager: text_generation_launcher: Starting shard rank=2
2024-09-25T15:00:51.800066Z  INFO shard-manager: text_generation_launcher: Starting shard rank=3
2024-09-25T15:00:51.800097Z  INFO shard-manager: text_generation_launcher: Starting shard rank=4
2024-09-25T15:00:51.801546Z  INFO shard-manager: text_generation_launcher: Starting shard rank=5
2024-09-25T15:00:51.801585Z  INFO shard-manager: text_generation_launcher: Starting shard rank=6
2024-09-25T15:00:51.802622Z  INFO shard-manager: text_generation_launcher: Starting shard rank=7
2024-09-25T15:00:56.515337Z  INFO text_generation_launcher: Auto selecting quantization method fp8
2024-09-25T15:01:01.806057Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-25T15:01:01.807285Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-09-25T15:01:01.807322Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2024-09-25T15:01:01.807360Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=4
2024-09-25T15:01:01.808804Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2024-09-25T15:01:01.809297Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=6
2024-09-25T15:01:01.809605Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=7
2024-09-25T15:01:01.814302Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=5
2024-09-25T15:01:05.514208Z  INFO text_generation_launcher: Using FBGEMM fp8 optimized kernels
2024-09-25T15:04:30.363596Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-2
2024-09-25T15:04:30.371516Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-3
2024-09-25T15:04:30.372803Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-4
2024-09-25T15:04:30.372919Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-5
2024-09-25T15:04:30.372927Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-7
2024-09-25T15:04:30.373540Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-09-25T15:04:30.373927Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
2024-09-25T15:04:30.420621Z  INFO shard-manager: text_generation_launcher: Shard ready in 218.618910525s rank=4
2024-09-25T15:04:30.426690Z  INFO shard-manager: text_generation_launcher: Shard ready in 218.622944116s rank=7
2024-09-25T15:04:30.427452Z  INFO shard-manager: text_generation_launcher: Shard ready in 218.62400201s rank=5
2024-09-25T15:04:30.444388Z  INFO shard-manager: text_generation_launcher: Shard ready in 218.644204722s rank=0
2024-09-25T15:04:30.460515Z  INFO shard-manager: text_generation_launcher: Shard ready in 218.658884257s rank=2
2024-09-25T15:04:30.460530Z  INFO shard-manager: text_generation_launcher: Shard ready in 218.658891373s rank=1
2024-09-25T15:04:30.460532Z  INFO shard-manager: text_generation_launcher: Shard ready in 218.657400525s rank=3
2024-09-25T15:04:30.556841Z  INFO text_generation_launcher: Starting Webserver
2024-09-25T15:04:30.664794Z  INFO text_generation_router: router/src/main.rs:228: Using the Hugging Face API
2024-09-25T15:04:30.664836Z  INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"   
2024-09-25T15:04:31.378511Z  INFO text_generation_router: router/src/main.rs:577: Serving revision 2147c7e74f1bf338ad11843e450ee174df547589 of model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8
2024-09-25T15:04:31.597861Z  INFO text_generation_router: router/src/main.rs:357: Using config Some(Llama)
2024-09-25T15:04:31.597869Z  WARN text_generation_router: router/src/main.rs:384: Invalid hostname, defaulting to 0.0.0.0
2024-09-25T15:04:31.851898Z  INFO text_generation_router::server: router/src/server.rs:1572: Warming up model
2024-09-25T15:04:33.037820Z  INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
2024-09-25T15:04:34.456876Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
2024-09-25T15:04:34.519240Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 125, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1196, in warmup
    self.cuda_graph_warmup(bs, max_s, max_bt)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1065, in cuda_graph_warmup
    with torch.cuda.graph(graph, pool=MEM_POOL):
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/graphs.py", line 184, in __exit__
    self.cuda_graph.capture_end()
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/graphs.py", line 82, in capture_end
    super().capture_end()
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-09-25T15:04:34.598137Z ERROR warmup{max_input_length=500 max_prefill_tokens=550 max_total_tokens=13107 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-09-25T15:04:34.617895Z ERROR warmup{max_input_length=500 max_prefill_tokens=550 max_total_tokens=13107 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-09-25T15:04:34.650181Z ERROR warmup{max_input_length=500 max_prefill_tokens=550 max_total_tokens=13107 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-09-25T15:04:34.677632Z ERROR warmup{max_input_length=500 max_prefill_tokens=550 max_total_tokens=13107 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-09-25T15:04:34.680492Z ERROR warmup{max_input_length=500 max_prefill_tokens=550 max_total_tokens=13107 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-09-25T15:04:34.701973Z ERROR warmup{max_input_length=500 max_prefill_tokens=550 max_total_tokens=13107 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-09-25T15:04:34.707007Z ERROR warmup{max_input_length=500 max_prefill_tokens=550 max_total_tokens=13107 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-09-25T15:04:34.713119Z ERROR warmup{max_input_length=500 max_prefill_tokens=550 max_total_tokens=13107 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
Error: WebServer(Warmup(Generation("CANCELLED")))
2024-09-25T15:04:34.954646Z ERROR text_generation_launcher: Webserver Crashed
2024-09-25T15:04:34.954664Z  INFO text_generation_launcher: Shutting down shards
2024-09-25T15:04:34.963134Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=2
2024-09-25T15:04:34.963148Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=3
2024-09-25T15:04:34.963165Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=1
2024-09-25T15:04:34.964271Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=2
2024-09-25T15:04:34.964340Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=3
2024-09-25T15:04:34.964421Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=1
2024-09-25T15:04:35.023355Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=4
2024-09-25T15:04:35.024172Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=4
2024-09-25T15:04:35.029462Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=7
2024-09-25T15:04:35.030347Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=7
2024-09-25T15:04:35.030945Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=6
2024-09-25T15:04:35.032281Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=6
2024-09-25T15:04:35.032512Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=5
2024-09-25T15:04:35.034027Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=5
2024-09-25T15:04:35.047083Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2024-09-25T15:04:35.047903Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2024-09-25T15:04:35.364752Z  INFO shard-manager: text_generation_launcher: shard terminated rank=3
2024-09-25T15:04:35.465564Z  INFO shard-manager: text_generation_launcher: shard terminated rank=1
2024-09-25T15:04:35.764901Z  INFO shard-manager: text_generation_launcher: shard terminated rank=2
2024-09-25T15:04:35.931027Z  INFO shard-manager: text_generation_launcher: shard terminated rank=7
2024-09-25T15:04:36.024913Z  INFO shard-manager: text_generation_launcher: shard terminated rank=4
2024-09-25T15:04:36.248767Z  INFO shard-manager: text_generation_launcher: shard terminated rank=0
2024-09-25T15:04:36.333451Z  INFO shard-manager: text_generation_launcher: shard terminated rank=6
2024-09-25T15:04:36.635381Z  INFO shard-manager: text_generation_launcher: shard terminated rank=5
Error: WebserverFailed

Expected behavior

Meta-Llama-3.1-405B-Instruct-fp8 should start with at least 10k total tokens. I'm aware of the reported problems running Llama 3.1 with the full 128k context, but I can't even start with an input length of 500 due to the OOM error.

Meta-Llama-3.1-405B-Instruct-fp8 requires roughly 400 GB of GPU RAM to load, and my machine has 640 GB in total, so I thought that should be sufficient.
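As a back-of-the-envelope check of that budget (a rough sketch; the 1 byte/parameter figure for fp8 is an approximation and it ignores non-quantized layers and runtime overhead):

# Rough per-GPU memory budget for the fp8 checkpoint on 8x H100 80GB (assumed numbers, not measured)
params = 405e9            # ~405B parameters
bytes_per_param = 1       # fp8 weights ~ 1 byte per parameter
num_gpus = 8
gpu_mem_gb = 80

weights_gb = params * bytes_per_param / 1e9    # ~405 GB of weights in total
weights_per_gpu_gb = weights_gb / num_gpus     # ~51 GB per GPU when sharded 8-way
headroom_gb = gpu_mem_gb - weights_per_gpu_gb  # ~29 GB per GPU left for the KV cache,
                                               # activations and CUDA graph capture
print(f"weights/GPU: {weights_per_gpu_gb:.1f} GB, headroom/GPU: {headroom_gb:.1f} GB")

On these assumptions each GPU should still have roughly 29 GB free after loading the weights, which is why the OOM during warmup is surprising.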

Blair-Johnson commented 2 months ago

What happens if you pass --quantize fp8?

monnetb commented 1 month ago

--quantize fp8

This option should not be required for pre-quantized models like Meta-Llama-3.1-405B-Instruct-fp8.

I tested it anyway -> same memory issue.
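
For reference, in the deployment above that presumably amounts to adding the flag to the container args, along these lines (an assumed sketch, not the exact configuration tested):

# same container args as the original deployment, plus an explicit --quantize fp8
args: ["--model-id", "meta-llama/Meta-Llama-3.1-405B-Instruct-fp8", "--sharded", "true", "--num-shard", "8", "--quantize", "fp8", "--env"]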