torch.cuda.OutOfMemoryError: CUDA out of memory

System Info

Server is a self hosted supermicro server with (2) Tesla T4s.
MemTotal: 196674144 kB Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz Ubuntu Server 22.04

2023-12-15T22:20:50.795055Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.70.0
Commit sha: f3aea78fb642967838e7b5b1940a25fe67f4f7a9
Docker label: sha-f3aea78
nvidia-smi:
Fri Dec 15 22:20:50 2023       
   +---------------------------------------------------------------------------------------+
   | NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
   |-----------------------------------------+----------------------+----------------------+
   | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
   |                                         |                      |               MIG M. |
   |=========================================+======================+======================|
   |   0  Tesla T4                       Off | 00000000:86:00.0 Off |                    0 |
   | N/A   31C    P8               9W /  70W |      2MiB / 15360MiB |      0%      Default |
   |                                         |                      |                  N/A |
   +-----------------------------------------+----------------------+----------------------+
   |   1  Tesla T4                       Off | 00000000:AF:00.0 Off |                    0 |
   | N/A   31C    P8              10W /  70W |      2MiB / 15360MiB |      0%      Default |
   |                                         |                      |                  N/A |
   +-----------------------------------------+----------------------+----------------------+

   +---------------------------------------------------------------------------------------+
   | Processes:                                                                            |
   |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
   |        ID   ID                                                             Usage      |
   |=======================================================================================|
   |  No running processes found                                                           |
   +---------------------------------------------------------------------------------------+

Information

[X] Docker
[ ] The CLI directly

Tasks

[X] An officially supported command
[ ] My own modifications

Reproduction

model=tiiuae/falcon-7b-instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.3 --model-id $model

Source: https://huggingface.co/docs/text-generation-inference/quicktour

Expected behavior

I expect the model to run in docker. However, I see the following errors. I see in the documentation that this model support the Tesla T4 GPU which I'm running here.

When I monitor the graphics card when I start the docker image, I can see memory usage growing until it consumes all of the memory on both cards.

How can I adjust the docker command to run this model without exhausting the memory on the GPU?

sudo docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.3 --model-id $model
2023-12-15T17:31:38.168114Z  INFO text_generation_launcher: Args { model_id: "tiiuae/falcon-7b-instruct", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "eb34a9e2700a", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-12-15T17:31:38.168243Z  INFO download: text_generation_launcher: Starting download process.
2023-12-15T17:31:42.363020Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2023-12-15T17:31:43.074924Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2023-12-15T17:31:43.075251Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-12-15T17:31:46.746285Z  WARN text_generation_launcher: Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2

2023-12-15T17:31:46.766906Z  WARN text_generation_launcher: Could not import Mistral model: Mistral model requires flash attn v2

2023-12-15T17:31:46.767534Z  WARN text_generation_launcher: Could not import Mixtral model: Mistral model requires flash attn v2

2023-12-15T17:31:53.086545Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2023-12-15T17:31:53.545258Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0

2023-12-15T17:31:53.587029Z  INFO shard-manager: text_generation_launcher: Shard ready in 10.510932164s rank=0
2023-12-15T17:31:53.685270Z  INFO text_generation_launcher: Starting Webserver
2023-12-15T17:31:54.405520Z  WARN text_generation_router: router/src/main.rs:349: `--revision` is not set
2023-12-15T17:31:54.405547Z  WARN text_generation_router: router/src/main.rs:350: We strongly advise to set it to a known supported commit.
2023-12-15T17:31:54.543768Z  INFO text_generation_router: router/src/main.rs:371: Serving revision cf4b3c42ce2fdfe24f753f0f0d179202fea59c99 of model tiiuae/falcon-7b-instruct
2023-12-15T17:31:54.549183Z  INFO text_generation_router: router/src/main.rs:213: Warming up model
2023-12-15T17:31:57.823488Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 693, in warmup
    _, batch, _ = self.generate_token(batch)
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 936, in generate_token
    prefill_logprobs_tensor = torch.log_softmax(out, -1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 508.00 MiB. GPU 0 has a total capacty of 14.58 GiB of which 275.56 MiB is free. Process 14002 has 14.31 GiB memory in use. Of the allocated memory 14.04 GiB is allocated by PyTorch, and 142.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 228, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 73, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 695, in warmup
    raise RuntimeError(
RuntimeError: Not enough memory to handle 4 prefill tokens. You need to decrease `--max-batch-prefill-tokens`

2023-12-15T17:31:57.823884Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096 max_total_tokens=2048}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 4 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
Error: Warmup(Generation("Not enough memory to handle 4 prefill tokens. You need to decrease `--max-batch-prefill-tokens`"))
2023-12-15T17:31:57.891899Z ERROR text_generation_launcher: Webserver Crashed
2023-12-15T17:31:57.891933Z  INFO text_generation_launcher: Shutting down shards
2023-12-15T17:31:58.102838Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=0
Error: WebserverFailed

I tried this new command. I had to change bfloat16 to float16. Apparently the Tesla T4 does not support it.

sudo docker run --gpus all --shm-size 1g -p 8080:80 -v /home/deeznnutz/discourse/data:/data ghcr.io/huggingface/text-generation-inference:1.3 --model-id tiiuae/falcon-7b-instruct --sharded true --num-shard 2 --dtype float16

I got a new error

ValueError: `num_heads` must be divisible by `num_shards` (got `num_heads`: 71 and `num_shards`: 2
 rank=0

Full results.

2023-12-16T09:36:12.416903Z  INFO text_generation_launcher: Args { model_id: "tiiuae/falcon-7b-instruct", revision: None, validation_workers: 2, sharded: Some(true), num_shard: Some(2), quantize: None, speculate: None, dtype: Some(Float16), trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "09420e26d9c7", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-12-16T09:36:12.416936Z  INFO text_generation_launcher: Sharding model on 2 processes
2023-12-16T09:36:12.417062Z  INFO download: text_generation_launcher: Starting download process.
2023-12-16T09:36:16.298687Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2023-12-16T09:36:16.923126Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2023-12-16T09:36:16.923474Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-12-16T09:36:16.923525Z  INFO shard-manager: text_generation_launcher: Starting shard rank=1
2023-12-16T09:36:20.687988Z  WARN text_generation_launcher: Disabling exllama v2 and using v1 instead because there are issues when sharding

2023-12-16T09:36:20.779303Z  WARN text_generation_launcher: Disabling exllama v2 and using v1 instead because there are issues when sharding

2023-12-16T09:36:20.786712Z  WARN text_generation_launcher: Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2

2023-12-16T09:36:20.807420Z  WARN text_generation_launcher: Could not import Mistral model: Mistral model requires flash attn v2

2023-12-16T09:36:20.808055Z  WARN text_generation_launcher: Could not import Mixtral model: Mistral model requires flash attn v2

2023-12-16T09:36:20.850455Z  WARN text_generation_launcher: Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2

2023-12-16T09:36:20.872033Z  WARN text_generation_launcher: Could not import Mistral model: Mistral model requires flash attn v2

2023-12-16T09:36:20.872711Z  WARN text_generation_launcher: Could not import Mixtral model: Mistral model requires flash attn v2

2023-12-16T09:36:23.274668Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 228, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 174, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 271, in get_model
    return FlashRWSharded(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_rw.py", line 67, in __init__
    model = FlashRWForCausalLM(config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 614, in __init__
    self.transformer = FlashRWModel(config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 556, in __init__
    [
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 557, in <listcomp>
    FlashRWLayer(layer_id, config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 385, in __init__
    self.self_attention = FlashRWAttention(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 138, in __init__
    raise ValueError(
ValueError: `num_heads` must be divisible by `num_shards` (got `num_heads`: 71 and `num_shards`: 2

2023-12-16T09:36:23.280687Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 228, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 174, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 271, in get_model
    return FlashRWSharded(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_rw.py", line 67, in __init__
    model = FlashRWForCausalLM(config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 614, in __init__
    self.transformer = FlashRWModel(config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 556, in __init__
    [
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 557, in <listcomp>
    FlashRWLayer(layer_id, config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 385, in __init__
    self.self_attention = FlashRWAttention(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 138, in __init__
    raise ValueError(
ValueError: `num_heads` must be divisible by `num_shards` (got `num_heads`: 71 and `num_shards`: 2

2023-12-16T09:36:24.532439Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

You are using a model of type falcon to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve
    server.serve(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 228, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 174, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 271, in get_model
    return FlashRWSharded(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_rw.py", line 67, in __init__
    model = FlashRWForCausalLM(config, weights)

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 614, in __init__
    self.transformer = FlashRWModel(config, weights)

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 556, in __init__
    [

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 557, in <listcomp>
    FlashRWLayer(layer_id, config, weights)

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 385, in __init__
    self.self_attention = FlashRWAttention(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 138, in __init__
    raise ValueError(

ValueError: `num_heads` must be divisible by `num_shards` (got `num_heads`: 71 and `num_shards`: 2
 rank=0
2023-12-16T09:36:24.632149Z ERROR text_generation_launcher: Shard 0 failed to start
2023-12-16T09:36:24.632181Z  INFO text_generation_launcher: Shutting down shards
2023-12-16T09:36:24.652173Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=1
Error: ShardCannotStart

huggingface / text-generation-inference