First, try setting `dtype` to `bfloat16`; this will roughly halve memory usage if the model loads as float32 by default. If that fails, try setting `sharded` to `true` and `num_shard` to `2` to split the model across your two GPUs. After that, look at lowering the maximum batch token limits, as the error message suggests; the sequence lengths you allow directly affect memory consumption.

(I don't have the CLI in front of me, so these may not be the exact parameter names.)
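Putting those suggestions together, here is a sketch of the launch command (the host path and image tag are copied from the commands used later in this thread; flag names match the TGI launcher):

```
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v /home/deeznnutz/discourse/data:/data \
  ghcr.io/huggingface/text-generation-inference:1.3 \
  --model-id tiiuae/falcon-7b-instruct \
  --dtype bfloat16 \
  --sharded true --num-shard 2
```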
I tried this new command. I had to change `bfloat16` to `float16`; apparently the Tesla T4 does not support bfloat16.
```
sudo docker run --gpus all --shm-size 1g -p 8080:80 \
  -v /home/deeznnutz/discourse/data:/data \
  ghcr.io/huggingface/text-generation-inference:1.3 \
  --model-id tiiuae/falcon-7b-instruct \
  --sharded true --num-shard 2 --dtype float16
```
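(As an aside: bfloat16 requires CUDA compute capability 8.0 or newer, and the T4 is 7.5, which is also why Flash Attention V2 is unavailable in the log below. Assuming a reasonably recent driver, nvidia-smi can report the capability directly:)

```
# Report each visible GPU's CUDA compute capability
# (a Tesla T4 reports 7.5; bfloat16 needs 8.0+)
nvidia-smi --query-gpu=name,compute_cap --format=csv
```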
Running it, I got a new error:

```
ValueError: `num_heads` must be divisible by `num_shards` (got `num_heads`: 71 and `num_shards`: 2
rank=0
```
Full results:

```
2023-12-16T09:36:12.416903Z INFO text_generation_launcher: Args { model_id: "tiiuae/falcon-7b-instruct", revision: None, validation_workers: 2, sharded: Some(true), num_shard: Some(2), quantize: None, speculate: None, dtype: Some(Float16), trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "09420e26d9c7", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-12-16T09:36:12.416936Z INFO text_generation_launcher: Sharding model on 2 processes
2023-12-16T09:36:12.417062Z INFO download: text_generation_launcher: Starting download process.
2023-12-16T09:36:16.298687Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2023-12-16T09:36:16.923126Z INFO download: text_generation_launcher: Successfully downloaded weights.
2023-12-16T09:36:16.923474Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-12-16T09:36:16.923525Z INFO shard-manager: text_generation_launcher: Starting shard rank=1
2023-12-16T09:36:20.687988Z WARN text_generation_launcher: Disabling exllama v2 and using v1 instead because there are issues when sharding
2023-12-16T09:36:20.779303Z WARN text_generation_launcher: Disabling exllama v2 and using v1 instead because there are issues when sharding
2023-12-16T09:36:20.786712Z WARN text_generation_launcher: Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2
2023-12-16T09:36:20.807420Z WARN text_generation_launcher: Could not import Mistral model: Mistral model requires flash attn v2
2023-12-16T09:36:20.808055Z WARN text_generation_launcher: Could not import Mixtral model: Mistral model requires flash attn v2
2023-12-16T09:36:20.850455Z WARN text_generation_launcher: Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2
2023-12-16T09:36:20.872033Z WARN text_generation_launcher: Could not import Mistral model: Mistral model requires flash attn v2
2023-12-16T09:36:20.872711Z WARN text_generation_launcher: Could not import Mixtral model: Mistral model requires flash attn v2
2023-12-16T09:36:23.274668Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 228, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 174, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 271, in get_model
return FlashRWSharded(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_rw.py", line 67, in __init__
model = FlashRWForCausalLM(config, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 614, in __init__
self.transformer = FlashRWModel(config, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 556, in __init__
[
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 557, in <listcomp>
FlashRWLayer(layer_id, config, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 385, in __init__
self.self_attention = FlashRWAttention(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 138, in __init__
raise ValueError(
ValueError: `num_heads` must be divisible by `num_shards` (got `num_heads`: 71 and `num_shards`: 2
2023-12-16T09:36:23.280687Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 228, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 174, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 271, in get_model
return FlashRWSharded(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_rw.py", line 67, in __init__
model = FlashRWForCausalLM(config, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 614, in __init__
self.transformer = FlashRWModel(config, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 556, in __init__
[
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 557, in <listcomp>
FlashRWLayer(layer_id, config, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 385, in __init__
self.self_attention = FlashRWAttention(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 138, in __init__
raise ValueError(
ValueError: `num_heads` must be divisible by `num_shards` (got `num_heads`: 71 and `num_shards`: 2
2023-12-16T09:36:24.532439Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
You are using a model of type falcon to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 228, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 174, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 271, in get_model
return FlashRWSharded(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_rw.py", line 67, in __init__
model = FlashRWForCausalLM(config, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 614, in __init__
self.transformer = FlashRWModel(config, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 556, in __init__
[
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 557, in <listcomp>
FlashRWLayer(layer_id, config, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 385, in __init__
self.self_attention = FlashRWAttention(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 138, in __init__
raise ValueError(
ValueError: `num_heads` must be divisible by `num_shards` (got `num_heads`: 71 and `num_shards`: 2
rank=0
2023-12-16T09:36:24.632149Z ERROR text_generation_launcher: Shard 0 failed to start
2023-12-16T09:36:24.632181Z INFO text_generation_launcher: Shutting down shards
2023-12-16T09:36:24.652173Z INFO shard-manager: text_generation_launcher: Shard terminated rank=1
Error: ShardCannotStart
```
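The root cause is visible in the ValueError: Falcon-7B uses 71 attention heads, and 71 is prime, so the heads cannot be divided evenly across 2 tensor-parallel shards; dropping `--sharded`/`--num-shard` avoids the error. You can confirm the head count from the model config (a sketch; the key name varies between config revisions):

```
# Fetch the model config and show the attention-head count
# (older Falcon configs use "n_head", newer ones "num_attention_heads")
curl -s https://huggingface.co/tiiuae/falcon-7b-instruct/raw/main/config.json \
  | grep -E '"(n_head|num_attention_heads)"'
```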
I finally got it to run with this command:

```
sudo docker run --gpus all --shm-size 1g -p 8080:80 \
  -v /home/deeznnutz/discourse/data:/data \
  ghcr.io/huggingface/text-generation-inference:1.3 \
  --model-id tiiuae/falcon-7b-instruct \
  --max-batch-prefill-tokens 2048
```
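With the server up, a quick smoke test against the mapped port (this is the `/generate` request from the quicktour; the prompt and token count are arbitrary):

```
curl 127.0.0.1:8080/generate \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
  -H 'Content-Type: application/json'
```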
I also found this resource helpful for tuning TGI on the Tesla T4:
https://github.com/huggingface/text-generation-inference/issues/629
System Info
The server is a self-hosted Supermicro machine with two Tesla T4 GPUs.
MemTotal: 196674144 kB
CPU: Intel(R) Xeon(R) Silver 4210 @ 2.20GHz
OS: Ubuntu Server 22.04
Reproduction
Source: https://huggingface.co/docs/text-generation-inference/quicktour
Expected behavior
I expect the model to run in Docker. However, I see the errors above, even though the documentation says this model supports the Tesla T4 GPU I'm running here.
When I monitor the graphics cards as the Docker image starts, I can see memory usage growing until it consumes all of the memory on both cards.
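(For anyone reproducing this, a simple way to watch that growth is to sample GPU memory once per second:)

```
# Print per-GPU memory usage every second
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1
```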
How can I adjust the docker command to run this model without exhausting GPU memory?
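For reference, the memory-related knobs visible in the Args dump above map to these launcher flags; a sketch with illustrative values (tune them to your prompts and traffic):

```
# Values below are illustrative, not recommendations
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v /home/deeznnutz/discourse/data:/data \
  ghcr.io/huggingface/text-generation-inference:1.3 \
  --model-id tiiuae/falcon-7b-instruct \
  --max-input-length 1024 \
  --max-total-tokens 2048 \
  --max-batch-prefill-tokens 2048 \
  --cuda-memory-fraction 0.9
```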