2024-09-25T14:29:44.260191Z INFO text_generation_launcher: Args {
model_id: "meta-llama/Meta-Llama-3.1-405B-Instruct-fp8",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: Some(
8,
),
quantize: None,
speculate: None,
dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: Some(
500,
),
max_total_tokens: Some(
13107,
),
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: Some(
550,
),
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "tgi-llama-6dfd4d944f-vmdkw",
port: 80,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: Some(
"/data",
),
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: true,
max_client_batch_size: 4,
lora_adapters: None,
disable_usage_stats: false,
disable_crash_reports: false,
}
2024-09-25T14:29:44.260260Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-09-25T14:29:44.441323Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-09-25T14:29:44.441331Z INFO text_generation_launcher: Sharding model on 8 processes
2024-09-25T14:29:44.441452Z INFO download: text_generation_launcher: Starting check and download process for meta-llama/Meta-Llama-3.1-405B-Instruct-fp8
2024-09-25T15:00:51.799015Z INFO download: text_generation_launcher: Successfully downloaded weights for meta-llama/Meta-Llama-3.1-405B-Instruct-fp8
2024-09-25T15:00:51.799235Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-09-25T15:00:51.799251Z INFO shard-manager: text_generation_launcher: Starting shard rank=1
2024-09-25T15:00:51.799601Z INFO shard-manager: text_generation_launcher: Starting shard rank=2
2024-09-25T15:00:51.800066Z INFO shard-manager: text_generation_launcher: Starting shard rank=3
2024-09-25T15:00:51.800097Z INFO shard-manager: text_generation_launcher: Starting shard rank=4
2024-09-25T15:00:51.801546Z INFO shard-manager: text_generation_launcher: Starting shard rank=5
2024-09-25T15:00:51.801585Z INFO shard-manager: text_generation_launcher: Starting shard rank=6
2024-09-25T15:00:51.802622Z INFO shard-manager: text_generation_launcher: Starting shard rank=7
2024-09-25T15:00:56.515337Z INFO text_generation_launcher: Auto selecting quantization method fp8
2024-09-25T15:01:01.806057Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-25T15:01:01.807285Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-09-25T15:01:01.807322Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2024-09-25T15:01:01.807360Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=4
2024-09-25T15:01:01.808804Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2024-09-25T15:01:01.809297Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=6
2024-09-25T15:01:01.809605Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=7
2024-09-25T15:01:01.814302Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=5
2024-09-25T15:01:05.514208Z INFO text_generation_launcher: Using FBGEMM fp8 optimized kernels
2024-09-25T15:04:30.363596Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-2
2024-09-25T15:04:30.371516Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-3
2024-09-25T15:04:30.372803Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-4
2024-09-25T15:04:30.372919Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-5
2024-09-25T15:04:30.372927Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-7
2024-09-25T15:04:30.373540Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-09-25T15:04:30.373927Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
2024-09-25T15:04:30.420621Z INFO shard-manager: text_generation_launcher: Shard ready in 218.618910525s rank=4
2024-09-25T15:04:30.426690Z INFO shard-manager: text_generation_launcher: Shard ready in 218.622944116s rank=7
2024-09-25T15:04:30.427452Z INFO shard-manager: text_generation_launcher: Shard ready in 218.62400201s rank=5
2024-09-25T15:04:30.444388Z INFO shard-manager: text_generation_launcher: Shard ready in 218.644204722s rank=0
2024-09-25T15:04:30.460515Z INFO shard-manager: text_generation_launcher: Shard ready in 218.658884257s rank=2
2024-09-25T15:04:30.460530Z INFO shard-manager: text_generation_launcher: Shard ready in 218.658891373s rank=1
2024-09-25T15:04:30.460532Z INFO shard-manager: text_generation_launcher: Shard ready in 218.657400525s rank=3
2024-09-25T15:04:30.556841Z INFO text_generation_launcher: Starting Webserver
2024-09-25T15:04:30.664794Z INFO text_generation_router: router/src/main.rs:228: Using the Hugging Face API
2024-09-25T15:04:30.664836Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-09-25T15:04:31.378511Z INFO text_generation_router: router/src/main.rs:577: Serving revision 2147c7e74f1bf338ad11843e450ee174df547589 of model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8
2024-09-25T15:04:31.597861Z INFO text_generation_router: router/src/main.rs:357: Using config Some(Llama)
2024-09-25T15:04:31.597869Z WARN text_generation_router: router/src/main.rs:384: Invalid hostname, defaulting to 0.0.0.0
2024-09-25T15:04:31.851898Z INFO text_generation_router::server: router/src/server.rs:1572: Warming up model
2024-09-25T15:04:33.037820Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
2024-09-25T15:04:34.456876Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
2024-09-25T15:04:34.519240Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 125, in Warmup
max_supported_total_tokens = self.model.warmup(batch)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1196, in warmup
self.cuda_graph_warmup(bs, max_s, max_bt)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1065, in cuda_graph_warmup
with torch.cuda.graph(graph, pool=MEM_POOL):
File "/opt/conda/lib/python3.10/site-packages/torch/cuda/graphs.py", line 184, in __exit__
self.cuda_graph.capture_end()
File "/opt/conda/lib/python3.10/site-packages/torch/cuda/graphs.py", line 82, in capture_end
super().capture_end()
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-09-25T15:04:34.598137Z ERROR warmup{max_input_length=500 max_prefill_tokens=550 max_total_tokens=13107 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-09-25T15:04:34.617895Z ERROR warmup{max_input_length=500 max_prefill_tokens=550 max_total_tokens=13107 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-09-25T15:04:34.650181Z ERROR warmup{max_input_length=500 max_prefill_tokens=550 max_total_tokens=13107 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-09-25T15:04:34.677632Z ERROR warmup{max_input_length=500 max_prefill_tokens=550 max_total_tokens=13107 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-09-25T15:04:34.680492Z ERROR warmup{max_input_length=500 max_prefill_tokens=550 max_total_tokens=13107 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-09-25T15:04:34.701973Z ERROR warmup{max_input_length=500 max_prefill_tokens=550 max_total_tokens=13107 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-09-25T15:04:34.707007Z ERROR warmup{max_input_length=500 max_prefill_tokens=550 max_total_tokens=13107 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-09-25T15:04:34.713119Z ERROR warmup{max_input_length=500 max_prefill_tokens=550 max_total_tokens=13107 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
Error: WebServer(Warmup(Generation("CANCELLED")))
2024-09-25T15:04:34.954646Z ERROR text_generation_launcher: Webserver Crashed
2024-09-25T15:04:34.954664Z INFO text_generation_launcher: Shutting down shards
2024-09-25T15:04:34.963134Z INFO shard-manager: text_generation_launcher: Terminating shard rank=2
2024-09-25T15:04:34.963148Z INFO shard-manager: text_generation_launcher: Terminating shard rank=3
2024-09-25T15:04:34.963165Z INFO shard-manager: text_generation_launcher: Terminating shard rank=1
2024-09-25T15:04:34.964271Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=2
2024-09-25T15:04:34.964340Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=3
2024-09-25T15:04:34.964421Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=1
2024-09-25T15:04:35.023355Z INFO shard-manager: text_generation_launcher: Terminating shard rank=4
2024-09-25T15:04:35.024172Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=4
2024-09-25T15:04:35.029462Z INFO shard-manager: text_generation_launcher: Terminating shard rank=7
2024-09-25T15:04:35.030347Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=7
2024-09-25T15:04:35.030945Z INFO shard-manager: text_generation_launcher: Terminating shard rank=6
2024-09-25T15:04:35.032281Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=6
2024-09-25T15:04:35.032512Z INFO shard-manager: text_generation_launcher: Terminating shard rank=5
2024-09-25T15:04:35.034027Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=5
2024-09-25T15:04:35.047083Z INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2024-09-25T15:04:35.047903Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2024-09-25T15:04:35.364752Z INFO shard-manager: text_generation_launcher: shard terminated rank=3
2024-09-25T15:04:35.465564Z INFO shard-manager: text_generation_launcher: shard terminated rank=1
2024-09-25T15:04:35.764901Z INFO shard-manager: text_generation_launcher: shard terminated rank=2
2024-09-25T15:04:35.931027Z INFO shard-manager: text_generation_launcher: shard terminated rank=7
2024-09-25T15:04:36.024913Z INFO shard-manager: text_generation_launcher: shard terminated rank=4
2024-09-25T15:04:36.248767Z INFO shard-manager: text_generation_launcher: shard terminated rank=0
2024-09-25T15:04:36.333451Z INFO shard-manager: text_generation_launcher: shard terminated rank=6
2024-09-25T15:04:36.635381Z INFO shard-manager: text_generation_launcher: shard terminated rank=5
Error: WebserverFailed
Expected behavior
Meta-Llama-3.1-405B-Instruct-fp8 should start with a context of at least 10k tokens.
I'm aware of the reported problems running Llama 3.1 with the full 128k context, but I can't even start it with max_input_length=500 because of the OOM error.
Meta-Llama-3.1-405B-Instruct-fp8 needs about 400 GB of GPU memory to load, and my machine has 640 GB in total, so I expected that to be enough.
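To make the memory reasoning above concrete, here is a rough per-GPU estimate. It is only a sketch: the model shape values (126 layers, 8 KV heads, head dimension 128) and the 2-byte KV cache entries are assumptions based on the published Llama 3.1 405B configuration, not values taken from the log, and the calculation ignores activations, CUDA graph memory pools, communication buffers and allocator fragmentation.

```python
# Back-of-the-envelope per-GPU memory budget for the setup in this report.
# Values marked "assumed" do not come from the log; they are assumptions
# about the Llama 3.1 405B architecture and the KV cache dtype.

GB = 1e9

num_gpus = 8                 # num_shard in the launcher args
gpu_mem_gb = 80              # 8x H100, 640 GB in total
params = 405e9               # assumed parameter count
bytes_per_weight = 1         # fp8 weights

num_layers = 126             # assumed
num_kv_heads = 8             # assumed (grouped-query attention)
head_dim = 128               # assumed
kv_bytes = 2                 # assumed fp16/bf16 KV cache entries

max_total_tokens = 13107     # max_total_tokens in the launcher args

# Weights are split evenly across the shards.
weights_per_gpu_gb = params * bytes_per_weight / num_gpus / GB
headroom_per_gpu_gb = gpu_mem_gb - weights_per_gpu_gb

# One K and one V vector per layer and KV head, split evenly across shards.
kv_bytes_per_token_per_gpu = (
    2 * num_layers * num_kv_heads * head_dim * kv_bytes / num_gpus
)
kv_per_request_gb = kv_bytes_per_token_per_gpu * max_total_tokens / GB

print(f"weights per GPU:      {weights_per_gpu_gb:.1f} GB")
print(f"headroom per GPU:     {headroom_per_gpu_gb:.1f} GB")
print(f"KV cache per request: {kv_per_request_gb:.2f} GB per GPU")
```

Under these assumptions each GPU holds roughly 50 GB of weights, which leaves under 30 GB per GPU for everything else. The limit that matters is therefore per GPU, not the 640 GB aggregate, and whatever is left after warmup has to cover the KV cache plus the CUDA graph capture that fails in the traceback.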
System Info
TGI version: 2.2.0 (also tested with 2.3.0)
Machine: 8x H100 (640 GB of GPU memory in total)