Errors when serving Taiwan LLama3 model

I tried to use TGI to serve the model (yentinglin/Llama-3-Taiwan-8B-Instruct-128k), but I got following errors. Any comment will be appreciated.

2024-07-02T01:16:56.299481Z INFO text_generation_launcher: Args {
model_id: "yentinglin/Llama-3-Taiwan-8B-Instruct-128k",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: None,
quantize: Some(
Bitsandbytes,
),
speculate: None,
dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: Some(
96000,
),
max_total_tokens: Some(
128000,
),
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: Some(
96000,
),
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "423c2839e036",
port: 80,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: Some(
"/data",
),
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
}
2024-07-02T01:16:56.299797Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-07-02T01:16:56.710693Z INFO text_generation_launcher: Bitsandbytes doesn't work with cuda graphs, deactivating them
2024-07-02T01:16:56.711086Z INFO download: text_generation_launcher: Starting download process.
2024-07-02T01:17:03.588211Z INFO text_generation_launcher: Download file: model-00001-of-00004.safetensors
2024-07-02T01:19:13.452334Z INFO text_generation_launcher: Downloaded /data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/9b899f970a1b613c7b0516d4674e5003f467faff/model-00001-of-00004.safetensors in 0:02:09.
2024-07-02T01:19:13.452806Z INFO text_generation_launcher: Download: [1/5] -- ETA: 0:08:36
2024-07-02T01:19:13.454338Z INFO text_generation_launcher: Download file: model-00002-of-00004.safetensors
2024-07-02T01:21:25.292526Z INFO text_generation_launcher: Downloaded /data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/9b899f970a1b613c7b0516d4674e5003f467faff/model-00002-of-00004.safetensors in 0:02:11.
2024-07-02T01:21:25.292664Z INFO text_generation_launcher: Download: [2/5] -- ETA: 0:06:31.500000
2024-07-02T01:21:25.293379Z INFO text_generation_launcher: Download file: model-00003-of-00004.safetensors
2024-07-02T01:23:39.538011Z INFO text_generation_launcher: Downloaded /data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/9b899f970a1b613c7b0516d4674e5003f467faff/model-00003-of-00004.safetensors in 0:02:14.
2024-07-02T01:23:39.538489Z INFO text_generation_launcher: Download: [3/5] -- ETA: 0:04:23.333334
2024-07-02T01:23:39.540114Z INFO text_generation_launcher: Download file: model-00004-of-00004.safetensors
2024-07-02T01:24:11.750936Z INFO text_generation_launcher: Downloaded /data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/9b899f970a1b613c7b0516d4674e5003f467faff/model-00004-of-00004.safetensors in 0:00:32.
2024-07-02T01:24:11.751393Z INFO text_generation_launcher: Download: [4/5] -- ETA: 0:01:47
2024-07-02T01:24:11.752106Z INFO text_generation_launcher: Download file: model.safetensors
5003f467faff/model.safetensors in 0:00:00.
2024-07-02T01:24:12.290813Z INFO text_generation_launcher: Download: [5/5] -- ETA: 0
2024-07-02T01:24:13.297441Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-07-02T01:24:13.297892Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-07-02T01:24:18.628661Z INFO text_generation_launcher: Detected system cuda
2024-07-02T01:24:23.315694Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-07-02T01:24:24.011563Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in call
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 94, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 267, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 225, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/init.py", line 591, in get_model
return FlashLlama(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 69, in init
weights = Weights(filenames, device, dtype, process_group=self.process_group)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 37, in init
raise RuntimeError(
RuntimeError: Key lm_head.weight was found in multiple files: /data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/9b899f970a1b613c7b0516d4674e5003f467faff/model.safetensors and /data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/9b899f970a1b613c7b0516d4674e5003f467faff/model-00004-of-00004.safetensors
2024-07-02T01:24:25.118582Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:658: UserWarning: You are using a Backend <class 'text_generation_server.utils.dist.FakeGroup'> as a ProcessGroup. This usage is deprecated since PyTorch 2.0. Please use a public API of PyTorch Distributed instead.
warnings.warn(
Traceback (most recent call last):

File "/opt/conda/bin/text-generation-server", line 8, in
sys.exit(app())

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 94, in serve
server.serve(

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 267, in serve
asyncio.run(

File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)

File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 225, in serve_inner
model = get_model(

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/init.py", line 591, in get_model
return FlashLlama(

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 69, in init
weights = Weights(filenames, device, dtype, process_group=self.process_group)

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 37, in init
raise RuntimeError(

RuntimeError: Key lm_head.weight was found in multiple files: /data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/9b899f970a1b613c7b0516d4674e5003f467faff/model.safetensors and /data/models--yentinglin--Llama-3-Taiwan-8B-Instruct-128k/snapshots/9b899f970a1b613c7b0516d4674e5003f467faff/model-00004-of-00004.safetensors
rank=0
2024-07-02T01:24:25.214480Z ERROR text_generation_launcher: Shard 0 failed to start
2024-07-02T01:24:25.214519Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart

This is the command to start the TGI

model=yentinglin/Llama-3-Taiwan-8B-Instruct-128k
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run -e HF_TOKEN='hf_xxxx' --gpus '"device=1"' --shm-size 1g -p 8081:80 -v $volume:/data --name Llama-3-Taiwan-8B-Instruct-128k ghcr.io/huggingface/text-generation-inference:latest --model-id $model --quantize bitsandbytes --max-input-length=96000 --max-total-tokens=128000 --max-batch-prefill-tokens 96000

huggingface / text-generation-inference

Errors when serving Taiwan LLama3 model #2159