chumpblocckami closed this issue 1 year ago
Same issue here with flan-t5-xl.
I am using v0.9.1 on EKS.
Full startup log below:
{"timestamp":"2023-07-06T19:14:42.852088Z","level":"INFO","fields":{"message":"Args { model_id: \"google/flan-t5-xl\", revision: None, sharded: None, num_shard: Some(1), quantize: Some(Bitsandbytes), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: 16000, max_waiting_tokens: 20, hostname: \"flan-t5-xl-64959c4d74-qs64q\", port: 80, shard_uds_path: \"/tmp/text-generation-server\", master_addr: \"localhost\", master_port: 29500, huggingface_hub_cache: Some(\"/data\"), weights_cache_override: None, disable_custom_kernels: false, json_output: true, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_domain: None, ngrok_username: None, ngrok_password: None, env: false }"},"target":"text_generation_launcher"}
{"timestamp":"2023-07-06T19:14:42.852211Z","level":"INFO","fields":{"message":"Starting download process."},"target":"text_generation_launcher"}
{"timestamp":"2023-07-06T19:14:44.549952Z","level":"WARN","fields":{"message":"No safetensors weights found for model google/flan-t5-xl at revision None. Downloading PyTorch weights.\n"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-07-06T19:14:44.604291Z","level":"INFO","fields":{"message":"Download file: pytorch_model-00001-of-00002.bin\n"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-07-06T19:15:03.211958Z","level":"INFO","fields":{"message":"Downloaded /data/models--google--flan-t5-xl/snapshots/53fd1e22aa944eee1fd336f9aee8a437e01676ce/pytorch_model-00001-of-00002.bin in 0:00:18.\n"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-07-06T19:15:03.212049Z","level":"INFO","fields":{"message":"Download: [1/2] -- ETA: 0:00:18\n"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-07-06T19:15:03.212305Z","level":"INFO","fields":{"message":"Download file: pytorch_model-00002-of-00002.bin\n"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-07-06T19:15:09.970978Z","level":"INFO","fields":{"message":"Downloaded /data/models--google--flan-t5-xl/snapshots/53fd1e22aa944eee1fd336f9aee8a437e01676ce/pytorch_model-00002-of-00002.bin in 0:00:06.\n"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-07-06T19:15:09.971051Z","level":"INFO","fields":{"message":"Download: [2/2] -- ETA: 0\n"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-07-06T19:15:09.971146Z","level":"WARN","fields":{"message":"No safetensors weights found for model google/flan-t5-xl at revision None. Converting PyTorch weights to safetensors.\n"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-07-06T19:16:15.141453Z","level":"INFO","fields":{"message":"Convert: [1/2] -- Took: 0:01:05.169846\n"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-07-06T19:16:24.913801Z","level":"INFO","fields":{"message":"Convert: [2/2] -- Took: 0:00:09.772093\n"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-07-06T19:16:25.262110Z","level":"INFO","fields":{"message":"Successfully downloaded weights."},"target":"text_generation_launcher"}
{"timestamp":"2023-07-06T19:16:25.262659Z","level":"INFO","fields":{"message":"Starting shard 0"},"target":"text_generation_launcher"}
{"timestamp":"2023-07-06T19:16:29.284643Z","level":"WARN","fields":{"message":"We're not using custom kernels.\n"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2023-07-06T19:16:30.102714Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py\", line 1005, in __init__\n self.shared = TensorParallelEmbedding(prefix=\"shared\", weights=weights)\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 280, in __init__\n weight = weights.get_sharded(f\"{prefix}.weight\", dim=0)\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py\", line 73, in get_sharded\n filename, tensor_name = self.get_filename(tensor_name)\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py\", line 49, in get_filename\n raise RuntimeError(f\"weight {tensor_name} does not exist\")\nRuntimeError: weight shared.weight does not exist\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n File \"/opt/conda/lib/python3.9/site-packages/typer/main.py\", line 311, in __call__\n return get_command(self)(*args, **kwargs)\n File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 1130, in __call__\n return self.main(*args, **kwargs)\n File \"/opt/conda/lib/python3.9/site-packages/typer/core.py\", line 778, in main\n return _main(\n File \"/opt/conda/lib/python3.9/site-packages/typer/core.py\", line 216, in _main\n rv = self.invoke(ctx)\n File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 1657, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 1404, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 760, in invoke\n return __callback(*args, **kwargs)\n File 
\"/opt/conda/lib/python3.9/site-packages/typer/main.py\", line 683, in wrapper\n return callback(**use_params) # type: ignore\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py\", line 78, in serve\n server.serve(\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 166, in serve\n asyncio.run(\n File \"/opt/conda/lib/python3.9/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 634, in run_until_complete\n self.run_forever()\n File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 601, in run_forever\n self._run_once()\n File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 1905, in _run_once\n handle._run()\n File \"/opt/conda/lib/python3.9/asyncio/events.py\", line 80, in _run\n self._context.run(self._callback, *self._args)\n> File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 133, in serve_inner\n model = get_model(\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py\", line 279, in get_model\n return T5Sharded(\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/t5.py\", line 61, in __init__\n model = T5ForConditionalGeneration(config, weights)\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py\", line 1007, in __init__\n self.shared = TensorParallelEmbedding(\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 280, in __init__\n weight = weights.get_sharded(f\"{prefix}.weight\", dim=0)\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py\", line 73, in get_sharded\n filename, tensor_name = self.get_filename(tensor_name)\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py\", line 49, in get_filename\n raise 
RuntimeError(f\"weight {tensor_name} does not exist\")\nRuntimeError: weight encoder.embed_tokens.weight does not exist\n"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2023-07-06T19:16:30.667766Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
{"timestamp":"2023-07-06T19:16:30.667795Z","level":"ERROR","fields":{"message":"Traceback (most recent call last):\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py\", line 1005, in __init__\n self.shared = TensorParallelEmbedding(prefix=\"shared\", weights=weights)\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 280, in __init__\n weight = weights.get_sharded(f\"{prefix}.weight\", dim=0)\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py\", line 73, in get_sharded\n filename, tensor_name = self.get_filename(tensor_name)\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py\", line 49, in get_filename\n raise RuntimeError(f\"weight {tensor_name} does not exist\")\n\nRuntimeError: weight shared.weight does not exist\n\n\nDuring handling of the above exception, another exception occurred:\n\n\nTraceback (most recent call last):\n\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py\", line 78, in serve\n server.serve(\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 166, in serve\n asyncio.run(\n\n File \"/opt/conda/lib/python3.9/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n\n File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 647, in run_until_complete\n return future.result()\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 133, in serve_inner\n model = get_model(\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py\", line 279, in get_model\n return T5Sharded(\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/t5.py\", line 61, in __init__\n model = 
T5ForConditionalGeneration(config, weights)\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py\", line 1007, in __init__\n self.shared = TensorParallelEmbedding(\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 280, in __init__\n weight = weights.get_sharded(f\"{prefix}.weight\", dim=0)\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py\", line 73, in get_sharded\n filename, tensor_name = self.get_filename(tensor_name)\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py\", line 49, in get_filename\n raise RuntimeError(f\"weight {tensor_name} does not exist\")\n\nRuntimeError: weight encoder.embed_tokens.weight does not exist\n\n"},"target":"text_generation_launcher"}
{"timestamp":"2023-07-06T19:16:30.667823Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
Error: ShardCannotStart
With sharding disabled (same error, but easier to read):
2023-07-06T20:10:55.686866Z INFO text_generation_launcher: Args { model_id: "google/flan-t5-xl", revision: None, sharded: Some(false), num_shard: Some(1), quantize: Some(Bitsandbytes), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: 16000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_domain: None, ngrok_username: None, ngrok_password: None, env: false }
2023-07-06T20:10:55.686971Z INFO text_generation_launcher: Starting download process.
2023-07-06T20:10:57.341715Z WARN download: text_generation_launcher: No safetensors weights found for model google/flan-t5-xl at revision None. Downloading PyTorch weights.
2023-07-06T20:10:57.417081Z INFO download: text_generation_launcher: Download file: pytorch_model-00001-of-00002.bin
2023-07-06T20:11:18.169311Z INFO download: text_generation_launcher: Downloaded /data/models--google--flan-t5-xl/snapshots/53fd1e22aa944eee1fd336f9aee8a437e01676ce/pytorch_model-00001-of-00002.bin in 0:00:20.
2023-07-06T20:11:18.169386Z INFO download: text_generation_launcher: Download: [1/2] -- ETA: 0:00:20
2023-07-06T20:11:18.169595Z INFO download: text_generation_launcher: Download file: pytorch_model-00002-of-00002.bin
2023-07-06T20:11:25.050713Z INFO download: text_generation_launcher: Downloaded /data/models--google--flan-t5-xl/snapshots/53fd1e22aa944eee1fd336f9aee8a437e01676ce/pytorch_model-00002-of-00002.bin in 0:00:06.
2023-07-06T20:11:25.050803Z INFO download: text_generation_launcher: Download: [2/2] -- ETA: 0
2023-07-06T20:11:25.050899Z WARN download: text_generation_launcher: No safetensors weights found for model google/flan-t5-xl at revision None. Converting PyTorch weights to safetensors.
2023-07-06T20:12:30.361334Z INFO download: text_generation_launcher: Convert: [1/2] -- Took: 0:01:05.309101
2023-07-06T20:12:40.112889Z INFO download: text_generation_launcher: Convert: [2/2] -- Took: 0:00:09.752118
2023-07-06T20:12:42.517781Z INFO text_generation_launcher: Successfully downloaded weights.
2023-07-06T20:12:42.518379Z INFO text_generation_launcher: Starting shard 0
2023-07-06T20:12:50.458364Z WARN shard-manager: text_generation_launcher: We're not using custom kernels.
rank=0
2023-07-06T20:12:51.265848Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py", line 1005, in __init__
self.shared = TensorParallelEmbedding(prefix="shared", weights=weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 268, in __init__
weight = weights.get_sharded(f"{prefix}.weight", dim=0)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 73, in get_sharded
filename, tensor_name = self.get_filename(tensor_name)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 49, in get_filename
raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight shared.weight does not exist
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
server.serve(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 166, in serve
asyncio.run(
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
self._run_once()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
handle._run()
File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 133, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 274, in get_model
return T5Sharded(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/t5.py", line 61, in __init__
model = T5ForConditionalGeneration(config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py", line 1007, in __init__
self.shared = TensorParallelEmbedding(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 268, in __init__
weight = weights.get_sharded(f"{prefix}.weight", dim=0)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 73, in get_sharded
filename, tensor_name = self.get_filename(tensor_name)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 49, in get_filename
raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight encoder.embed_tokens.weight does not exist
rank=0
2023-07-06T20:12:51.827130Z ERROR text_generation_launcher: Shard 0 failed to start
2023-07-06T20:12:51.827155Z ERROR text_generation_launcher: Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py", line 1005, in __init__
self.shared = TensorParallelEmbedding(prefix="shared", weights=weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 268, in __init__
weight = weights.get_sharded(f"{prefix}.weight", dim=0)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 73, in get_sharded
filename, tensor_name = self.get_filename(tensor_name)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 49, in get_filename
raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight shared.weight does not exist
Error: ShardCannotStart
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
server.serve(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 166, in serve
asyncio.run(
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 133, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 274, in get_model
return T5Sharded(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/t5.py", line 61, in __init__
model = T5ForConditionalGeneration(config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py", line 1007, in __init__
self.shared = TensorParallelEmbedding(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 268, in __init__
weight = weights.get_sharded(f"{prefix}.weight", dim=0)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 73, in get_sharded
filename, tensor_name = self.get_filename(tensor_name)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 49, in get_filename
raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight encoder.embed_tokens.weight does not exist
2023-07-06T20:12:51.827178Z INFO text_generation_launcher: Shutting down shards
Looking at it in more detail, this is the same issue as "RuntimeError: weight shared.weight does not exist" at https://github.com/huggingface/text-generation-inference/issues/541
I am also getting the same error with falcon-7B, and with most of the MPT and Falcon models.
Model: falcon-7B
RuntimeError: weight lm_head.weight does not exist
The PR above should help. It's only a matter of weight naming.
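To illustrate what "a matter of weight naming" means here, a minimal sketch (illustrative only, not TGI's actual implementation): T5 checkpoints store the tied embedding once, under shared.weight, while tied aliases such as encoder.embed_tokens.weight may be absent from the safetensors name index, so a direct lookup fails unless aliases are resolved.

```python
# Illustrative sketch of the failing lookup and an alias-based fix.
# "routing" stands in for the tensor-name -> file index built from the
# safetensors headers; the name and filename below are made up.
routing = {"shared.weight": "model-00001-of-00002.safetensors"}

def get_filename(tensor_name, aliases=()):
    # Try the requested name first, then any known aliases for tied weights.
    for name in (tensor_name, *aliases):
        if name in routing:
            return routing[name], name
    # Same error message as in the logs above.
    raise RuntimeError(f"weight {tensor_name} does not exist")

# Without an alias this reproduces the crash; with the alias it resolves
# to the tensor actually stored on disk.
filename, resolved = get_filename(
    "encoder.embed_tokens.weight", aliases=("shared.weight",)
)
```

With no alias list, the same call raises the exact RuntimeError seen in the tracebacks, which is presumably the kind of naming mismatch the PR addresses.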
Thanks @Narsil, I have just tested it (with flan-t5-xl) and I can confirm that your PR (https://github.com/huggingface/text-generation-inference/pull/561, which just got merged) has fixed this issue!
Thanks!
Thanks @Narsil, it does work for me too with flan-t5, but I just tried with t5 and the problem seems to still occur.
2023-07-13T06:35:24.607879Z INFO text_generation_launcher: Args { model_id: "t5-base", revision: None, sharded: None, num_shard: Some(1), quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: 16000, max_waiting_tokens: 20, hostname: "70904c856920", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_domain: None, ngrok_username: None, ngrok_password: None, env: false }
2023-07-13T06:35:24.608058Z INFO text_generation_launcher: Starting download process.
2023-07-13T06:35:28.958139Z INFO download: text_generation_launcher: Download file: model.safetensors
2023-07-13T06:35:30.635704Z INFO download: text_generation_launcher: Downloaded /data/models--t5-base/snapshots/fe6d9bf207cd3337512ca838a8b453f87a9178ef/model.safetensors in 0:00:01.
2023-07-13T06:35:30.635867Z INFO download: text_generation_launcher: Download: [1/1] -- ETA: 0
2023-07-13T06:35:31.326113Z INFO text_generation_launcher: Successfully downloaded weights.
2023-07-13T06:35:31.326314Z INFO text_generation_launcher: Starting shard 0
2023-07-13T06:35:35.984848Z WARN shard-manager: text_generation_launcher: We're not using custom kernels.
rank=0
2023-07-13T06:35:41.274660Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
server.serve(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 175, in serve
asyncio.run(
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
self._run_once()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
handle._run()
File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 142, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 279, in get_model
return T5Sharded(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/t5.py", line 70, in __init__
model = T5ForConditionalGeneration(config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py", line 1035, in __init__
self.lm_head = TensorParallelHead.load(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 194, in load
weight = weights.get_tensor(f"{prefix}.weight")
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 64, in get_tensor
filename, tensor_name = self.get_filename(tensor_name)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 51, in get_filename
raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight lm_head.weight does not exist
Try to update docker and run latest image:
docker pull ghcr.io/huggingface/text-generation-inference:latest
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:latest --model-id google/flan-t5-base --num-shard 2
Thanks @chumpblocckami - I did, and it does work well with flan-t5, but not with the 'regular' t5. You can reproduce by using:
docker run --shm-size 1g \
-p 8080:80 \
--gpus all ghcr.io/huggingface/text-generation-inference:latest \
--model-id t5-base \
--num-shard 1
Shall I create a separate issue for this?
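For context on why t5-base might trip where flan-t5 now works: t5-base ties its word embeddings, so the checkpoint can simply omit lm_head.weight, and a loader that reads that name directly raises exactly the error above. A minimal sketch of the tied-weight fallback (illustrative, not TGI's code; the dict and values are made up):

```python
# Illustrative: a tied-embedding checkpoint stores only "shared.weight",
# so the loader must fall back to it instead of requiring "lm_head.weight".
checkpoint = {"shared.weight": [[0.1, 0.2], [0.3, 0.4]]}  # tensors on disk

def load_lm_head(weights, tie_word_embeddings=True):
    if "lm_head.weight" in weights:
        return weights["lm_head.weight"]
    if tie_word_embeddings:
        # Tied models reuse the shared embedding matrix as the output head.
        return weights["shared.weight"]
    # Same error message as in the t5-base traceback above.
    raise RuntimeError("weight lm_head.weight does not exist")

head = load_lm_head(checkpoint)
```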
The same issue happens for OPT
RuntimeError: weight model.decoder.embed_tokens.weight does not exist
Got it too, on server version 1.0.3 (using docker) and also with latest. It fails with facebook/opt-125m but worked for me with another model, gpt2.
Hello @Narsil - can we rebuild this image if we haven't yet:
763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.3.0-tgi2.0.2-gpu-py310-cu121-ubuntu22.04
cc @philschmid
Seeing this when working to deploy Llama3-Instruct on SageMaker:
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
get_huggingface_llm_image_uri("huggingface",version="2.0.2")
{
"_name_or_path": "meta-llama/Meta-Llama-3-8B-Instruct",
"architectures": [
"LlamaModel"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"eos_token_id": 128009,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 8192,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 500000.0,
"tie_word_embeddings": false,
"torch_dtype": "float32",
"transformers_version": "4.41.2",
"use_cache": true,
"vocab_size": 128256
}
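One observation about the config above (an assumption, not a confirmed diagnosis): "architectures" lists "LlamaModel", the bare backbone class, rather than "LlamaForCausalLM". A backbone export carries no LM head, so a serving stack that looks up lm_head.weight would fail with a missing-weight error. A small heuristic check, for illustration only:

```python
# Hedged heuristic: a checkpoint exported as the bare "LlamaModel" backbone
# has no LM head, so a server expecting "lm_head.weight" would raise a
# missing-weight error. Illustrative only.
def expects_lm_head(architectures):
    # Causal-LM exports (e.g. "LlamaForCausalLM") include the output head;
    # bare backbone exports (e.g. "LlamaModel") do not.
    return any(a.endswith("ForCausalLM") for a in architectures)

config_architectures = ["LlamaModel"]  # from the config.json above
has_head = expects_lm_head(config_architectures)  # False for this config
```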
Hey @Jacobsolawetz,
Can you please try version 2.0.3?
After running:
I receive:
I tried multiple small models but every one raises the same issue.
Any tips?
Thanks