huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

RuntimeError: weight encoder.embed_tokens.weight does not exist #556

Closed chumpblocckami closed 1 year ago

chumpblocckami commented 1 year ago

After running:

docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:0.9 --model-id google/flan-t5-small --num-shard 1

I receive:

RuntimeError: weight encoder.embed_tokens.weight does not exist

I tried multiple small models, but every one raises the same issue.

Any tips?

Thanks
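For context on what this error means: the server resolves each tensor name to a safetensors file via a routing map built from the file headers, and raises when the requested name is absent. Below is a simplified, illustrative sketch of that lookup (not the actual TGI source; file and tensor names are examples):

```python
# Illustrative routing map: tensor name -> safetensors file containing it.
routing = {
    "shared.weight": "model-00001-of-00002.safetensors",
    "decoder.embed_tokens.weight": "model-00001-of-00002.safetensors",
}

def get_filename(tensor_name: str) -> str:
    """Look up which file holds a tensor; raise like TGI does if absent."""
    filename = routing.get(tensor_name)
    if filename is None:
        raise RuntimeError(f"weight {tensor_name} does not exist")
    return filename

# T5 checkpoints store the tied embedding once, under "shared", so a lookup
# for "encoder.embed_tokens.weight" fails even though the data exists.
try:
    get_filename("encoder.embed_tokens.weight")
except RuntimeError as err:
    print(err)  # weight encoder.embed_tokens.weight does not exist
```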

zoltan-fedor commented 1 year ago

Same issue here with flan-t5-xl. I am using v0.9.1 on EKS.

Full startup log below:

{"timestamp":"2023-07-06T19:14:42.852088Z","level":"INFO","fields":{"message":"Args { model_id: \"google/flan-t5-xl\", revision: None, sharded: None, num_shard: Some(1), quantize: Some(Bitsandbytes), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: 16000, max_waiting_tokens: 20, hostname: \"flan-t5-xl-64959c4d74-qs64q\", port: 80, shard_uds_path: \"/tmp/text-generation-server\", master_addr: \"localhost\", master_port: 29500, huggingface_hub_cache: Some(\"/data\"), weights_cache_override: None, disable_custom_kernels: false, json_output: true, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_domain: None, ngrok_username: None, ngrok_password: None, env: false }"},"target":"text_generation_launcher"}
{"timestamp":"2023-07-06T19:14:42.852211Z","level":"INFO","fields":{"message":"Starting download process."},"target":"text_generation_launcher"}
{"timestamp":"2023-07-06T19:14:44.549952Z","level":"WARN","fields":{"message":"No safetensors weights found for model google/flan-t5-xl at revision None. Downloading PyTorch weights.\n"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-07-06T19:14:44.604291Z","level":"INFO","fields":{"message":"Download file: pytorch_model-00001-of-00002.bin\n"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-07-06T19:15:03.211958Z","level":"INFO","fields":{"message":"Downloaded /data/models--google--flan-t5-xl/snapshots/53fd1e22aa944eee1fd336f9aee8a437e01676ce/pytorch_model-00001-of-00002.bin in 0:00:18.\n"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-07-06T19:15:03.212049Z","level":"INFO","fields":{"message":"Download: [1/2] -- ETA: 0:00:18\n"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-07-06T19:15:03.212305Z","level":"INFO","fields":{"message":"Download file: pytorch_model-00002-of-00002.bin\n"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-07-06T19:15:09.970978Z","level":"INFO","fields":{"message":"Downloaded /data/models--google--flan-t5-xl/snapshots/53fd1e22aa944eee1fd336f9aee8a437e01676ce/pytorch_model-00002-of-00002.bin in 0:00:06.\n"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-07-06T19:15:09.971051Z","level":"INFO","fields":{"message":"Download: [2/2] -- ETA: 0\n"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-07-06T19:15:09.971146Z","level":"WARN","fields":{"message":"No safetensors weights found for model google/flan-t5-xl at revision None. Converting PyTorch weights to safetensors.\n"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-07-06T19:16:15.141453Z","level":"INFO","fields":{"message":"Convert: [1/2] -- Took: 0:01:05.169846\n"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-07-06T19:16:24.913801Z","level":"INFO","fields":{"message":"Convert: [2/2] -- Took: 0:00:09.772093\n"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-07-06T19:16:25.262110Z","level":"INFO","fields":{"message":"Successfully downloaded weights."},"target":"text_generation_launcher"}
{"timestamp":"2023-07-06T19:16:25.262659Z","level":"INFO","fields":{"message":"Starting shard 0"},"target":"text_generation_launcher"}
{"timestamp":"2023-07-06T19:16:29.284643Z","level":"WARN","fields":{"message":"We're not using custom kernels.\n"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2023-07-06T19:16:30.102714Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py\", line 1005, in __init__\n    self.shared = TensorParallelEmbedding(prefix=\"shared\", weights=weights)\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 280, in __init__\n    weight = weights.get_sharded(f\"{prefix}.weight\", dim=0)\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py\", line 73, in get_sharded\n    filename, tensor_name = self.get_filename(tensor_name)\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py\", line 49, in get_filename\n    raise RuntimeError(f\"weight {tensor_name} does not exist\")\nRuntimeError: weight shared.weight does not exist\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n    sys.exit(app())\n  File \"/opt/conda/lib/python3.9/site-packages/typer/main.py\", line 311, in __call__\n    return get_command(self)(*args, **kwargs)\n  File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 1130, in __call__\n    return self.main(*args, **kwargs)\n  File \"/opt/conda/lib/python3.9/site-packages/typer/core.py\", line 778, in main\n    return _main(\n  File \"/opt/conda/lib/python3.9/site-packages/typer/core.py\", line 216, in _main\n    rv = self.invoke(ctx)\n  File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 1657, in invoke\n    return _process_result(sub_ctx.command.invoke(sub_ctx))\n  File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 1404, in invoke\n    return ctx.invoke(self.callback, **ctx.params)\n  File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 760, in 
invoke\n    return __callback(*args, **kwargs)\n  File \"/opt/conda/lib/python3.9/site-packages/typer/main.py\", line 683, in wrapper\n    return callback(**use_params)  # type: ignore\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py\", line 78, in serve\n    server.serve(\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 166, in serve\n    asyncio.run(\n  File \"/opt/conda/lib/python3.9/asyncio/runners.py\", line 44, in run\n    return loop.run_until_complete(main)\n  File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 634, in run_until_complete\n    self.run_forever()\n  File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 601, in run_forever\n    self._run_once()\n  File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 1905, in _run_once\n    handle._run()\n  File \"/opt/conda/lib/python3.9/asyncio/events.py\", line 80, in _run\n    self._context.run(self._callback, *self._args)\n> File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 133, in serve_inner\n    model = get_model(\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py\", line 279, in get_model\n    return T5Sharded(\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/t5.py\", line 61, in __init__\n    model = T5ForConditionalGeneration(config, weights)\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py\", line 1007, in __init__\n    self.shared = TensorParallelEmbedding(\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 280, in __init__\n    weight = weights.get_sharded(f\"{prefix}.weight\", dim=0)\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py\", line 73, in get_sharded\n    filename, tensor_name = self.get_filename(tensor_name)\n  File 
\"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py\", line 49, in get_filename\n    raise RuntimeError(f\"weight {tensor_name} does not exist\")\nRuntimeError: weight encoder.embed_tokens.weight does not exist\n"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2023-07-06T19:16:30.667766Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
{"timestamp":"2023-07-06T19:16:30.667795Z","level":"ERROR","fields":{"message":"Traceback (most recent call last):\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py\", line 1005, in __init__\n    self.shared = TensorParallelEmbedding(prefix=\"shared\", weights=weights)\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 280, in __init__\n    weight = weights.get_sharded(f\"{prefix}.weight\", dim=0)\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py\", line 73, in get_sharded\n    filename, tensor_name = self.get_filename(tensor_name)\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py\", line 49, in get_filename\n    raise RuntimeError(f\"weight {tensor_name} does not exist\")\n\nRuntimeError: weight shared.weight does not exist\n\n\nDuring handling of the above exception, another exception occurred:\n\n\nTraceback (most recent call last):\n\n  File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n    sys.exit(app())\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py\", line 78, in serve\n    server.serve(\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 166, in serve\n    asyncio.run(\n\n  File \"/opt/conda/lib/python3.9/asyncio/runners.py\", line 44, in run\n    return loop.run_until_complete(main)\n\n  File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 647, in run_until_complete\n    return future.result()\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 133, in serve_inner\n    model = get_model(\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py\", line 279, in get_model\n    return T5Sharded(\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/t5.py\", line 61, in 
__init__\n    model = T5ForConditionalGeneration(config, weights)\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py\", line 1007, in __init__\n    self.shared = TensorParallelEmbedding(\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 280, in __init__\n    weight = weights.get_sharded(f\"{prefix}.weight\", dim=0)\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py\", line 73, in get_sharded\n    filename, tensor_name = self.get_filename(tensor_name)\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py\", line 49, in get_filename\n    raise RuntimeError(f\"weight {tensor_name} does not exist\")\n\nRuntimeError: weight encoder.embed_tokens.weight does not exist\n\n"},"target":"text_generation_launcher"}
{"timestamp":"2023-07-06T19:16:30.667823Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
Error: ShardCannotStart

With sharding disabled (same error, but easier to read):

2023-07-06T20:10:55.686866Z  INFO text_generation_launcher: Args { model_id: "google/flan-t5-xl", revision: None, sharded: Some(false), num_shard: Some(1), quantize: Some(Bitsandbytes), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: 16000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_domain: None, ngrok_username: None, ngrok_password: None, env: false }
2023-07-06T20:10:55.686971Z  INFO text_generation_launcher: Starting download process.
2023-07-06T20:10:57.341715Z  WARN download: text_generation_launcher: No safetensors weights found for model google/flan-t5-xl at revision None. Downloading PyTorch weights.

2023-07-06T20:10:57.417081Z  INFO download: text_generation_launcher: Download file: pytorch_model-00001-of-00002.bin

2023-07-06T20:11:18.169311Z  INFO download: text_generation_launcher: Downloaded /data/models--google--flan-t5-xl/snapshots/53fd1e22aa944eee1fd336f9aee8a437e01676ce/pytorch_model-00001-of-00002.bin in 0:00:20.

2023-07-06T20:11:18.169386Z  INFO download: text_generation_launcher: Download: [1/2] -- ETA: 0:00:20

2023-07-06T20:11:18.169595Z  INFO download: text_generation_launcher: Download file: pytorch_model-00002-of-00002.bin

2023-07-06T20:11:25.050713Z  INFO download: text_generation_launcher: Downloaded /data/models--google--flan-t5-xl/snapshots/53fd1e22aa944eee1fd336f9aee8a437e01676ce/pytorch_model-00002-of-00002.bin in 0:00:06.

2023-07-06T20:11:25.050803Z  INFO download: text_generation_launcher: Download: [2/2] -- ETA: 0

2023-07-06T20:11:25.050899Z  WARN download: text_generation_launcher: No safetensors weights found for model google/flan-t5-xl at revision None. Converting PyTorch weights to safetensors.

2023-07-06T20:12:30.361334Z  INFO download: text_generation_launcher: Convert: [1/2] -- Took: 0:01:05.309101

2023-07-06T20:12:40.112889Z  INFO download: text_generation_launcher: Convert: [2/2] -- Took: 0:00:09.752118

2023-07-06T20:12:42.517781Z  INFO text_generation_launcher: Successfully downloaded weights.
2023-07-06T20:12:42.518379Z  INFO text_generation_launcher: Starting shard 0
2023-07-06T20:12:50.458364Z  WARN shard-manager: text_generation_launcher: We're not using custom kernels.
 rank=0
2023-07-06T20:12:51.265848Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py", line 1005, in __init__
    self.shared = TensorParallelEmbedding(prefix="shared", weights=weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 268, in __init__
    weight = weights.get_sharded(f"{prefix}.weight", dim=0)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 73, in get_sharded
    filename, tensor_name = self.get_filename(tensor_name)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 49, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight shared.weight does not exist

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 166, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 133, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 274, in get_model
    return T5Sharded(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/t5.py", line 61, in __init__
    model = T5ForConditionalGeneration(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py", line 1007, in __init__
    self.shared = TensorParallelEmbedding(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 268, in __init__
    weight = weights.get_sharded(f"{prefix}.weight", dim=0)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 73, in get_sharded
    filename, tensor_name = self.get_filename(tensor_name)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 49, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight encoder.embed_tokens.weight does not exist
 rank=0
2023-07-06T20:12:51.827130Z ERROR text_generation_launcher: Shard 0 failed to start
2023-07-06T20:12:51.827155Z ERROR text_generation_launcher: Traceback (most recent call last):

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py", line 1005, in __init__
    self.shared = TensorParallelEmbedding(prefix="shared", weights=weights)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 268, in __init__
    weight = weights.get_sharded(f"{prefix}.weight", dim=0)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 73, in get_sharded
    filename, tensor_name = self.get_filename(tensor_name)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 49, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")

RuntimeError: weight shared.weight does not exist

Error: ShardCannotStart
During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
    server.serve(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 166, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 133, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 274, in get_model
    return T5Sharded(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/t5.py", line 61, in __init__
    model = T5ForConditionalGeneration(config, weights)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py", line 1007, in __init__
    self.shared = TensorParallelEmbedding(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 268, in __init__
    weight = weights.get_sharded(f"{prefix}.weight", dim=0)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 73, in get_sharded
    filename, tensor_name = self.get_filename(tensor_name)

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 49, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")

RuntimeError: weight encoder.embed_tokens.weight does not exist

2023-07-06T20:12:51.827178Z  INFO text_generation_launcher: Shutting down shards
zoltan-fedor commented 1 year ago

Looking at it in more detail, this is the same issue as "RuntimeError: weight shared.weight does not exist" in https://github.com/huggingface/text-generation-inference/issues/541

TalhaUusuf commented 1 year ago

I am also getting the same error with the Falcon-7B model, and with most of the MPT and Falcon models.

Model: falcon-7B

RuntimeError: weight lm_head.weight does not exist

Narsil commented 1 year ago

The PR above should help. It's only a matter of weight naming.

zoltan-fedor commented 1 year ago

Thanks @Narsil , I have just tested it (with flan-t5-xl) and I can confirm that your PR (https://github.com/huggingface/text-generation-inference/pull/561 - which just got merged) has fixed this issue! Thanks!
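Since the fix is described as "a matter of weight naming", the idea can be sketched as an alias table: when a requested tensor name is missing, fall back to a known alias that points at the same (tied) tensor. A hedged, simplified sketch follows; the alias map and helper name are illustrative, not copied from the PR:

```python
# Illustrative aliases: several module paths share one stored tensor.
ALIASES = {
    "encoder.embed_tokens.weight": "shared.weight",
    "decoder.embed_tokens.weight": "shared.weight",
}
AVAILABLE = {"shared.weight": "model-00001-of-00002.safetensors"}

def resolve(tensor_name: str) -> str:
    """Return the file for a tensor, trying a known alias before failing."""
    name = tensor_name if tensor_name in AVAILABLE else ALIASES.get(tensor_name, tensor_name)
    if name not in AVAILABLE:
        raise RuntimeError(f"weight {tensor_name} does not exist")
    return AVAILABLE[name]

print(resolve("encoder.embed_tokens.weight"))  # model-00001-of-00002.safetensors
```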

r0n13 commented 1 year ago

Thanks @Narsil, it does work for me too with flan-t5, but I just tried with t5 and the problem seems to still occur.

2023-07-13T06:35:24.607879Z  INFO text_generation_launcher: Args { model_id: "t5-base", revision: None, sharded: None, num_shard: Some(1), quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: 16000, max_waiting_tokens: 20, hostname: "70904c856920", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_domain: None, ngrok_username: None, ngrok_password: None, env: false }
2023-07-13T06:35:24.608058Z  INFO text_generation_launcher: Starting download process.
2023-07-13T06:35:28.958139Z  INFO download: text_generation_launcher: Download file: model.safetensors

2023-07-13T06:35:30.635704Z  INFO download: text_generation_launcher: Downloaded /data/models--t5-base/snapshots/fe6d9bf207cd3337512ca838a8b453f87a9178ef/model.safetensors in 0:00:01.

2023-07-13T06:35:30.635867Z  INFO download: text_generation_launcher: Download: [1/1] -- ETA: 0

2023-07-13T06:35:31.326113Z  INFO text_generation_launcher: Successfully downloaded weights.
2023-07-13T06:35:31.326314Z  INFO text_generation_launcher: Starting shard 0
2023-07-13T06:35:35.984848Z  WARN shard-manager: text_generation_launcher: We're not using custom kernels.
 rank=0
2023-07-13T06:35:41.274660Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 175, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 142, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 279, in get_model
    return T5Sharded(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/t5.py", line 70, in __init__
    model = T5ForConditionalGeneration(config, weights)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py", line 1035, in __init__
    self.lm_head = TensorParallelHead.load(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 194, in load
    weight = weights.get_tensor(f"{prefix}.weight")
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 64, in get_tensor
    filename, tensor_name = self.get_filename(tensor_name)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 51, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight lm_head.weight does not exist
chumpblocckami commented 1 year ago

> Thanks @Narsil, it does work for me too with flan-t5, but I just tried with t5 and the problem seems to still occur.

Try to update docker and run latest image:

docker pull ghcr.io/huggingface/text-generation-inference:latest
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:latest --model-id google/flan-t5-base --num-shard 2

r0n13 commented 1 year ago

Thanks @chumpblocckami - I did and it does work well with the flan-t5, but not with the 'regular' t5. You can reproduce by using:

docker run --shm-size 1g  \
-p 8080:80  \
--gpus all ghcr.io/huggingface/text-generation-inference:latest \
--model-id t5-base \
--num-shard 1
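The t5-base failure looks like the same tied-weights pattern, this time for the output head: the checkpoint may have no separate lm_head.weight because it is tied to the shared input embedding. A hedged sketch of the fallback a loader would need (illustrative names, not the actual TGI implementation):

```python
# Illustrative store: the checkpoint only contains the tied embedding.
tensors = {"shared.weight": "<embedding matrix>"}

def load_head(prefix: str, tie_word_embeddings: bool):
    """Load an output head, falling back to the tied embedding if absent."""
    name = f"{prefix}.weight"
    if name not in tensors and tie_word_embeddings:
        name = "shared.weight"  # reuse the tied input embedding
    if name not in tensors:
        raise RuntimeError(f"weight {prefix}.weight does not exist")
    return tensors[name]

print(load_head("lm_head", tie_word_embeddings=True))  # <embedding matrix>
```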
r0n13 commented 1 year ago

Shall I create a separate issue for this?

donglinz commented 1 year ago

The same issue happens for OPT

RuntimeError: weight model.decoder.embed_tokens.weight does not exist
saar-eliad commented 1 year ago

Got it too, on server version 1.0.3 (using Docker) and also with latest. It fails with facebook/opt-125m but worked for me with another model, gpt2.

Jacobsolawetz commented 2 months ago

Hello @Narsil - can we rebuild this image if it hasn't been rebuilt yet?

763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.3.0-tgi2.0.2-gpu-py310-cu121-ubuntu22.04

cc @philschmid

Seeing this when working to deploy Llama-3-Instruct on SageMaker:

from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

get_huggingface_llm_image_uri("huggingface",version="2.0.2")

{
  "_name_or_path": "meta-llama/Meta-Llama-3-8B-Instruct",
  "architectures": [
    "LlamaModel"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128009,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.41.2",
  "use_cache": true,
  "vocab_size": 128256
}
philschmid commented 2 months ago

Hey @Jacobsolawetz,

Can you please try version 2.0.3?