huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

RuntimeError: "weight lm_head.weight does not exist" When Loading qwen2-0.5B-Instruct #2373

Open boyang-nlp opened 1 month ago

boyang-nlp commented 1 month ago

I'm experiencing an issue when loading the Qwen2-0.5B-Instruct model with TGI. The error thrown is "RuntimeError: weight lm_head.weight does not exist".

I suspect this is because the safetensors file does not preserve the tied-weights relationship between lm_head and embed_tokens, so lm_head.weight is never written to disk. It seems this can be avoided by untying lm_head and embed_tokens before calling model.save_pretrained().
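To check that suspicion, one can list the tensor names stored in the downloaded shard. A quick sketch (the file name below stands in for wherever the cached model.safetensors shard lives):

from safetensors import safe_open

# List the tensor names stored in the safetensors shard
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    names = list(f.keys())

print("lm_head.weight" in names)             # expected: False if the weights are tied
print("model.embed_tokens.weight" in names)  # expected: True

If the suspicion is right, the first print is False, which matches the error above.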

Interestingly, this problem doesn't occur with larger Qwen models such as the 7B or 72B versions (presumably because those checkpoints don't tie the embedding and output weights). I'm wondering whether this is expected behavior or an unintended bug.

I'm using the latest official image: ghcr.io/huggingface/text-generation-inference:2.2.0.
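As for the workaround mentioned above, this is roughly what I have in mind. It is an untested sketch; the output directory name is just a placeholder:

from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct", torch_dtype="auto")

# Give lm_head its own copy of the embedding weights so the two tensors no
# longer share storage, and record that in the config.
model.lm_head.weight = torch.nn.Parameter(model.model.embed_tokens.weight.detach().clone())
model.config.tie_word_embeddings = False

# With the tie broken, lm_head.weight should be written to the safetensors shards.
model.save_pretrained("qwen2-0.5b-instruct-untied", safe_serialization=True)

Pointing TGI at the resulting directory should then find lm_head.weight, at the cost of duplicating the embedding matrix on disk.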

Traceback:


2024-08-07T16:32:00.053201Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0

    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 953, in get_model
    return FlashCausalLM(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 898, in __init__
    model = model_class(prefix, config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 344, in __init__
    self.lm_head = SpeculativeHead.load(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/speculative.py", line 40, in load
    lm_head = TensorParallelHead.load(config, prefix, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 67, in load
    weight = weights.get_tensor(f"{prefix}.weight")
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 212, in get_tensor
    filename, tensor_name = self.get_filename(tensor_name)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 193, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight lm_head.weight does not exist

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 953, in get_model
    return FlashCausalLM(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 898, in __init__
    model = model_class(prefix, config, weights)

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 344, in __init__
    self.lm_head = SpeculativeHead.load(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/speculative.py", line 40, in load
    lm_head = TensorParallelHead.load(config, prefix, weights)

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 67, in load
    weight = weights.get_tensor(f"{prefix}.weight")

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 212, in get_tensor
    filename, tensor_name = self.get_filename(tensor_name)

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 193, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")

RuntimeError: weight lm_head.weight does not exist
 rank=0
2024-08-07T16:32:05.257583Z ERROR text_generation_launcher: Shard 0 failed to start
2024-08-07T16:32:05.257616Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
ErikKaum commented 1 month ago

Hi @boyang-nlp 👋

Thanks for reporting this! We're a bit constrained on bandwidth at the moment, but hopefully I can take a look next week. If in the meantime you have time to dig deeper into possible solutions, please feel free to post them here 👍

boyang-nlp commented 1 month ago

@ErikKaum Absolutely! No rush on your end, take the time you need. I'll keep looking for solutions and update you. Thanks! 👍

anmolagarwalcp810 commented 1 month ago

Hi @boyang-nlp and @ErikKaum,

We were also facing this issue with Qwen2-1.5B, and here is a temporary fix (it should also work for Qwen2-0.5B):

Start the Hugging Face Docker image (ghcr.io/huggingface/text-generation-inference:2.2.0) and, inside the container, open the speculative.py file:

vi /opt/conda/lib/python3.10/site-packages/text_generation_server/layers/speculative.py

Then, inside that file, add the following lines at line 40 (enclosed between the "FIX START" and "FIX END" comments):

import torch
...

class SpeculativeHead(torch.nn.Module):
    ...
    @staticmethod
    def load(config, prefix: str, weights):
        speculator = config.speculator
        if speculator:
            ...
        else:
            # FIX START
            # For this checkpoint, lm_head is tied to the input embeddings and
            # is not stored in the safetensors shards, so load the embedding
            # weights for the head instead.
            if config._name_or_path == "Qwen/Qwen2-1.5B":
                if prefix == "lm_head":
                    prefix = "model.embed_tokens"
            # FIX END
            lm_head = TensorParallelHead.load(config, prefix, weights)
            speculator = None
        return SpeculativeHead(lm_head, speculator)
...

The above fix works because, for Qwen2-1.5B, the input embeddings and lm_head weights are tied, so the checkpoint only stores model.embed_tokens.weight.

To verify this, we ran the following script:

from transformers import AutoModelForCausalLM
import torch

# Load the model (device_map="auto" places it on the available GPU)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-1.5B",
    torch_dtype="auto",
    device_map="auto",
)
# Check whether the output projection shares its weights with the input embeddings
print(torch.all(model.model.embed_tokens.weight == model.lm_head.weight))

The output was:

tensor(True, device='cuda:0')

This confirms that the weights of lm_head are identical to the weights of model.embed_tokens.
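Since the tie is recorded in the model config as tie_word_embeddings, a more general variant of the same patch could key off that flag instead of hard-coding the model name. This is only a sketch and we haven't tested it against other architectures:

            # FIX START (generalized)
            # When the checkpoint declares tied word embeddings, the safetensors
            # shards only contain the embedding weights, so load those for the
            # lm_head as well.
            if getattr(config, "tie_word_embeddings", False) and prefix == "lm_head":
                prefix = "model.embed_tokens"
            # FIX END
            lm_head = TensorParallelHead.load(config, prefix, weights)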