boyang-nlp opened 1 month ago
Hi @boyang-nlp 👋
Thanks for reporting this! I think we're a bit constrained on bandwidth at the moment, but hopefully I can take a look next week. If in the meantime you have time to dig deeper into possible solutions, please feel free to post here 👍
@ErikKaum Absolutely! No rush on your end, take the time you need. I'll keep looking for solutions and update you. Thanks! 👍
Hi @boyang-nlp and @ErikKaum,
We were also facing this issue with Qwen2-1.5B, and here is a temporary fix (it should also work for Qwen2-0.5B if the hardcoded model name is adjusted):
Open the TGI Docker image (ghcr.io/huggingface/text-generation-inference:2.2.0) and, inside the container, open the speculative.py file:
vi /opt/conda/lib/python3.10/site-packages/text_generation_server/layers/speculative.py
Inside that file, add the following lines at line 40 (enclosed between the "FIX START" and "FIX END" comments):
import torch
...
class SpeculativeHead(torch.nn.Module):
    ...
    @staticmethod
    def load(config, prefix: str, weights):
        speculator = config.speculator
        if speculator:
            ...
        else:
            # FIX START
            if config._name_or_path == "Qwen/Qwen2-1.5B":
                if prefix == "lm_head":
                    prefix = "model.embed_tokens"
            # FIX END
            lm_head = TensorParallelHead.load(config, prefix, weights)
            speculator = None
        return SpeculativeHead(lm_head, speculator)
    ...
The above fix works because, for Qwen2-1.5B, the lm_head weights are tied to the input embeddings.
To verify this, we ran the following script:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-1.5B",
    torch_dtype="auto",
    device_map="auto",
)

# True only if every element of lm_head.weight equals model.embed_tokens.weight
print(torch.all(model.model.embed_tokens.weight == model.lm_head.weight))
The output was:
tensor(True, device='cuda:0')
This means the weights of lm_head are identical to the weights of model.embed_tokens, so loading the lm_head from the model.embed_tokens prefix yields the same tensor.
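For reference, a slightly more general variant of the same patch could key off the config's tie_word_embeddings flag instead of hardcoding the model name. This is only an untested sketch of the idea (tie_word_embeddings is a standard transformers config field, set to true for Qwen2-0.5B and Qwen2-1.5B):

            # FIX START (generalized sketch, untested)
            # If the checkpoint ties lm_head to the input embeddings, the
            # safetensors file has no lm_head.weight, so fall back to the
            # embedding weights instead of matching a single model name.
            if getattr(config, "tie_word_embeddings", False) and prefix == "lm_head":
                prefix = "model.embed_tokens"
            # FIX END
            lm_head = TensorParallelHead.load(config, prefix, weights)
            speculator = None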
I'm experiencing an issue when loading the Qwen2-0.5B-Instruct model with TGI. The error thrown is "RuntimeError: weight lm_head.weight does not exist".
I suspect this is because the safetensors file does not preserve the tied 'lm_head' parameter. It seems this can be avoided by untying 'lm_head' and 'embed_tokens' before invoking 'model.save_pretrained()'.
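Roughly, the untying I have in mind looks like this (an untested sketch; it clones the shared weight into a standalone lm_head parameter and disables tying, so save_pretrained() writes lm_head.weight into the safetensors file explicitly):

from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-0.5B-Instruct",
    torch_dtype="auto",
)

# Clone the tied embedding weight into a separate lm_head parameter and
# turn off tying, so the two tensors are no longer shared when serialized.
model.lm_head.weight = torch.nn.Parameter(
    model.model.embed_tokens.weight.detach().clone()
)
model.config.tie_word_embeddings = False

model.save_pretrained("./qwen2-0.5b-instruct-untied")  # arbitrary output directory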
Interestingly, this problem doesn't occur with larger Qwen2 models such as the 7B or 72B versions. I'm wondering if this is expected behavior or an unintended bug.
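In case it helps with triage, the difference between the sizes might simply be whether the checkpoint ties 'lm_head' to the embeddings; the flag can be compared with a short script (assuming tie_word_embeddings in each config reflects the tying):

from transformers import AutoConfig

# Compare the tying flag across Qwen2 sizes; the hypothesis is that only
# checkpoints with tie_word_embeddings=True omit lm_head.weight from their
# safetensors files.
for name in [
    "Qwen/Qwen2-0.5B-Instruct",
    "Qwen/Qwen2-1.5B",
    "Qwen/Qwen2-7B",
    "Qwen/Qwen2-72B",
]:
    cfg = AutoConfig.from_pretrained(name)
    print(name, cfg.tie_word_embeddings)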
I'm using the latest official image: ghcr.io/huggingface/text-generation-inference:2.2.0.
Traceback: