tranlm opened this issue 6 days ago (status: Open)
Hi @baijumeswani - I just want to confirm that I'm specifically running the example for DML.
The weights for the embedding and language modeling head (LM head) are similar as one is the transpose of the other. Some models that have very large vocabulary sizes tie the embedding and LM head weights together by saving one copy of the weights on disk. When the weights are tied, they can be stored either in the embedding or in the LM head.
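As a quick sanity check that a given checkpoint ties the two, you can look at the `tie_word_embeddings` flag in its config. This is only an illustrative check, separate from the builder:

```python
from transformers import AutoConfig

# A tied checkpoint usually sets `tie_word_embeddings: true` in config.json,
# meaning only one copy of the shared weight is saved on disk.
config = AutoConfig.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
print(config.tie_word_embeddings)
```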
The below code snippet sets the LM head's attributes from the embedding's attributes if not already set.
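(Paraphrased from the current `quantized_model.py`; the exact lines may differ from the version you are on.)

```python
# Set LM head weights + biases if not already set
if isinstance(self.lm_head, TensorModule) and self.lm_head.weight is None:
    # Embedding and LM head share same weights + biases (lm_head.weight == embedding.weight and lm_head.bias == embedding.bias)
    self.lm_head.weight = self.embedding.weight
    if self.lm_head.bias is not None:
        self.lm_head.bias = self.embedding.bias
```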
However, the reverse logic to set the embedding's attributes from the LM head's attributes is not implemented. For LLaMA-3.2, it appears that the `.safetensors` files store the embedding weights in `model.lm_head.weight` instead of `model.embed_tokens.weight`.
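One way to confirm where the tied weight actually ends up in a given checkpoint is to list the tensor names stored in the `.safetensors` shards. The file name below is a placeholder for one of the downloaded shards:

```python
from safetensors import safe_open

# Placeholder path: point this at one of the checkpoint's .safetensors files
with safe_open("model.safetensors", framework="pt") as f:
    names = list(f.keys())

print([n for n in names if "embed_tokens" in n or "lm_head" in n])
```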
To temporarily unblock you, can you add the following in `quantized_model.py` after the above code snippet?
```python
# This is a copy of the above code snippet where references to `embedding` are replaced with `lm_head`
# and references to `lm_head` are replaced with `embedding`

# Set embedding weights + biases if not already set
if isinstance(self.embedding, TensorModule) and self.embedding.weight is None:
    # LM head and embedding share same weights + biases (embedding.weight == lm_head.weight and embedding.bias == lm_head.bias)
    self.embedding.weight = self.lm_head.weight
    if self.embedding.bias is not None:
        self.embedding.bias = self.lm_head.bias
```
The logic for handling the bias needs to be revisited in both cases before merging a fix. In some models, the condition should be `if bias is None`. In other models, the condition should be `if bias is not None`. You can locally change the logic in both code snippets as needed to get the right weights and biases.
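To make the two options concrete, the bias handling would differ only in the guard around the copy. Both variants below are sketches of the alternatives described above, not a final fix:

```python
# Variant A: copy the bias only when the target module already defines one
if self.embedding.bias is not None:
    self.embedding.bias = self.lm_head.bias

# Variant B: copy the bias only when the target module is missing one
if self.embedding.bias is None:
    self.embedding.bias = self.lm_head.bias
```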
**Describe the bug**
When I run the example from `examples/python/awq-quantized-model.md`, but switch out Phi-3 for Llama-3.2-3B, I get an error message stating `AttributeError: 'NoneType' object has no attribute 'detach'`. However, when I use the extra option `exclude_embeds=true`, the ONNX conversion step runs successfully.

**To Reproduce**
Steps to reproduce the behavior:
model_name = "meta-llama/Llama-3.2-3B-Instruct"
**Expected behavior**
The conversion to ONNX should complete successfully, with no errors.
**Additional context**
I've manually tried loading the AWQ-quantized model and it looks fine. I can see the embeddings and grab them by attribute as well. Here is the output when I exclude embeddings: