microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime
MIT License

awq example runs into error with llama 3.2 3b due to embedding layer #1089

Open tranlm opened 6 days ago

tranlm commented 6 days ago

Describe the bug When I run the example from examples/python/awq-quantized-model.md but swap Phi-3 out for Llama-3.2-3B, I get an error stating AttributeError: 'NoneType' object has no attribute 'detach'. However, when I pass the extra option exclude_embeds=true, the ONNX conversion step runs successfully.

To Reproduce Steps to reproduce the behavior:

  1. Follow the example from examples/python/awq-quantized-model.md, but switch the model out for model_name = "meta-llama/Llama-3.2-3B-Instruct" (a sketch of the quantization step is included below).
  2. At the ONNX conversion step (after the quantization is complete), observe the error.
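
For reference, a minimal sketch of the quantization step, assuming the AutoAWQ API that the example uses (the local paths and quant_config values here are illustrative, not copied verbatim from the example):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name = "meta-llama/Llama-3.2-3B-Instruct"
quant_path = r"C:\Users\Tranl\Documents\Llama-3.2-3B-Instruct-quant"

# Load the FP16 model and tokenizer, then run 4-bit AWQ quantization
model = AutoAWQForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized checkpoint for the ONNX model builder step
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)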

Expected behavior The conversion to ONNX should complete successfully, with no errors.

Screenshots

(base) C:\Users\Tranl\Documents>python -m onnxruntime_genai.models.builder -i C:\Users\Tranl\Documents\Llama-3.2-3B-Instruct-quant -o C:\Users\Tranl\Documents\Llama-3.2-3B-Instruct-onnx -p int4 -e dml -c ..\cache_dir
Valid precision + execution provider combinations are: FP32 CPU, FP32 CUDA, FP16 CUDA, FP16 DML, INT4 CPU, INT4 CUDA, INT4 DML
GroupQueryAttention (GQA) is used in this model.
Unpacking and repacking layer 0
Unpacking and repacking layer 1
Unpacking and repacking layer 2
Unpacking and repacking layer 3
Unpacking and repacking layer 4
Unpacking and repacking layer 5
Unpacking and repacking layer 6
Unpacking and repacking layer 7
Unpacking and repacking layer 8
Unpacking and repacking layer 9
Unpacking and repacking layer 10
Unpacking and repacking layer 11
Unpacking and repacking layer 12
Unpacking and repacking layer 13
Unpacking and repacking layer 14
Unpacking and repacking layer 15
Unpacking and repacking layer 16
Unpacking and repacking layer 17
Unpacking and repacking layer 18
Unpacking and repacking layer 19
Unpacking and repacking layer 20
Unpacking and repacking layer 21
Unpacking and repacking layer 22
Unpacking and repacking layer 23
Unpacking and repacking layer 24
Unpacking and repacking layer 25
Unpacking and repacking layer 26
Unpacking and repacking layer 27
Reading embedding layer
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Tranl\AppData\Roaming\Python\Python312\site-packages\onnxruntime_genai\models\builder.py", line 3267, in <module>
    create_model(args.model_name, args.input, args.output, args.precision, args.execution_provider, args.cache_dir, **extra_options)
  File "C:\Users\Tranl\AppData\Roaming\Python\Python312\site-packages\onnxruntime_genai\models\builder.py", line 3151, in create_model
    onnx_model.make_model(input_path)
  File "C:\Users\Tranl\AppData\Roaming\Python\Python312\site-packages\onnxruntime_genai\models\builder.py", line 2058, in make_model
    self.make_embedding(module.weight.detach().numpy())
                        ^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'detach'


Additional context I've manually tried loading the AWQ-quantized model and it looks fine; I can see the embedding weights and access them by attribute as well (see the inspection sketch after the log below). Here is the output when I exclude embeddings:

(base) C:\Users\Tranl\Documents>python -m onnxruntime_genai.models.builder -i C:\Users\Tranl\Documents\Llama-3.2-3B-Instruct-quant -o C:\Users\Tranl\Documents\Llama-3.2-3B-Instruct-onnx -p int4 -e dml -c ..\cache_dir --extra_options exclude_embeds=true
Valid precision + execution provider combinations are: FP32 CPU, FP32 CUDA, FP16 CUDA, FP16 DML, INT4 CPU, INT4 CUDA, INT4 DML
Extra options: {'exclude_embeds': 'true'}
GroupQueryAttention (GQA) is used in this model.
Unpacking and repacking layer 0
Unpacking and repacking layer 1
Unpacking and repacking layer 2
Unpacking and repacking layer 3
Unpacking and repacking layer 4
Unpacking and repacking layer 5
Unpacking and repacking layer 6
Unpacking and repacking layer 7
Unpacking and repacking layer 8
Unpacking and repacking layer 9
Unpacking and repacking layer 10
Unpacking and repacking layer 11
Unpacking and repacking layer 12
Unpacking and repacking layer 13
Unpacking and repacking layer 14
Unpacking and repacking layer 15
Unpacking and repacking layer 16
Unpacking and repacking layer 17
Unpacking and repacking layer 18
Unpacking and repacking layer 19
Unpacking and repacking layer 20
Unpacking and repacking layer 21
Unpacking and repacking layer 22
Unpacking and repacking layer 23
Unpacking and repacking layer 24
Unpacking and repacking layer 25
Unpacking and repacking layer 26
Unpacking and repacking layer 27
Reading decoder layer 0
Reading decoder layer 1
Reading decoder layer 2
Reading decoder layer 3
Reading decoder layer 4
Reading decoder layer 5
Reading decoder layer 6
Reading decoder layer 7
Reading decoder layer 8
Reading decoder layer 9
Reading decoder layer 10
Reading decoder layer 11
Reading decoder layer 12
Reading decoder layer 13
Reading decoder layer 14
Reading decoder layer 15
Reading decoder layer 16
Reading decoder layer 17
Reading decoder layer 18
Reading decoder layer 19
Reading decoder layer 20
Reading decoder layer 21
Reading decoder layer 22
Reading decoder layer 23
Reading decoder layer 24
Reading decoder layer 25
Reading decoder layer 26
Reading decoder layer 27
Reading final norm
Reading LM head
Saving ONNX model in C:\Users\Tranl\Documents\Llama-3.2-3B-Instruct-onnx
2024-11-21 21:16:27,418 onnxruntime.quantization.matmul_4bits_quantizer [INFO] - skip to quantize /model/constant_nodes/TensorProto.INT64/1D/1 ...
<...etc>
<...etc>
<...etc>
2024-11-21 21:16:27,441 onnxruntime.quantization.matmul_4bits_quantizer [INFO] - skip to quantize /model/layers.28/final_norm_layernorm/SkipLayerNorm ...
2024-11-21 21:16:27,441 onnxruntime.quantization.matmul_4bits_quantizer [INFO] - start to quantize /lm_head/MatMul ...
2024-11-21 21:16:28,155 onnxruntime.quantization.matmul_4bits_quantizer [INFO] - complete quantization of /lm_head/MatMul ...
Saving GenAI config in C:\Users\Tranl\Documents\Llama-3.2-3B-Instruct-onnx
Saving processing files in C:\Users\Tranl\Documents\Llama-3.2-3B-Instruct-onnx for GenAI
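
For completeness, a rough sketch of how the saved quantized checkpoint can be inspected to see which keys actually hold the embedding and LM head weights (the file name is an assumption; AutoAWQ may shard or name the checkpoint differently):

from safetensors import safe_open

# Hypothetical checkpoint file name inside the quantized model folder
ckpt = r"C:\Users\Tranl\Documents\Llama-3.2-3B-Instruct-quant\model.safetensors"

with safe_open(ckpt, framework="pt") as f:
    for key in f.keys():
        if "embed" in key or "lm_head" in key:
            print(key, f.get_slice(key).get_shape())
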
tranlm commented 5 days ago

Hi @baijumeswani - I just want to confirm that I'm specifically running the example for dml.

kunal-vaishnavi commented 5 days ago

The weights for the embedding and the language modeling head (LM head) are closely related: one is the transpose of the other. Some models with very large vocabulary sizes tie the embedding and LM head weights together, saving only one copy of the weights on disk. When the weights are tied, they can be stored either in the embedding or in the LM head.
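
To illustrate the tying (a toy PyTorch sketch, not onnxruntime-genai code; the Llama-3.2-3B dimensions are quoted from memory):

import torch
import torch.nn as nn

vocab_size, hidden_size = 128256, 3072  # approximate Llama-3.2-3B dimensions

embedding = nn.Embedding(vocab_size, hidden_size)
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

# Tie the weights: both modules now share one (vocab_size, hidden_size) tensor,
# so only a single copy needs to be stored on disk.
lm_head.weight = embedding.weight

hidden = torch.randn(1, 4, hidden_size)
logits = lm_head(hidden)  # equivalent to hidden @ embedding.weight.T
print(logits.shape)       # torch.Size([1, 4, 128256])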

The code snippet linked below sets the LM head's attributes from the embedding's attributes if they are not already set.

https://github.com/microsoft/onnxruntime-genai/blob/17061e0b14a53b2e2a0a202f6cc15964ae2605b1/src/python/py/models/quantized_model.py#L340-L345
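
For context, the linked lines presumably mirror the snippet given further down in this comment, i.e. roughly (a reconstruction, not a verbatim copy of quantized_model.py):

# Set LM head weights + biases if not already set
if isinstance(self.lm_head, TensorModule) and self.lm_head.weight is None:
    # Embedding and LM head share same weights + biases (lm_head.weight == embedding.weight and lm_head.bias == embedding.bias)
    self.lm_head.weight = self.embedding.weight
    if self.lm_head.bias is not None:
        self.lm_head.bias = self.embedding.bias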

However, the reverse direction, setting the embedding's attributes from the LM head's attributes, is not implemented. For LLaMA-3.2, it appears that the .safetensors files store the embedding weights in model.lm_head.weight instead of model.embed_tokens.weight.

To temporarily unblock you, can you add the following in quantized_model.py after the above code snippet?

# This is a copy of the above code snippet where references to `embedding` are replaced with `lm_head`
# and references to `lm_head` are replaced with `embedding`

# Set embedding weights + biases if not already set
if isinstance(self.embedding, TensorModule) and self.embedding.weight is None:
    # LM head and embedding share same weights + biases (embedding.weight == lm_head.weight and embedding.bias == lm_head.bias)
    self.embedding.weight = self.lm_head.weight
    if self.embedding.bias is not None:
        self.embedding.bias = self.lm_head.bias

The logic for handling the bias needs to be revisited in both cases before merging a fix. In some models, the condition should be if bias is None; in other models, it should be if bias is not None. You can locally change the logic in both code snippets as needed to get the right weights and biases.
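
For instance, if the intent is to copy the LM head bias only when the embedding bias is missing, the new snippet might use the opposite condition (a sketch only; which variant is correct depends on the model):

# Set embedding weights + biases if not already set
if isinstance(self.embedding, TensorModule) and self.embedding.weight is None:
    # LM head and embedding share same weights + biases
    self.embedding.weight = self.lm_head.weight
    # Alternative condition: fill in the bias only when it is currently missing
    if self.embedding.bias is None and self.lm_head.bias is not None:
        self.embedding.bias = self.lm_head.bias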