Open kyang-06 opened 3 months ago
Thanks for pointing this out, we'll investigate
I've solved this issue. It happens because the neural compressor fails to recognize the linear layers inside the customized llama attention. Fixed by inheriting from `nn.Linear` (along with a modified `__init__`) instead of `nn.Module` at
https://github.com/intel/intel-npu-acceleration-library/blob/f04d499432fec85afd65532711224df7be76d6dc/intel_npu_acceleration_library/nn/linear.py#L16C1-L18C1
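A minimal sketch of the change, assuming the library's `Linear` currently subclasses `torch.nn.Module` and receives pre-built `weight`/`bias` tensors in its `__init__` (the exact signature in `nn/linear.py` may differ):

```python
import torch


class Linear(torch.nn.Linear):
    """NPU Linear layer that subclasses nn.Linear instead of nn.Module,
    so quantization passes that look for nn.Linear instances (e.g. the
    neural compressor) still recognize it as a linear layer."""

    def __init__(self, weight: torch.Tensor, bias: torch.Tensor = None):
        # Set up in_features/out_features via nn.Linear, then replace the
        # freshly allocated parameters with the existing tensors.
        super().__init__(weight.shape[1], weight.shape[0], bias=bias is not None)
        self.weight = torch.nn.Parameter(weight)
        if bias is not None:
            self.bias = torch.nn.Parameter(bias)

    # The existing NPU-backed forward() of the original class stays
    # unchanged; only the base class and __init__ need to change.
```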
Describe the bug
When loading TinyLlama or Llama-3-8B with `dtype=int4`, the model structure shows that `q_proj`, `kv_proj`, and `o_proj` are all of the `Linear` class in fp16 format, whereas `QuantizedLinear` is expected. I think this is a bug, and it has a negative impact on speed and quantization accuracy. Phi-3 int4 quantization, by contrast, works properly.

To Reproduce
Expected behavior
Desktop (please complete the following information):