intel / neural-speed

An innovative library for efficient LLM inference via low-bit quantization
https://github.com/intel/neural-speed
Apache License 2.0

BF16 Compute DType on AVX512 ISA #308

Open Alavandar08 opened 4 months ago

Alavandar08 commented 4 months ago

The supported weight-only quantization configurations are listed in the Bestla README.md: https://github.com/intel/neural-speed/blob/main/bestla/README.md#weight-only

Since Bestla supports BF16 as a compute dtype, I quantized the model with quantize.py (https://github.com/intel/neural-speed/blob/main/scripts/quantize.py).

Ex: python scripts/quantize.py --model_name llama2 --model_file ne-f32.bin --out_file ne-q4_j.bin --weight_dtype int4 --group_size 128 --compute_dtype bf16

During inference, I noticed that the F32 APIs are triggered for both the FP32 and BF16 compute types. One scenario is in the QKV fusion, where BTLAGemmCompF32() is called for both F32 and BF16: https://github.com/intel/neural-speed/blob/97c8190897cb923d596c599bbadc535c3d4729fd/neural_speed/core/layers/ip_fusion_qkv.cpp#L71
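
I am not sure whether this fallback is gated on the host ISA, but as a sanity check, here is a minimal standalone sketch (not neural-speed code) that lists the BF16-related extensions a Linux machine reports via /proc/cpuinfo; base AVX-512 alone ("avx512f") has no native BF16 FMA instructions, which would make an F32 path a natural choice:

```python
# Standalone sketch, not neural-speed code: list the BF16-related ISA
# extensions this machine reports. On Linux the kernel exposes them in
# /proc/cpuinfo as "avx512_bf16", "amx_bf16" and "amx_tile"; base AVX-512
# alone ("avx512f") has no native BF16 FMA instructions.
def bf16_isa_flags(path: str = "/proc/cpuinfo") -> dict:
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                flags = set(line.split(":", 1)[1].split())
                return {name: name in flags
                        for name in ("avx512f", "avx512_bf16", "amx_tile", "amx_bf16")}
    return {}

if __name__ == "__main__":
    print(bf16_isa_flags())
```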

Question 1: Can the Bestla/Neural Speed APIs be used with the BF16 compute dtype on the AVX512 ISA without falling back to F32, and which input dtypes do those APIs support?
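
As background for the question: BF16 is just FP32 with the low 16 mantissa bits dropped, so widening BF16 operands to FP32 and reusing the F32 kernels is numerically straightforward. A small standalone NumPy illustration (hypothetical helpers, not neural-speed code):

```python
# Standalone illustration, not neural-speed code: BF16 is FP32 with the low
# 16 mantissa bits dropped, so BF16 operands can be widened to FP32 and fed
# to F32 kernels when no native BF16 FMA path exists.
import numpy as np

def fp32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Round FP32 down to the upper 16 bits (round-to-nearest-even)."""
    u = x.astype(np.float32).view(np.uint32).astype(np.uint64)
    u = u + (0x7FFF + ((u >> 16) & 1))  # RNE rounding bias
    return (u >> 16).astype(np.uint16)

def bf16_bits_to_fp32(b: np.ndarray) -> np.ndarray:
    """Widen BF16 bit patterns back to FP32 by zero-filling the low 16 bits."""
    return (b.astype(np.uint32) << 16).view(np.float32)

x = np.array([1.0, 3.14159265, 1e-3], dtype=np.float32)
x_bf16 = bf16_bits_to_fp32(fp32_to_bf16_bits(x))
print(x)       # original FP32 values
print(x_bf16)  # same range, roughly 3 decimal digits of precision
```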