intel / neural-speed

An innovative library for efficient LLM inference via low-bit quantization
https://github.com/intel/neural-speed
Apache License 2.0

Int4 dequantize kernel #313

Closed: zhewang1-intc closed this pull request 3 months ago.

zhewang1-intc commented 4 months ago

Type of Change

Feature or bug fix or documentation or others: feature
API changed or not: adds a new kernel

Description

An int4 dequantization kernel that achieves very high memory-bandwidth utilization.
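To make concrete what such a kernel computes, here is a minimal scalar sketch in Python. It is not the PR's kernel: the packing layout (two weights per byte, low nibble first), the implicit zero point of 8, the one-scale-per-group scheme and the name `dequantize_int4` are all assumptions chosen for illustration.

```python
from typing import List


def dequantize_int4(packed: bytes, scales: List[float], group_size: int = 32) -> List[float]:
    """Unpack two 4-bit weights per byte and rescale them per group.

    Hypothetical layout (assumption, not taken from the PR):
    - the low nibble holds the even-indexed element, the high nibble the odd one;
    - nibbles are unsigned 0..15 with an implicit zero point of 8;
    - one floating-point scale covers `group_size` consecutive elements.
    """
    n = len(packed) * 2
    if len(scales) * group_size < n:
        raise ValueError("not enough scales for the packed data")
    out = [0.0] * n
    for i, byte in enumerate(packed):
        lo = byte & 0x0F          # even element
        hi = (byte >> 4) & 0x0F   # odd element
        for j, q in ((2 * i, lo), (2 * i + 1, hi)):
            # Subtract the zero point, then apply the group's scale.
            out[j] = (q - 8) * scales[j // group_size]
    return out


if __name__ == "__main__":
    print(dequantize_int4(bytes([0x98, 0x0F]), [0.5], group_size=4))
```

Each output element costs only a shift, a mask, a subtract and a multiply, so a real kernel doing this is limited by memory traffic rather than arithmetic, which is why bandwidth utilization is the relevant metric.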

MTL (Meteor Lake): the kernel reaches ~85 GB/s as measured by VTune, against a hardware maximum of ~85 GB/s as measured by clpeak, i.e. nearly 100% utilization.

ARC 770M: at least 90% VRAM bandwidth utilization.
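The utilization figures above are achieved bandwidth divided by peak bandwidth. The sketch below shows that arithmetic for a dequantize pass. The traffic model (packed int4 reads, one 2-byte scale per group, fp16 output writes) and the helper names are assumptions for illustration; the PR does not state its exact data layout or how VTune attributes traffic.

```python
def dequant_bytes_moved(n_elems: int, group_size: int = 32,
                        scale_bytes: int = 2, out_bytes: int = 2) -> int:
    """Bytes a dequantize pass touches in memory under an assumed layout.

    Reads ceil(n/2) bytes of packed int4 weights plus one scale per group,
    and writes n outputs of `out_bytes` each (2 bytes for fp16).
    """
    packed = (n_elems + 1) // 2
    scales = -(-n_elems // group_size) * scale_bytes  # ceil(n / group_size) scales
    return packed + scales + n_elems * out_bytes


def effective_bandwidth_gbs(bytes_moved: int, seconds: float) -> float:
    """Achieved bandwidth in GB/s (decimal gigabytes)."""
    return bytes_moved / seconds / 1e9


def utilization(achieved_gbs: float, peak_gbs: float) -> float:
    """Fraction of the hardware peak (e.g. as measured by clpeak)."""
    return achieved_gbs / peak_gbs


if __name__ == "__main__":
    # The MTL numbers reported in this PR: ~85 GB/s achieved vs ~85 GB/s peak.
    print(utilization(85.0, 85.0))
    # Traffic for 4096 weights with the assumed layout.
    print(dequant_bytes_moved(4096))
```

Under this assumed layout the fp16 writes are four times the size of the packed int4 reads, so a dequantize kernel's traffic is dominated by its output stores.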

Expected Behavior & Potential Risk

the expected behavior triggered by this PR

How has this PR been tested?

how to reproduce the test (including hardware information)

Dependency Change?

any library dependency introduced or removed