intel / neural-speed

An innovative library for efficient LLM inference via low-bit quantization
https://github.com/intel/neural-speed
Apache License 2.0

Enable runtime gpu_arch auto-select based on devices where kernels are executing for gemm_int4 tests; enable device-specific compilation using USE_XETLA (xe_lpg, xe_hpg, xe_hpc). #302

Open qgao007 opened 5 months ago

qgao007 commented 5 months ago

Type of Change

Change #1: Enable runtime gpu_arch auto-selection, based on the device the kernels execute on, for the gemm_int4 tests.
Change #2: Enable device-specific compilation via the USE_XETLA option (xe_lpg, xe_hpg, xe_hpc) to address the current issue of tests not matching the device type.

Description

A template-template class wraps the <gpu_arch, mma_engine> pair so that gpu_arch is auto-selected at runtime based on the device the kernels are executing on.
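The dispatch pattern described above can be sketched as follows. This is a minimal, self-contained illustration only: the enum values mirror the USE_XETLA options from this PR, but the kernel struct, the `dispatch_gemm_int4` function, and the string-based device lookup are all hypothetical stand-ins (real code would query the SYCL device and launch an actual GEMM).

```cpp
#include <cassert>
#include <string>

// Arch and engine tags; names mirror the USE_XETLA options in this PR,
// but the enums themselves are illustrative.
enum class gpu_arch { xe_lpg, xe_hpg, xe_hpc };
enum class mma_engine { fpu, xmx };

// Kernel templated on <gpu_arch, mma_engine>; the PR wraps such pairs so
// the matching instantiation is selected at runtime.
template <gpu_arch Arch, mma_engine Engine>
struct gemm_int4_kernel {
    static std::string run() {
        // A real kernel would launch arch-specific GEMM code here;
        // returning a tag keeps this sketch self-contained.
        if (Arch == gpu_arch::xe_lpg) return "lpg";
        if (Arch == gpu_arch::xe_hpg) return "hpg";
        return "hpc";
    }
};

// Hypothetical runtime dispatcher: the device is sketched as a string;
// real code would inspect the SYCL device the queue is bound to.
std::string dispatch_gemm_int4(const std::string& device) {
    if (device == "mtl")  // Meteor Lake -> Xe-LPG
        return gemm_int4_kernel<gpu_arch::xe_lpg, mma_engine::fpu>::run();
    if (device == "dg2")  // Arc/DG2 -> Xe-HPG
        return gemm_int4_kernel<gpu_arch::xe_hpg, mma_engine::xmx>::run();
    // Fall back to Xe-HPC (e.g. PVC-class devices).
    return gemm_int4_kernel<gpu_arch::xe_hpc, mma_engine::xmx>::run();
}
```

The key point is that each `<gpu_arch, mma_engine>` combination is still a compile-time instantiation; only the choice among the pre-instantiated kernels happens at runtime.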

A USE_XETLA option selects compilation for different devices (xe_lpg, xe_hpg, xe_hpc).
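Conceptually, a build-time device option like USE_XETLA reduces to per-arch preprocessor guards in the translation units. The macro names below are assumptions for illustration only (the PR does not spell out how USE_XETLA is surfaced to the compiler); a real build would pass the selection through CMake, e.g. something along the lines of `cmake -DUSE_XETLA=xe_hpc ..`.

```cpp
#include <string>

// Illustrative defaulting for this sketch only: pretend the build
// passed the xe_hpg selection when nothing was chosen explicitly.
#if !defined(USE_XETLA_LPG) && !defined(USE_XETLA_HPG) && !defined(USE_XETLA_HPC)
#define USE_XETLA_HPG
#endif

// Report which device arch this binary was compiled for; real code
// would instead compile only the matching kernel sources.
std::string compiled_arch() {
#if defined(USE_XETLA_LPG)
    return "xe_lpg";
#elif defined(USE_XETLA_HPG)
    return "xe_hpg";
#else
    return "xe_hpc";
#endif
}
```

Compiling only the arch that matches the target device is what avoids the "tests not matching device type" mismatch: a test binary built for xe_hpg never contains xe_hpc-only kernels.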

Expected Behavior & Potential Risk

No foreseeable risk related to CMake compilation or code execution.

How has this PR been tested?

Tested on MTL and DG2 devices.

Dependency Change?

No library dependencies changed.