intel / neural-speed

An innovative library for efficient LLM inference via low-bit quantization
https://github.com/intel/neural-speed
Apache License 2.0

Enable runtime gpu_arch auto-select based on devices where kernels are executing for gemm_int4 tests; enable device-specific compilation using USE_XETLA (xe_lpg, xe_hpg, xe_hpc). #302

Open qgao007 opened 5 months ago

qgao007 commented 5 months ago

Type of Change

Change #1: Enable runtime gpu_arch auto-selection, based on the device the kernels execute on, for the gemm_int4 tests.
Change #2: Enable device-specific compilation via the USE_XETLA option (xe_lpg, xe_hpg, xe_hpc) to address the current issue of tests not matching the device type.

Description

A template-template class wraps the <gpu_arch, mma_engine> pair so that gpu_arch is auto-selected at runtime based on the device the kernels are executing on.
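The dispatch pattern described above can be sketched as follows. This is a minimal, self-contained illustration only: the enum values mirror the USE_XETLA options from this PR, but the kernel struct, the `dispatch_gemm_int4` function, and the string-based device lookup are all hypothetical stand-ins (real code would query the SYCL device and launch an actual GEMM).

```cpp
#include <cassert>
#include <string>

// Arch and engine tags; names mirror the USE_XETLA options in this PR,
// but the enums themselves are illustrative.
enum class gpu_arch { xe_lpg, xe_hpg, xe_hpc };
enum class mma_engine { fpu, xmx };

// Kernel templated on <gpu_arch, mma_engine>; the PR wraps such pairs so
// the matching instantiation is selected at runtime.
template <gpu_arch Arch, mma_engine Engine>
struct gemm_int4_kernel {
    static std::string run() {
        // A real kernel would launch arch-specific GEMM code here;
        // returning a tag keeps this sketch self-contained.
        if (Arch == gpu_arch::xe_lpg) return "lpg";
        if (Arch == gpu_arch::xe_hpg) return "hpg";
        return "hpc";
    }
};

// Hypothetical runtime dispatcher: the device is sketched as a string;
// real code would inspect the SYCL device the queue is bound to.
std::string dispatch_gemm_int4(const std::string& device) {
    if (device == "mtl")  // Meteor Lake -> Xe-LPG
        return gemm_int4_kernel<gpu_arch::xe_lpg, mma_engine::fpu>::run();
    if (device == "dg2")  // Arc/DG2 -> Xe-HPG
        return gemm_int4_kernel<gpu_arch::xe_hpg, mma_engine::xmx>::run();
    // Fall back to Xe-HPC (e.g. PVC-class devices).
    return gemm_int4_kernel<gpu_arch::xe_hpc, mma_engine::xmx>::run();
}
```

The key point is that each `<gpu_arch, mma_engine>` combination is still a compile-time instantiation; only the choice among the pre-instantiated kernels happens at runtime.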

A USE_XETLA option selects compilation for different devices (xe_lpg, xe_hpg, xe_hpc).
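Conceptually, a build-time device option like USE_XETLA reduces to per-arch preprocessor guards in the translation units. The macro names below are assumptions for illustration only (the PR does not spell out how USE_XETLA is surfaced to the compiler); a real build would pass the selection through CMake, e.g. something along the lines of `cmake -DUSE_XETLA=xe_hpc ..`.

```cpp
#include <string>

// Illustrative defaulting for this sketch only: pretend the build
// passed the xe_hpg selection when nothing was chosen explicitly.
#if !defined(USE_XETLA_LPG) && !defined(USE_XETLA_HPG) && !defined(USE_XETLA_HPC)
#define USE_XETLA_HPG
#endif

// Report which device arch this binary was compiled for; real code
// would instead compile only the matching kernel sources.
std::string compiled_arch() {
#if defined(USE_XETLA_LPG)
    return "xe_lpg";
#elif defined(USE_XETLA_HPG)
    return "xe_hpg";
#else
    return "xe_hpc";
#endif
}
```

Compiling only the arch that matches the target device is what avoids the "tests not matching device type" mismatch: a test binary built for xe_hpg never contains xe_hpc-only kernels.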

Expected Behavior & Potential Risk

No foreseeable risk related to CMake compilation or code execution.

How has this PR been tested?

Tested on MTL and DG2 devices.

Dependency Change?

No library dependencies changed.