intel / neural-speed

An innovative library for efficient LLM inference via low-bit quantization
https://github.com/intel/neural-speed
Apache License 2.0
350 stars 38 forks

XeTLA Zero-Passthrough #321

Closed DDEle closed 3 months ago

DDEle commented 3 months ago

Type of Change: Feature

API not changed

Description

Previously there was a bug in the mask load, so we had to add extra overhead to get accurate results. This PR removes that overhead.
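For readers unfamiliar with the term, here is a rough sketch of what a masked load with a zero passthrough value means. It is plain Python, not the XeTLA or ESIMD code from this PR; the `masked_load` helper and the tile width of 16 are hypothetical, chosen only to mirror the klen33 test shape, and reading the PR title this way is an assumption.

```python
def masked_load(memory, offset, width, mask, passthrough=0.0):
    """Hypothetical helper: load `width` lanes starting at `offset`.

    Lanes whose mask bit is False never touch memory and receive
    `passthrough` instead, so no separate zero-fill pass is needed.
    """
    out = []
    for lane in range(width):
        if mask[lane]:
            out.append(memory[offset + lane])
        else:
            out.append(passthrough)
    return out


# 33 keys (as in the klen33 tests) read in tiles of 16 lanes (assumed width).
keys = [float(i + 1) for i in range(33)]
tile = 16
offset = 32  # start of the third tile, which has only one valid lane

# A lane is valid only if it falls inside the key buffer.
mask = [offset + lane < len(keys) for lane in range(tile)]
last_tile = masked_load(keys, offset, tile, mask)
print(last_tile)
```

With a well-defined passthrough value the out-of-range lanes are already zero after the load, which is presumably the kind of extra work the earlier workaround had to do separately.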

Expected Behavior & Potential Risk

N/A

How has this PR been tested?

Internal IPEX CI

Performance on MTL

xetla branch

[ RUN      ] XeTLA/FMHATest.kUseBiasOFF_kSeqLastOFF_bs1_hn32_hs128_qlen1_klen33
[kernel time]The maximum gflops(GPU_time) is 9.90242
[ RUN      ] XeTLA/FMHATest.kUseBiasOFF_kSeqLastON_bs1_hn32_hs128_qlen1_klen33
[kernel time]The maximum gflops(GPU_time) is 10.4184
[ RUN      ] XeTLA/FMHATest.kUseBiasON_kSeqLastOFF_bs1_hn32_hs128_qlen1_klen33
[kernel time]The maximum gflops(GPU_time) is 10.3561
[ RUN      ] XeTLA/FMHATest.kUseBiasON_kSeqLastON_bs1_hn32_hs128_qlen1_klen33
[kernel time]The maximum gflops(GPU_time) is 10.5452
[ RUN      ] XeTLA/FMHATest.kUseBiasOFF_kSeqLastOFF_bs1_hn32_hs128_qlen1_klen1023
[kernel time]The maximum gflops(GPU_time) is 51.7042
[ RUN      ] XeTLA/FMHATest.kUseBiasOFF_kSeqLastON_bs1_hn32_hs128_qlen1_klen1023
[kernel time]The maximum gflops(GPU_time) is 49.3453
[ RUN      ] XeTLA/FMHATest.kUseBiasON_kSeqLastOFF_bs1_hn32_hs128_qlen1_klen1023
[kernel time]The maximum gflops(GPU_time) is 49.573
[ RUN      ] XeTLA/FMHATest.kUseBiasON_kSeqLastON_bs1_hn32_hs128_qlen1_klen1023
[kernel time]The maximum gflops(GPU_time) is 52.3423

This PR

[ RUN      ] XeTLA/FMHATest.kUseBiasOFF_kSeqLastOFF_bs1_hn32_hs128_qlen1_klen33
[kernel time]The maximum gflops(GPU_time) is 12.8365
[ RUN      ] XeTLA/FMHATest.kUseBiasOFF_kSeqLastON_bs1_hn32_hs128_qlen1_klen33
[kernel time]The maximum gflops(GPU_time) is 10.3975
[ RUN      ] XeTLA/FMHATest.kUseBiasON_kSeqLastOFF_bs1_hn32_hs128_qlen1_klen33
[kernel time]The maximum gflops(GPU_time) is 11.3263
[ RUN      ] XeTLA/FMHATest.kUseBiasON_kSeqLastON_bs1_hn32_hs128_qlen1_klen33
[kernel time]The maximum gflops(GPU_time) is 10.0362
[ RUN      ] XeTLA/FMHATest.kUseBiasOFF_kSeqLastOFF_bs1_hn32_hs128_qlen1_klen1023
[kernel time]The maximum gflops(GPU_time) is 56.154
[ RUN      ] XeTLA/FMHATest.kUseBiasOFF_kSeqLastON_bs1_hn32_hs128_qlen1_klen1023
[kernel time]The maximum gflops(GPU_time) is 58.3497
[ RUN      ] XeTLA/FMHATest.kUseBiasON_kSeqLastOFF_bs1_hn32_hs128_qlen1_klen1023
[kernel time]The maximum gflops(GPU_time) is 55.4583
[ RUN      ] XeTLA/FMHATest.kUseBiasON_kSeqLastON_bs1_hn32_hs128_qlen1_klen1023
[kernel time]The maximum gflops(GPU_time) is 58.2443
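To make the two logs easier to compare, the following sketch computes the per-case ratio of the GFLOPS figures above (this PR divided by the xetla branch). The numbers are copied from the logs; the dictionary layout and the `speedups` helper are only illustrative.

```python
# GFLOPS values copied from the two test logs above.
# Key: (kUseBias, kSeqLast, klen); every case uses bs1, hn32, hs128, qlen1.
xetla_branch = {
    (False, False, 33): 9.90242,
    (False, True, 33): 10.4184,
    (True, False, 33): 10.3561,
    (True, True, 33): 10.5452,
    (False, False, 1023): 51.7042,
    (False, True, 1023): 49.3453,
    (True, False, 1023): 49.573,
    (True, True, 1023): 52.3423,
}
this_pr = {
    (False, False, 33): 12.8365,
    (False, True, 33): 10.3975,
    (True, False, 33): 11.3263,
    (True, True, 33): 10.0362,
    (False, False, 1023): 56.154,
    (False, True, 1023): 58.3497,
    (True, False, 1023): 55.4583,
    (True, True, 1023): 58.2443,
}


def speedups(before, after):
    """Return after/before for every test case present in `before`."""
    return {case: after[case] / before[case] for case in before}


ratios = speedups(xetla_branch, this_pr)
for (bias, seq_last, klen), ratio in ratios.items():
    print(f"bias={bias!s:5} seq_last={seq_last!s:5} klen={klen:4d}  x{ratio:.3f}")
```

On these figures every klen1023 case improves, while two of the klen33 cases (both with kSeqLastON) come out marginally lower. The logs do not say how much run-to-run variation to expect, so small differences should be read with care.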

Dependency Change?

No

DDEle commented 3 months ago

Ready to merge, as the internal IPEX PR merges.

sunjiweiswift commented 3 months ago

Using the ESIMD API has achieved significant performance improvements.