intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Does IPEX-LLM support Flash Attention ? #11578

Open wallacezq opened 1 month ago

wallacezq commented 1 month ago

Hi, I encountered the following error message when trying to enable flash attention with the command below. Can I know whether flash attention is supported?

command: ./main -m $model -n 128 --prompt "${prompt_1024_128}" -t 8 -e -ngl 999 --color --ctx-size 1024 --no-mmap --temp 0 -fa

ggml_backend_sycl_graph_compute: error: op not supported node_18 (FLASH_ATTN_EXT)

See the truncated log below:

llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token  = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 4 SYCL devices:
ID  Device Type         Name                         Version  Max compute units  Max work group  Max sub group  Global mem size  Driver version
0   [level_zero:gpu:0]  Intel Arc Graphics           1.3      128                1024            32             94386M           1.3.29735
1   [opencl:gpu:0]      Intel Arc Graphics           3.0      128                1024            32             94386M           24.22.29735.20
2   [opencl:cpu:0]      Intel Core Ultra 7 155H      3.0      22                 8192            64             100912M          2023.16.12.0.12_195853.xmain-hotfix
3   [opencl:acc:0]      Intel FPGA Emulation Device  1.2      22                 67108864        64             100912M          2023.16.12.0.12_195853.xmain-hotfix

ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:128
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: SYCL0 buffer size = 4403.49 MiB
llm_load_tensors: SYCL_Host buffer size = 281.81 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 1024
llama_new_context_with_model: n_batch = 1024
llama_new_context_with_model: n_ubatch = 1024
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: SYCL0 KV buffer size = 128.00 MiB
llama_new_context_with_model: KV self size = 128.00 MiB, K (f16): 64.00 MiB, V (f16): 64.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.49 MiB
llama_new_context_with_model: SYCL0 compute buffer size = 517.00 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 20.01 MiB
llama_new_context_with_model: graph nodes = 903
llama_new_context_with_model: graph splits = 2
ggml_backend_sycl_graph_compute: error: op not supported node_18 (FLASH_ATTN_EXT)
GGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/llama-cpp-bigdl/ggml-sycl.cpp:17765: ok
./benchmark_llama-cpp.sh: line 24: 16616 Aborted (core dumped) ./main -m $model -n 128 --prompt "${promt_1024_128}" -t 8 -e -ngl 999 --color --ctx-size 1024 --no-mmap --temp 0 -fa

cyita commented 1 month ago

Hi, flash attention is not supported in llama.cpp with IPEX-LLM. We added an optimized attention implementation for the SYCL backend, and it is enabled automatically when the provided model is supported.
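
For reference, a minimal workaround sketch (assuming the same $model and ${prompt_1024_128} variables as in the original command): drop the -fa flag so the run no longer requests the unsupported FLASH_ATTN_EXT op; the SYCL-optimized attention is then applied automatically when the model is supported.

command: ./main -m $model -n 128 --prompt "${prompt_1024_128}" -t 8 -e -ngl 999 --color --ctx-size 1024 --no-mmap --temp 0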

wallace-lee commented 1 month ago

> Hi, flash attention is not supported in llama.cpp with IPEX-LLM. We added an optimized attention implementation for the SYCL backend, and it is enabled automatically when the provided model is supported.

I see. Can I know how to tell whether a model is supported by the optimized attention for the SYCL backend?

cyita commented 1 month ago

Here is a list of models we have already verified:

[image: table of verified models]