intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Does IPEX-LLM support Flash Attention ? #11578

Open wallacezq opened 1 month ago

wallacezq commented 1 month ago

Hi, I encountered the following error message when trying to enable flash attention with the command below. Can I know whether flash attention is supported?

command: ./main -m $model -n 128 --prompt "${prompt_1024_128}" -t 8 -e -ngl 999 --color --ctx-size 1024 --no-mmap --temp 0 -fa

ggml_backend_sycl_graph_compute: error: op not supported node_18 (FLASH_ATTN_EXT)

See the truncated log below:

llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token  = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 4 SYCL devices:
ID  Device Type         Name                         Version  Max compute units  Max work group  Max sub group  Global mem size  Driver version
0   [level_zero:gpu:0]  Intel Arc Graphics           1.3      128                1024            32             94386M           1.3.29735
1   [opencl:gpu:0]      Intel Arc Graphics           3.0      128                1024            32             94386M           24.22.29735.20
2   [opencl:cpu:0]      Intel Core Ultra 7 155H      3.0      22                 8192            64             100912M          2023.16.12.0.12_195853.xmain-hotfix
3   [opencl:acc:0]      Intel FPGA Emulation Device  1.2      22                 67108864        64             100912M          2023.16.12.0.12_195853.xmain-hotfix

ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:128
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: SYCL0 buffer size = 4403.49 MiB
llm_load_tensors: SYCL_Host buffer size = 281.81 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 1024
llama_new_context_with_model: n_batch = 1024
llama_new_context_with_model: n_ubatch = 1024
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: SYCL0 KV buffer size = 128.00 MiB
llama_new_context_with_model: KV self size = 128.00 MiB, K (f16): 64.00 MiB, V (f16): 64.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.49 MiB
llama_new_context_with_model: SYCL0 compute buffer size = 517.00 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 20.01 MiB
llama_new_context_with_model: graph nodes = 903
llama_new_context_with_model: graph splits = 2
ggml_backend_sycl_graph_compute: error: op not supported node_18 (FLASH_ATTN_EXT)
GGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/llama-cpp-bigdl/ggml-sycl.cpp:17765: ok
./benchmark_llama-cpp.sh: line 24: 16616 Aborted (core dumped) ./main -m $model -n 128 --prompt "${promt_1024_128}" -t 8 -e -ngl 999 --color --ctx-size 1024 --no-mmap --temp 0 -fa

cyita commented 1 month ago

Hi, flash attention is not supported in llama.cpp with IPEX-LLM. We added an optimized attention implementation for the SYCL backend, and it is enabled automatically when the provided model is supported.
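
For reference, a minimal workaround sketch (assuming the same $model and ${prompt_1024_128} variables as in the original command): drop the -fa flag so the run no longer requests the unsupported FLASH_ATTN_EXT op; the SYCL-optimized attention is then applied automatically when the model is supported.

command: ./main -m $model -n 128 --prompt "${prompt_1024_128}" -t 8 -e -ngl 999 --color --ctx-size 1024 --no-mmap --temp 0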

wallace-lee commented 1 month ago

> Hi, flash attention is not supported in llama.cpp with IPEX-LLM. We added an optimized attention implementation for the SYCL backend, and it is enabled automatically when the provided model is supported.

I see. Can I know how to tell whether a model is supported by the optimized attention for the SYCL backend?

cyita commented 1 month ago

Here is a list of models we have already verified:

[image: table of verified models]