Use fp32 for K*Q in Metal FA implementation

ikawrakow / ik_llama.cpp

llama.cpp fork with additional SOTA quants and improved performance

MIT License


Closed ikawrakow closed 1 month ago

ikawrakow commented 1 month ago

Without fp32 for K*Q, some models (e.g., Qwen2-7B-Instruct) produce garbage. Borrowed from PR-9595 in mainline llama.cpp.

Strangely enough, K*Q is done using fp16 in my ARM_NEON FA implementation, and it works just fine there.
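
The PR does not say why fp16 fails here. One well-known failure mode of half-precision dot products is range overflow (fp16 tops out at 65504), and the sketch below illustrates only that general effect; it is an assumption, not a diagnosis of the Metal kernel. The helper names and the values (head size 128, all entries 40.0) are hypothetical and chosen purely to make the overflow visible.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE half-precision value (inf on overflow)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE single-precision value (inf on overflow)."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def dot(k, q, rnd):
    """Dot product where every product and partial sum is rounded by rnd."""
    acc = 0.0
    for a, b in zip(k, q):
        acc = rnd(acc + rnd(rnd(a) * rnd(b)))
    return acc


# Hypothetical K row and Q row with moderately large activations.
head_dim = 128
k_row = [40.0] * head_dim
q_row = [40.0] * head_dim

kq_fp16 = dot(k_row, q_row, to_fp16)  # partial sums exceed 65504
kq_fp32 = dot(k_row, q_row, to_fp32)  # stays finite

print("fp16 K*Q:", kq_fp16)
print("fp32 K*Q:", kq_fp32)
```

Every input here is exactly representable in fp16, so the difference comes only from the precision of the products and the accumulator. Whether a given model triggers this depends on the magnitude of its K and Q activations, which may be part of why the fp16 path works in some implementations and not others.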