intel / intel-extension-for-transformers

⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡
Apache License 2.0

run_llama: [FAIL] XTILE_DATA request failed: -1: Invalid argument #652

Closed cphoward closed 11 months ago

cphoward commented 11 months ago

I am trying to run the examples in "Run LLM with Python Script". I can quantize, but I cannot run inference with llama due to the following error:

python scripts/run.py /path/to/llama/model --weight_dtype int4 -p "She opened the door and see"
...
run_llama: [FAIL]   XTILE_DATA request failed: -1: Invalid argument

How do I overcome this error?

I am running this on Xeon Sapphire Rapids on Debian 5.10.197-1 (2023-09-29) x86_64 GNU/Linux.

Oddly, I can run inference fine for "Chat with LLaMA2", but quantization does not work.

DDEle commented 11 months ago

Hi there, it seems that you are using a CPU with AMX instructions (probably SPR) with a Linux kernel that lacks support for requesting AMX usage.

The code invokes a Linux system call to request access to the Intel® AMX features. This is done through the arch_prctl(2)-based mechanism by which applications request permission to use Intel® AMX. Specific information is described in the Linux kernel documentation.

https://www.intel.com/content/www/us/en/developer/articles/code-sample/advanced-matrix-extensions-intrinsics-functions.html (2nd section of Code sample walkthrough)

I also found some user reports that AMX is unavailable on VMs. Not sure if that is your case.

cphoward commented 11 months ago

I can confirm this was indeed due to the kernel lacking support. Using a kernel >= 5.16 did the trick. I was able to get this working on a VM.

See https://lwn.net/Articles/874846/ for kernel details.