intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

MTL GPU driver not shown and GPU demo crashed on Linux #11460

Open lucshi opened 1 week ago

lucshi commented 1 week ago

HW: MTL with Arc iGPU
OS: Ubuntu 22.04
Kernel: 6.5.0-41-generic
Ref: https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/install_linux_gpu.md

Problem 1: `sycl-ls` cannot find the GPU driver.
Problem 2: `demo.py` crashed. The log is attached.

```
intel-fw-gpu is already the newest version (2024.17.5-329~22.04).
intel-i915-dkms is already the newest version (1.24.2.17.240301.20+i29-1).

(llm) sdp@9049fa09fdbc:~$ source /opt/intel/oneapi/setvars.sh --force

:: initializing oneAPI environment ...
   -bash: BASH_VERSION = 5.1.16(1)-release
   args: Using "$@" for setvars.sh arguments: --force
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vtune -- latest
:: oneAPI environment initialized ::

(llm) sdp@9049fa09fdbc:~$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 7 1003H OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
```
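For context on Problem 1: the `sycl-ls` output above lists only `opencl:acc` and `opencl:cpu` devices; a working GPU driver stack would also expose a `level_zero:gpu` (and usually an `opencl:gpu`) entry. A minimal sketch that checks a captured `sycl-ls` dump for a GPU entry (the sample text is shortened from the log above; the helper name is hypothetical):

```python
import re

def gpu_backends(sycl_ls_output: str) -> list[str]:
    """Return the backend:device tags that sycl-ls reported for GPU devices."""
    # sycl-ls prefixes each device with a tag like [level_zero:gpu:0] or [opencl:cpu:1]
    tags = re.findall(r"\[([a-z_]+:[a-z]+):\d+\]", sycl_ls_output)
    return [t for t in tags if t.endswith(":gpu")]

sample = """\
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 7 1003H
"""

print(gpu_backends(sample))  # empty list -> no GPU is visible to SYCL
```

An empty result here matches the symptom reported: the iGPU never shows up, so anything that later asks for an XPU device fails at the driver layer.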

Demo crash:

```
(llm) sdp@9049fa09fdbc:~$ python demo.py
/home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''. If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
2024-06-27 22:51:53,784 - INFO - intel_extension_for_pytorch auto imported
/home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
2024-06-27 22:51:54,304 - WARNING -
WARNING: You are currently loading Falcon using legacy code contained in the model repository. Falcon has now been fully ported into the Hugging Face transformers library. For the most up-to-date and high-performance version of the Falcon model code, please update to the latest version of transformers and then load the model without the trust_remote_code=True argument.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00,  4.95s/it]
2024-06-27 22:52:04,476 - INFO - Converting the current model to sym_int4 format......
LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM_TARGET: adl [Intel(R) Core(TM) Ultra 7 1003H]
Registry and code: 13 MB
Command: python demo.py
Uptime: 17.979020 s
Segmentation fault (core dumped)
```

The same run under gdb:

```
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00,  4.25s/it]
2024-06-27 22:54:34,472 - INFO - Converting the current model to sym_int4 format......
[Detaching after vfork from child process 23148]
[New Thread 0x7fffb6fee640 (LWP 23152)]
[New Thread 0x7fffd57fb640 (LWP 23153)]
[New Thread 0x7fffd2ffa640 (LWP 23154)]
[New Thread 0x7fffd07f9640 (LWP 23155)]
[New Thread 0x7fffcdff8640 (LWP 23156)]
[New Thread 0x7fffcb7f7640 (LWP 23157)]
[New Thread 0x7fffc8ff6640 (LWP 23158)]
[New Thread 0x7fffc67f5640 (LWP 23159)]
[New Thread 0x7fffc3ff4640 (LWP 23160)]
[New Thread 0x7fffc17f3640 (LWP 23161)]
[New Thread 0x7fffbeff2640 (LWP 23162)]
[New Thread 0x7fffbe7f1640 (LWP 23163)]
[New Thread 0x7fffb9ff0640 (LWP 23164)]
[New Thread 0x7fffb77ef640 (LWP 23165)]
[New Thread 0x7ffecdf53640 (LWP 23166)]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fff005f16ab in xpu::dpcpp::initGlobalDevicePoolState() () from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so
(gdb) bt
#0  0x00007fff005f16ab in xpu::dpcpp::initGlobalDevicePoolState() () from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so
#1  0x00007ffff7c99ee8 in __pthread_once_slow (once_control=0x7fff13cbddd8, init_routine=0x7fffe0cdad50 <once_proxy>) at ./nptl/pthread_once.c:116
#2  0x00007fff005ee491 in xpu::dpcpp::dpcppGetDeviceCount(int*) () from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so
#3  0x00007fff005a8c52 in xpu::dpcpp::device_count()::{lambda()#1}::operator()() const ()
    from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so
#4  0x00007fff005a8c18 in xpu::dpcpp::device_count() () from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so
#5  0x00007fffa23be0c8 in xpu::THPModule_initExtension(_object*, _object*) ()
    from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-python.so
#6  0x000055555573950e in cfunction_vectorcall_NOARGS (func=0x7fffa2410c20, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
    at /usr/local/src/conda/python-3.11.9/Include/cpython/methodobject.h:52
#7  0x000055555574eeac in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=0x7fffa2410c20,
    tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_call.h:92
#8  PyObject_Vectorcall (callable=0x7fffa2410c20, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:299
#9  0x00005555557423b6 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:4769
#10 0x0000555555765981 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7fb07d0, tstate=0x555555ad0998 <_PyRuntime+166328>)
    at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
#11 _PyEval_Vector (kwnames=<optimized out>, argcount=0, args=0x0, locals=0x0, func=<optimized out>, tstate=0x555555ad0998 <_PyRuntime+166328>)
    at /usr/local/src/conda/python-3.11.9/Python/ceval.c:6434
#12 _PyFunction_Vectorcall (func=<optimized out>, stack=0x0, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:393
#13 0x0000555555730244 in _PyObject_VectorcallTstate (tstate=0x555555ad0998 <_PyRuntime+166328>, callable=0x7ffee5567380, args=<optimized out>, nargsf=<optimized out>,
    kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_call.h:92
#14 0x00005555557fef1c in PyObject_CallMethod (obj=<optimized out>, name=<optimized out>, format=0x7fffa23d7aea "") at /usr/local/src/conda/python-3.11.9/Objects/call.c:627
#15 0x00007fffa23bb48d in xpu::lazy_init() () from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-python.so
#16 0x00007fff005a8d86 in xpu::dpcpp::current_device() () from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so
#17 0x00007fff005ad5b6 in xpu::dpcpp::impl::DPCPPGuardImpl::getDevice() const ()
    from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/intel_extension_for_pytorch/lib/libintel-ext-pt-gpu.so
#18 0x00007fffe29b274f in at::native::to(at::Tensor const&, c10::optional, c10::optional, c10::optional, c10::optional, bool, bool, c10::optional) () from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#19 0x00007fffe37c3743 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::optional, c10::optional, c10::optional, c10::optional, bool, bool, c10::optional), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd_dtype_layout_to>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::optional, c10::optional, c10::optional, c10::optional, bool, bool, c10::optional > >, at::Tensor (at::Tensor const&, c10::optional, c10::optional, c10::optional, c10::optional, bool, bool, c10::optional)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::optional, c10::optional, c10::optional, c10::optional, bool, bool, c10::optional) ()
    from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#20 0x00007fffe3049eea in at::_ops::to_dtype_layout::call(at::Tensor const&, c10::optional, c10::optional, c10::optional, c10::optional, bool, bool, c10::optional) () from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#21 0x00007fffef1dfa19 in torch::autograd::dispatch_to(at::Tensor const&, c10::Device, bool, bool, c10::optional) ()
    from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_python.so
#22 0x00007fffef24a8ec in torch::autograd::THPVariable_to(_object*, _object*, _object*) () from /home/sdp/miniforge3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_python.so
#23 0x000055555575f1c8 in method_vectorcall_VARARGS_KEYWORDS (func=0x7ffff7104360, args=0x7ffff7fb07a8, nargsf=<optimized out>, kwnames=<optimized out>)
    at /usr/local/src/conda/python-3.11.9/Objects/descrobject.c:364
#24 0x000055555574eeac in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=0x7ffff7104360,
    tstate=0x555555ad0998 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_call.h:92
#25 PyObject_Vectorcall (callable=0x7ffff7104360, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/call.c:299
#26 0x00005555557423b6 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at /usr/local/src/conda/python-3.11.9/Python/ceval.c:4769
#27 0x0000555555783fc2 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7fb0140, tstate=0x555555ad0998 <_PyRuntime+166328>)
    at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_ceval.h:73
#28 _PyEval_Vector (kwnames=<optimized out>, argcount=<optimized out>, args=0x7fffffffc7a0, locals=0x0, func=0x7fffa85d6c00, tstate=0x555555ad0998 <_PyRuntime+166328>)
    at /usr/local/src/conda/python-3.11.9/Python/ceval.c:6434
#29 _PyFunction_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, stack=0x7fffffffc7a0, func=0x7fffa85d6c00) at /usr/local/src/conda/python-3.11.9/Objects/call.c:393
#30 _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=0x7fffffffc7a0, callable=0x7fffa85d6c00, tstate=0x555555ad0998 <_PyRuntime+166328>)
    at /usr/local/src/conda/python-3.11.9/Include/internal/pycore_call.h:92
#31 method_vectorcall (method=<optimized out>, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.9/Objects/classobject.c:89
--Type <RET> for more, q to quit, c to continue without paging--
```
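The backtrace shows the segfault happens inside IPEX's lazy XPU initialization (`initGlobalDevicePoolState`, reached from `THPVariable_to` when the model is first moved to the `xpu` device), i.e. in native code before Python can raise a catchable error. Until the driver is fixed, a script can at least fail soft by probing device availability up front. A minimal sketch, assuming a torch-like API; `torch.xpu` is only probed via `getattr`, so the snippet also runs where torch or IPEX is absent. Note this only helps on setups where probing raises a Python error; a hard segfault inside the native library itself cannot be intercepted from Python:

```python
def pick_device() -> str:
    """Return 'xpu' only when at least one XPU device is reported,
    otherwise fall back to 'cpu'."""
    try:
        import torch  # IPEX registers the torch.xpu namespace when importable
    except ImportError:
        return "cpu"
    xpu = getattr(torch, "xpu", None)
    try:
        if xpu is not None and xpu.device_count() > 0:
            return "xpu"
    except Exception:
        # A broken driver stack may raise during probing;
        # treat anything unexpected as "no usable GPU".
        pass
    return "cpu"

print(pick_device())
```

With the driver missing, a demo written this way would run (slowly) on CPU instead of dying in `libintel-ext-pt-gpu.so`.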

After reinstalling Level Zero, the crash changed to "Killed".
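A bare "Killed" with no Python traceback usually means the kernel OOM killer sent the process SIGKILL, which is worth ruling out separately from the driver problem. A quick check (the exact dmesg wording varies by kernel version) is to grep the kernel log; the snippet runs the same grep against a sample line so the pattern is concrete:

```shell
# On the affected machine (may require sudo):
#   sudo dmesg | grep -iE "out of memory|oom-kill|killed process"
# Example of the kind of line to look for (sample text, not from this machine):
grep -iE "out of memory|oom-kill|killed process" <<'EOF'
[12345.678901] Out of memory: Killed process 4242 (python) total-vm:20480000kB
EOF
```

If such a line appears around the crash time, the new failure is memory pressure (e.g. from loading the full model) rather than the GPU stack.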

lucshi commented 3 days ago

The root cause has been identified by Qiu, Xin: the GPU driver is not properly installed. However, MTL is so new that there is currently no good way to install the driver. After switching to Ubuntu 24.04 with kernel 6.8, the kernel driver appears to be installed, but `sycl-ls` still does not show the oneAPI GPU entry.
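For anyone hitting the same state: after a kernel/driver change, the first things to verify on Ubuntu are that the i915 (or xe) kernel driver actually created a render node and that the current user may open it; `sycl-ls` can only see the GPU once both hold and the user-space Level Zero/OpenCL packages are installed. A minimal sketch, assuming the standard Ubuntu `/dev/dri` layout and `render` group:

```shell
# Did the kernel driver create a GPU render node?
if ls /dev/dri/renderD* >/dev/null 2>&1; then
    echo "render node: present"
else
    echo "render node: missing (kernel driver not loaded, or no GPU)"
fi

# Can the current user open it? On Ubuntu the node is group-owned by 'render'.
if id -nG | tr ' ' '\n' | grep -qx render; then
    echo "render group: user is a member"
else
    echo "render group: user is NOT a member (add with: sudo gpasswd -a \$USER render, then re-login)"
fi
```

If the node is present and accessible but `sycl-ls` still shows nothing, the gap is in the user-space stack (Level Zero loader / `intel-opencl-icd`) rather than the kernel driver.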