intel / linux-npu-driver

Intel® NPU (Neural Processing Unit) Driver
MIT License

umd test #15

Closed olegmikul closed 10 months ago

olegmikul commented 11 months ago

Hi,

I have Ubuntu 22.04.3 LTS and kernels 6.2, 6.5, and 6.6. I compiled the NPU driver with the compiler successfully. The tests "ze_intel_vpu_tests" and "vpu_shared_tests" run successfully, but "./vpu-umd-test" gives me errors:

/home/username/npu-driver/validation/umd-test/testenv.hpp:26: Failure
zeInit(ZE_INIT_FLAG_VPU_ONLY)
  Which is: 0x78000001
ZE_RESULT_SUCCESS
  Which is: 0x0

Please help. Thanks!

is-qian commented 11 months ago

I have encountered the same error.

$ sudo dmesg | grep vpu
[ 2.879541] intel_vpu 0000:00:0b.0: enabling device (0000 -> 0002)
[ 2.901648] intel_vpu 0000:00:0b.0: [drm] Firmware: intel/vpu/vpu_37xx_v0.0.bin, version: 20231031MTL_CLIENT_SILICON-release2101ci_tag_mtl_pv2_vpu_rc_20231031_2101cb0b783368d
[ 3.024300] [drm] Initialized intel_vpu 1.0.0 20230117 for 0000:00:0b.0 on minor 0

$ uname -a
Linux qian 6.6.5-060605-generic #202312091251-Ubuntu SMP PREEMPT_DYNAMIC Sat Dec 9 14:02:34 UTC x86_64 x86_64 x86_64 GNU/Linux

$ ./vpu-umd-test --config=basic.yaml
[==========] Running 173 tests from 34 test suites.
[----------] Global test environment set-up.
/home/linux-npu-driver/validation/umd-test/testenv.hpp:26: Failure
Expected equality of these values:
  zeInit(ZE_INIT_FLAG_VPU_ONLY)
    Which is: 0x78000001
  ZE_RESULT_SUCCESS
    Which is: 0x0

jwludzik commented 10 months ago

If zeInit fails, it usually means that the user does not have access to the accel device set up by the intel_vpu kernel module. Please check it using:

# check which group owns the accel device (render in the output below)
$ ls -lah /dev/accel/accel0
crw-rw---- 1 root render 261, 0 Jan 10 22:23 /dev/accel/accel0
# display the Linux groups of your user
$ groups
adm cdrom sudo dip plugdev lpadmin lxd sambashare
# add your user to the group (requires root)
$ sudo usermod -a -G render <username>
# log out and log back in to apply the group change to your user
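The checks above can be wrapped in a small script. This is only a sketch: the function names `check_accel_access` and `in_group` are hypothetical helpers, not part of the driver, and the default path is the one shown in this thread.

```shell
#!/bin/sh
# Hypothetical helper: report whether the current user can open a device
# node for reading and writing. Prints "ok", "denied", or "missing".
check_accel_access() {
    node="${1:-/dev/accel/accel0}"
    if [ ! -e "$node" ]; then
        echo "missing"
        return 2
    fi
    if [ -r "$node" ] && [ -w "$node" ]; then
        echo "ok"
        return 0
    fi
    echo "denied"
    return 1
}

# Hypothetical helper: succeed if the current process already belongs to
# the named group. A group added with usermod only shows up here after
# logging out and back in.
in_group() {
    for g in $(id -nG); do
        if [ "$g" = "$1" ]; then
            return 0
        fi
    done
    return 1
}
```

Running `check_accel_access` with no argument tests /dev/accel/accel0, and `in_group render` shows whether the group change has taken effect in the current session.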

If the render group is not set on the accel device, you can set it by hand using the chown command or by adding a rule to systemd for the accel subsystem (see https://github.com/systemd/systemd/pull/27785).
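A minimal sketch of the by-hand approach follows. It assumes the target state is the `crw-rw---- root render` listing shown earlier, so it sets the mode as well as the group; a node whose mode is `crw-------` gives the group no access even when the group is render. Both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if an `ls -l` mode string such as
# "crw-rw----" grants the owning group both read and write.
group_has_rw() {
    case "$1" in
        ????rw*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: hand the node to the render group and allow group
# read/write (mode 660, i.e. crw-rw----). Requires root. Device nodes are
# recreated at boot, so a change made this way does not survive a reboot.
fix_accel_node() {
    node="${1:-/dev/accel/accel0}"
    sudo chown root:render "$node" && sudo chmod 660 "$node"
}
```

For example, `group_has_rw "$(ls -l /dev/accel/accel0 | cut -c1-10)"` checks the current mode before and after calling `fix_accel_node`.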

If that does not help, perhaps you are using an incorrect libze_loader, or libze_loader is not able to find libze_intel_vpu.so.1 on the system. This should be possible to debug by analyzing an strace log: strace -o log ./vpu-umd-test
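One way to narrow down the library question is to check where the file could be picked up from. The sketch below only walks `LD_LIBRARY_PATH`; it does not consult the linker cache or default system directories, and `find_in_ld_path` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the first LD_LIBRARY_PATH directory that
# contains the given file name; fail if no directory does.
find_in_ld_path() {
    lib="$1"
    old_ifs="$IFS"
    IFS=:
    for dir in ${LD_LIBRARY_PATH:-}; do
        if [ -n "$dir" ] && [ -e "$dir/$lib" ]; then
            IFS="$old_ifs"
            echo "$dir"
            return 0
        fi
    done
    IFS="$old_ifs"
    return 1
}
```

With the strace log from the command above, `grep -E 'libze_loader|libze_intel_vpu' log` lists the paths the process actually tried, which can be compared with the output of `find_in_ld_path libze_intel_vpu.so.1`.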

olegmikul commented 10 months ago

Thanks, I've tried:

$ ls -lah /dev/accel/accel0
crw------- 1 root root 261, 0 Jan 14 18:10 /dev/accel/accel0

I added the root group to my groups (even though that is not recommended) and checked:

$ groups uname
uname : uname root adm cdrom sudo dip plugdev render lpadmin lxd sambashare

Still the same errors. I also changed the group to render:

crw------- 1 root render 261, 0 Jan 14 18:10 /dev/accel/accel0

Still the same errors.

Using strace gives me the same error message without further details. libze_intel_vpu.so.1 is on my LD_LIBRARY_PATH. I installed libze_loader from the independent Intel git repo; let me try building it internally as a third-party dependency when compiling the driver.

olegmikul commented 10 months ago

Sorry, I was wrong about strace. Will debug it. Installing the internal libze_loader doesn't help; the error is the same.

olegmikul commented 10 months ago

Ok, after making sure the permissions for /dev/accel/accel0 were changed, I was able to run the tests: 145 passed, 24 skipped, 1 failed (Umd.ConfigurationCheck). @jwludzik - thanks for your help!

jwludzik commented 10 months ago

No problem! Closing the issue. Please reopen it or open a new one if in doubt.