intel / compute-runtime

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver
MIT License

Assert with Xe KMD when using -DNEO_ENABLE_XE_DRM_DETECTION=TRUE #696

Closed eero-t closed 2 months ago

eero-t commented 5 months ago

Problem

Compute-runtime's Xe KMD support does not actually work with Xe KMD; it aborts on an assert.

Details

When building the kernel from the Xe repo's default "drm-xe-next" branch (yesterday's HEAD commit): https://gitlab.freedesktop.org/drm/xe/kernel

With Xe driver enabled:

# grep _XE[^A-Z] /boot/drm_xe.config 
CONFIG_DRM_XE=m
CONFIG_DRM_XE_FORCE_PROBE=""
CONFIG_DRM_XE_JOB_TIMEOUT_MAX=10000
CONFIG_DRM_XE_JOB_TIMEOUT_MIN=1
CONFIG_DRM_XE_TIMESLICE_MAX=10000000
CONFIG_DRM_XE_TIMESLICE_MIN=1
CONFIG_DRM_XE_PREEMPT_TIMEOUT=640000
CONFIG_DRM_XE_PREEMPT_TIMEOUT_MAX=10000000
CONFIG_DRM_XE_PREEMPT_TIMEOUT_MIN=1
CONFIG_DRM_XE_ENABLE_SCHEDTIMEOUT_LIMIT=y

Booting a TGL device with the Xe driver enabled:

# dmesg | grep xe[^a-z]
[    0.000000] Command line: BOOT_IMAGE=/boot/drm_xe rootwait fsck.repair=yes i915.force_probe=!9a60 xe.force_probe=9a60 ro
[    0.038111] Kernel command line: BOOT_IMAGE=/boot/drm_xe rootwait fsck.repair=yes i915.force_probe=!9a60 xe.force_probe=9a60 ro
[    3.068875] xe 0000:00:02.0: vgaarb: deactivate vga console
[    3.198711] xe 0000:00:02.0: [drm] Using GuC firmware from i915/tgl_guc_70.bin version 70.13.1
[    3.202558] xe 0000:00:02.0: [drm] Using HuC firmware from i915/tgl_huc.bin version 7.9.3
[    3.204943] xe REG[0x2340-0x235f]: allow read access
[    3.204946] xe REG[0x7010-0x7017]: allow rw access
[    3.204947] xe REG[0x7018-0x701f]: allow rw access
[    3.204974] xe REG[0x223a8-0x223af]: allow read access
[    3.204993] xe REG[0x1c03a8-0x1c03af]: allow read access
[    3.205011] xe REG[0x1d03a8-0x1d03af]: allow read access
[    3.205030] xe REG[0x1c83a8-0x1c83af]: allow read access
[    3.212040] [drm] Initialized xe 1.1.0 20201103 for 0000:00:02.0 on minor 0
[    4.462524] xe 0000:00:02.0: [drm] GT0: suspended
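
The kernel command line above hands PCI ID 9a60 to xe and keeps i915 away from it: a plain ID in `<driver>.force_probe` asks that driver to bind the device, while a `!` prefix blocks it. A minimal Python sketch for pulling those settings out of a command line follows; the helper name `force_probe_settings` is hypothetical and not part of any driver or tool.

```python
def force_probe_settings(cmdline: str) -> dict:
    """Collect <driver>.force_probe=<ids> parameters from a kernel command line.

    IDs prefixed with '!' are listed under 'blocked', all others under 'forced'.
    """
    suffix = ".force_probe"
    settings = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if not sep or not key.endswith(suffix):
            continue
        driver = key[: -len(suffix)]
        entry = settings.setdefault(driver, {"forced": [], "blocked": []})
        # Several IDs can be given as a comma-separated list.
        for dev_id in value.split(","):
            if dev_id.startswith("!"):
                entry["blocked"].append(dev_id[1:])
            elif dev_id:
                entry["forced"].append(dev_id)
    return settings


# Command line copied from the dmesg output above.
CMDLINE = ("BOOT_IMAGE=/boot/drm_xe rootwait fsck.repair=yes "
           "i915.force_probe=!9a60 xe.force_probe=9a60 ro")

for driver, ids in force_probe_settings(CMDLINE).items():
    print(driver, ids)
```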

And using a compute stack built from the following versions:

Using options enabling Xe KMD support:

ARG ZELLO_LOC=../level_zero/tools/test/black_box_tests/zello_sysman.cpp
RUN cd compute-runtime  &&  mkdir build  &&  cd build  &&  \
    cmake -LH -Wno-dev -G Ninja \
      -DCMAKE_INSTALL_PREFIX=${INSTALL_DIR} -DCMAKE_BUILD_TYPE=Release \
      -DSUPPORT_GEN8=0 -DSUPPORT_GEN9=1 -DSUPPORT_GEN11=0 \
      -DSUPPORT_TGLLP=1 -DSUPPORT_DG1=1 -DSUPPORT_XE_HP_SDV=1 \
      -DSUPPORT_DG2=1 -DSUPPORT_PVC=1 \
      -DNEO_ENABLE_i915_PRELIM_DETECTION=TRUE \
      -DNEO_ENABLE_XE_DRM_DETECTION=TRUE \
      -DNEO_DISABLE_LD_GOLD=1 \
      -DDO_NOT_RUN_AUB_TESTS=1 -DDONT_CARE_OF_VIRTUALS=1 \
      ../  && \
    ninja  &&  ninja install  && \
    g++ -O2 -Wall -o ${INSTALL_DIR}/bin/zello_sysman $ZELLO_LOC -lze_loader -locloc

Compute-runtime and its zello_sysman tool just abort with an assert:

# docker run -it --rm --user root --network none --cap-drop ALL  --device /dev/dri:/dev/dri:rw registry/compute-tester:latest zello_sysman
ZES_ENABLE_SYSMAN environment variable Not Set
Setting the environment variable ZES_ENABLE_SYSMAN 
ZES_ENABLE_SYSMAN environment variable Set
Abort was called at 311 line in file:
/source/compute-runtime/shared/source/os_interface/linux/xe/ioctl_helper_xe.cpp

eero-t commented 5 months ago

OpenCL programs also hit the same assert, which is at this line in the repo code: https://github.com/intel/compute-runtime/blob/23.48.27912.9/shared/source/os_interface/linux/xe/ioctl_helper_xe.cpp#L311
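
The abort message carries a line number and a build-tree path, which is enough to construct a source link like the one above. The sketch below is only an illustration: `abort_permalink` is a hypothetical helper, and the resulting link points at the right code only if the binary was really built from the given tag.

```python
import re

# Matches the two-line message printed by the runtime before aborting.
ABORT_RE = re.compile(r"Abort was called at (\d+) line in file:\s*(\S+)")


def abort_permalink(message: str, tag: str, repo_dir: str = "compute-runtime") -> str:
    """Turn an abort message into a GitHub link for the given release tag."""
    match = ABORT_RE.search(message)
    if match is None:
        raise ValueError("no abort message found")
    line, path = match.groups()
    # Keep only the part of the path below the checkout directory.
    _, found, relative = path.partition("/" + repo_dir + "/")
    if not found:
        raise ValueError("path is not inside a %s checkout" % repo_dir)
    return f"https://github.com/intel/{repo_dir}/blob/{tag}/{relative}#L{line}"


# Message copied from the zello_sysman output above.
MESSAGE = (
    "Abort was called at 311 line in file:\n"
    "/source/compute-runtime/shared/source/os_interface/linux/xe/ioctl_helper_xe.cpp"
)

print(abort_permalink(MESSAGE, "23.48.27912.9"))
```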

Strace shows this memory region check issue happening at driver init time:

# ... strace -f -k zello_sysman
...
write(1, "Abort was called at 311 line in "..., 38Abort was called at 311 line in file:
) = 38
 > /usr/lib/x86_64-linux-gnu/libc.so.6(__write+0x14) [0x10bf34]
...
 > /usr/lib/x86_64-linux-gnu/libc.so.6(__printf_chk+0xab) [0x12d63b]
 > /usr/local/lib/libze_intel_gpu.so.1.3.27912(zeKernelSuggestGroupSizeTracing+0x10e822) [0x3463b2]
...
 > /usr/local/lib/libze_intel_gpu.so.1.3.27912(zeKernelSuggestGroupSizeTracing+0x36949d) [0x5a102d]
 > /usr/local/lib/libze_intel_gpu.so.1.3.27912(zetGetMetricGroupExpProcAddrTable+0x22e86) [0x11b506]
 > /usr/local/lib/libze_intel_gpu.so.1.3.27912(zetGetMetricGroupExpProcAddrTable+0x227af) [0x11ae2f]
 > /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_mutexattr_settype+0x107) [0x94817]
 > /usr/local/lib/libze_intel_gpu.so.1.3.27912(zetGetMetricGroupExpProcAddrTable+0x22a38) [0x11b0b8]
 > /usr/local/lib/libze_tracing_layer.so.1.15.8(zeGetFabricVertexExpProcAddrTable+0xdc5) [0xe835]
 > /usr/local/lib/libze_loader.so.1.15.8(loader::context_t::init_driver(loader::driver_t, unsigned int)+0x61d) [0x1f9bd]
 > /usr/local/lib/libze_loader.so.1.15.8(loader::context_t::check_drivers(unsigned int)+0x126) [0x219e6]
 > /usr/local/lib/libze_loader.so.1.15.8(ze_lib::context_t::~context_t()+0xc0) [0x1a170]
 > /usr/local/lib/libze_loader.so.1.15.8(loader::createLoaderContext()+0x174) [0x117a4]
 > /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_mutexattr_settype+0x107) [0x94817]
 > /usr/local/lib/libze_loader.so.1.15.8(zeInit+0x73) [0x11853]
 > /usr/local/bin/zello_sysman() [0xa658]

eero-t commented 5 months ago

On Arc, I've also seen a segfault instead of the assert, but it was not reproducible. Strace showed it happening with the same backtrace as the assert.

With OpenCL, strace shows the line 311 assert being reached through a different route than in the zello_sysman L0 backend backtrace above:

ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x64, 0x40, 0x28), 0x7ffe75cd81f0) = 0
 > /usr/lib/x86_64-linux-gnu/libc.so.6(ioctl+0x3f) [0x111f3f]
 > /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x505bba) [0x5ccb0a]
 > /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x51d7b7) [0x5e4707]
 > /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x51a22f) [0x5e117f]
 > /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x4fc8f5) [0x5c3845]
 > /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x1be9b) [0xe2deb]
 > /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x5105b0) [0x5d7500]
 > /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x464df7) [0x52bd47]
 > /usr/local/lib/intel-opencl/libigdrcl.so() [0x9b121]
 > /usr/local/lib/intel-opencl/libigdrcl.so() [0x9b2ae]
 > /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x46504d) [0x52bf9d]
 > /usr/local/lib/intel-opencl/libigdrcl.so(clGetExtensionFunctionAddress+0x5a6b) [0xbf7fb]
 > /usr/local/lib/intel-opencl/libigdrcl.so(clIcdGetPlatformIDsKHR+0x27) [0xbfe27]
 > /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0() [0x7f64]
 > /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0(clGetPlatformIDs+0xbb) [0x8f6b]
 > /usr/bin/clinfo() [0x97cc]
 > /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_init_first+0x90) [0x23a90]
 > /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x89) [0x23b49]
 > /usr/bin/clinfo() [0xc645]
newfstatat(1, "", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}, AT_EMPTY_PATH) = 0
 > /usr/lib/x86_64-linux-gnu/libc.so.6(fstatat+0xe) [0x10b42e]
 > /usr/lib/x86_64-linux-gnu/libc.so.6(_IO_file_doallocate+0x63) [0x78603]
 > /usr/lib/x86_64-linux-gnu/libc.so.6(_IO_doallocbuf+0x50) [0x885b0]
 > /usr/lib/x86_64-linux-gnu/libc.so.6(_IO_file_overflow+0x180) [0x87510]
 > /usr/lib/x86_64-linux-gnu/libc.so.6(_IO_file_xsputn+0x105) [0x85ce5]
 > /usr/lib/x86_64-linux-gnu/libc.so.6(parse_printf_format+0x969) [0x56929]
 > /usr/lib/x86_64-linux-gnu/libc.so.6(parse_printf_format+0x605) [0x565c5]
 > /usr/lib/x86_64-linux-gnu/libc.so.6(parse_printf_format+0xbcc) [0x56b8c]
 > /usr/lib/x86_64-linux-gnu/libc.so.6(_IO_vfprintf+0x24e5) [0x5ece5]
 > /usr/lib/x86_64-linux-gnu/libc.so.6(_IO_vfprintf+0x4341) [0x60b41]
 > /usr/lib/x86_64-linux-gnu/libc.so.6(__printf_chk+0xab) [0x12d63b]
 > /usr/local/lib/intel-opencl/libigdrcl.so() [0x9b582]
 > /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x51a555) [0x5e14a5]
 > /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x4fc8f5) [0x5c3845]
 > /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x1be9b) [0xe2deb]
 > /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x5105b0) [0x5d7500]
 > /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x464df7) [0x52bd47]
 > /usr/local/lib/intel-opencl/libigdrcl.so() [0x9b121]
 > /usr/local/lib/intel-opencl/libigdrcl.so() [0x9b2ae]
 > /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x46504d) [0x52bf9d]
 > /usr/local/lib/intel-opencl/libigdrcl.so(clGetExtensionFunctionAddress+0x5a6b) [0xbf7fb]
 > /usr/local/lib/intel-opencl/libigdrcl.so(clIcdGetPlatformIDsKHR+0x27) [0xbfe27]
 > /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0() [0x7f64]
 > /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0(clGetPlatformIDs+0xbb) [0x8f6b]
 > /usr/bin/clinfo() [0x97cc]
 > /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_init_first+0x90) [0x23a90]
 > /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x89) [0x23b49]
 > /usr/bin/clinfo() [0xc645]
write(1, "Abort was called at 311 line in "..., 38Abort was called at 311 line in file:
) = 38
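
strace could not name the last ioctl before the abort and printed it as a raw `_IOC(...)` tuple. The sketch below packs and unpacks that tuple using the generic Linux `_IOC()` bit layout (as used on x86) and the DRM constants `DRM_IOCTL_BASE` ('d') and `DRM_COMMAND_BASE` (0x40). Reading driver command 0 with a 40-byte argument as the Xe device query ioctl is an assumption based on the Xe uAPI header, not something the trace itself states.

```python
# Field layout of the generic Linux _IOC() macro (asm-generic/ioctl.h).
IOC_NRSHIFT = 0
IOC_TYPESHIFT = 8
IOC_SIZESHIFT = 16
IOC_DIRSHIFT = 30
IOC_WRITE, IOC_READ = 1, 2

DRM_IOCTL_BASE = ord("d")  # 0x64, the ioctl "type" shared by all DRM ioctls
DRM_COMMAND_BASE = 0x40    # first driver-specific DRM command number


def ioc(direction: int, ioc_type: int, nr: int, size: int) -> int:
    """Pack the four _IOC() fields into an ioctl request number."""
    return ((direction << IOC_DIRSHIFT) | (size << IOC_SIZESHIFT)
            | (ioc_type << IOC_TYPESHIFT) | (nr << IOC_NRSHIFT))


def describe(request: int) -> dict:
    """Unpack a request number; driver_nr is relative to DRM_COMMAND_BASE."""
    nr = (request >> IOC_NRSHIFT) & 0xFF
    return {
        "dir": (request >> IOC_DIRSHIFT) & 0x3,
        "type": chr((request >> IOC_TYPESHIFT) & 0xFF),
        "nr": nr,
        "driver_nr": nr - DRM_COMMAND_BASE if nr >= DRM_COMMAND_BASE else None,
        "size": (request >> IOC_SIZESHIFT) & 0x3FFF,
    }


# Values copied from the strace line: _IOC(_IOC_READ|_IOC_WRITE, 0x64, 0x40, 0x28)
request = ioc(IOC_READ | IOC_WRITE, DRM_IOCTL_BASE, 0x40, 0x28)
print(hex(request), describe(request))
```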

The Mesa driver works fine with this (last night's) Xe KMD git version.

eero-t commented 5 months ago

I also tried the older (Dec 21st) Xe KMD version recommended for media-driver in https://github.com/intel/media-driver/issues/1761

But compute-runtime tag 23.48.27912.9 and the earlier-series tag 23.43.27642.21 (which uses an older Xe uAPI, I think) still fail at init with it:

$ NEOReadDebugKeys=1 PrintDebugSettings=1 PrintDebugMessages=1 zello_sysman
ZES_ENABLE_SYSMAN environment variable Not Set
Setting the environment variable ZES_ENABLE_SYSMAN 
ZES_ENABLE_SYSMAN environment variable Set
Non-default value of debug variable: PrintDebugSettings = 1
Non-default value of debug variable: PrintDebugMessages = 1
IoctlHelperXe::IoctlHelperXe
IoctlHelperXe::initialize
 -> IoctlHelperXe::getIoctlRequestValue 0xe
 -> IoctlHelperXe::getIoctlRequestValue 0xe
DRM_XE_QUERY_CONFIG_REV_AND_DEVICE_ID   0x19a60
  REV_ID                0x1
  DEVICE_ID             0x9a60
DRM_XE_QUERY_CONFIG_FLAGS           0
  DRM_XE_QUERY_CONFIG_FLAG_HAS_VRAM OFF
DRM_XE_QUERY_CONFIG_MIN_ALIGNMENT       0x1000
DRM_XE_QUERY_CONFIG_VA_BITS     0x30
 -> IoctlHelperXe::getIoctlRequestValue 0xe
 -> IoctlHelperXe::getIoctlRequestValue 0xe
 -> IoctlHelperXe::getIoctlRequestValue 0xe
 -> IoctlHelperXe::getIoctlRequestValue 0xe
 -> IoctlHelperXe::getIoctlRequestValue 0xe
 -> IoctlHelperXe::getIoctlRequestValue 0xe
 -> IoctlHelperXe::getDrmParamValue 0x26 QueryHwconfigTable
 => IoctlHelperXe::ioctl 0xe
 -> IoctlHelperXe::ioctl Query id=0x26 f=0x0 len=0 r=0
INFO: System Info query failed!
 -> IoctlHelperXe::getDrmParamValue 0x1b ParamHasExecSoftpin
 => IoctlHelperXe::ioctl 0x3
 -> IoctlHelperXe::ioctl Getparam 0x1b/0x1 r=0
 => IoctlHelperXe::ioctl 0xd
 -> IoctlHelperXe::ioctl GemContextSetparam r=0
 -> IoctlHelperXe::getIoctlRequestValue 0xe
 -> IoctlHelperXe::getIoctlRequestValue 0xe
 -> IoctlHelperXe::getIoctlRequestValue 0xe
 -> IoctlHelperXe::getIoctlRequestValue 0xe
Abort was called at 311 line in file:
/home/nobody/source/compute-runtime/shared/source/os_interface/linux/xe/ioctl_helper_xe.cpp
Aborted (core dumped)
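
In the debug log above, the raw DRM_XE_QUERY_CONFIG_REV_AND_DEVICE_ID value 0x19a60 is printed together with its two fields. A small sketch of that split, assuming the device ID sits in the low 16 bits and the revision in the next 8 bits (the log only confirms this for the one value shown); `split_rev_and_device_id` is a hypothetical name.

```python
def split_rev_and_device_id(value: int) -> tuple:
    """Split the packed config value into (rev_id, device_id)."""
    device_id = value & 0xFFFF      # low 16 bits: PCI device ID
    rev_id = (value >> 16) & 0xFF   # assumed: next 8 bits hold the revision
    return rev_id, device_id


# Values copied from the debug log above.
rev_id, device_id = split_rev_and_device_id(0x19A60)
print(f"REV_ID {rev_id:#x}, DEVICE_ID {device_id:#x}")

# DRM_XE_QUERY_CONFIG_VA_BITS 0x30 means a 48-bit GPU virtual address space.
va_bits = 0x30
print(f"{va_bits} VA bits = {(1 << va_bits) // 2**40} TiB")
```

The device ID 0x9a60 is the same one passed to `xe.force_probe` on the kernel command line.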

The latest Mesa tag works with Xe KMD HEAD, and the linked media-driver bug gives the working combination for media.

So, which Xe KMD version does compute-runtime need?

eero-t commented 5 months ago

As the latest "compute-runtime" tag (23.52.28202.14) included some Xe KMD uAPI support updates (08f7e7be18f17a8977a9c380faa6addee9d8cf83), I built the latest of everything and tried it with the latest Xe KMD drm-xe-next upstreaming tag, drm-xe-next-fixes-2024-01-16.

Although the latest Mesa (release) and media-driver (master) now both work with that Xe KMD tag (without any additional patches), "compute-runtime" still aborts:

# strace -f -k clinfo
...
write(1, "Abort was called at 509 line in "..., 38Abort was called at 509 line in file:
) = 38
 > /usr/lib/x86_64-linux-gnu/libc.so.6(__write+0x14) [0x10bf34]
...
 > /usr/lib/x86_64-linux-gnu/libc.so.6(__printf_chk+0xab) [0x12d63b]
 > /usr/local/lib/intel-opencl/libigdrcl.so() [0x9e552]
 > /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x52805f) [0x5f24cf]
 > /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x52c71c) [0x5f6b8c]
 > /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x50ccce) [0x5d713e]
 > /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x1f1c6) [0xe9636]
 > /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x520290) [0x5ea700]
 > /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x4743c7) [0x53e837]
 > /usr/local/lib/intel-opencl/libigdrcl.so() [0x9e0f1]
 > /usr/local/lib/intel-opencl/libigdrcl.so() [0x9e27e]
 > /usr/local/lib/intel-opencl/libigdrcl.so(GTPin_Init+0x47461d) [0x53ea8d]
 > /usr/local/lib/intel-opencl/libigdrcl.so(clGetExtensionFunctionAddress+0x5a7b) [0xc2d2b]
 > /usr/local/lib/intel-opencl/libigdrcl.so(clIcdGetPlatformIDsKHR+0x27) [0xc3357]
 > /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0() [0x7f64]
 > /usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0(clGetPlatformIDs+0xbb) [0x8f6b]
 > /usr/bin/clinfo() [0x97cc]

Which Xe KMD version, patches etc. is compute-runtime supposed to work with? And which compute-runtime version, patches etc. should I use?

JablonskiMateusz commented 5 months ago

Hi @eero-t Could you try to build NEO as of https://github.com/intel/compute-runtime/commit/278ced35dc2d69323a9e2bd754e648fcdab62520 ?

eero-t commented 4 months ago

@JablonskiMateusz That commit seems to be only in the master branch, not yet in any of the tagged versions:

$ git branch --contains 278ced3
* master

As with media-driver, a master build of compute-runtime does work with Xe KMD!

Actually, both of the drivers work with both of the KMD versions from f.d.o:

However, while basic CL stuff seems to work, all Sysman metric queries return ZE_RESULT_ERROR_UNINITIALIZED (according to zello_sysman), at least on TGL iGPU.

Is there something I need to use to get at least some Sysman metrics to work, or is Xe KMD still lacking all metric support?

PS. I think this ticket should be open until:

[1] corresponding media-driver README: https://github.com/intel/media-driver/blob/master/media_softlet/linux/common/os/xe/include/README.md

eero-t commented 4 months ago

> However, while basic CL stuff seems to work, all Sysman metric queries return ZE_RESULT_ERROR_UNINITIALIZED (according to zello_sysman), at least on TGL iGPU.
>
> Is there something I need to use to get at least some Sysman metrics to work, or is Xe KMD still lacking all metric support?

With the ZELLO_SYSMAN_USE_ZESINIT=1 env var, zello_sysman reports frequency metrics for the TGL iGPU with xe KMD.

(I.e. Sysman supports xe KMD only when zesInit() is used for initializing it instead of zeInit().)

However, when querying engine metrics, there's a segfault:

# ZELLO_SYSMAN_USE_ZESINIT=1 strace -f zello_sysman -e
...
write(1, " ----  Engine tests ---- \n", 26 ----  Engine tests ---- 
) = 26
futex(0x5650fe3c3eb8, FUTEX_WAKE_PRIVATE, 2147483647) = 0
openat(AT_FDCWD, "/sys/class/drm/card0/device/vendor", O_RDONLY) = 3
read(3, "0x8086\n", 8191)               = 7
close(3)                                = 0
openat(AT_FDCWD, "/sys/module/i915/agama_version", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/sys/module/i915/srcversion", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/sys/class/drm/card0/device/subsystem_vendor", O_RDONLY) = 3
read(3, "0x8086\n", 8191)               = 7
close(3)                                = 0
write(1, "Device UUID: 0 0 0 0 0 0 0 0 0 0"..., 46Device UUID: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
) = 46
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=NULL} ---
+++ killed by SIGSEGV (core dumped) +++
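
Just before the crash, the trace shows probes of i915 module files under /sys/module/i915 that do not exist when xe is the loaded driver. Whether those failed probes are related to the segfault is not established here. The sketch below only shows one way to list failed `openat()` calls from such a trace; the regular expression and the name `failed_opens` are illustrative, not part of strace.

```python
import re

# Matches strace lines for openat() calls that returned -1, capturing path and errno.
FAILED_OPENAT_RE = re.compile(r'openat\(AT_FDCWD, "([^"]+)", [^)]*\)\s+= -1 (\w+)')


def failed_opens(trace: str) -> list:
    """Return (path, errno_name) for every failed openat() in strace output."""
    return FAILED_OPENAT_RE.findall(trace)


# Excerpt copied from the strace output above.
TRACE = r'''
openat(AT_FDCWD, "/sys/class/drm/card0/device/vendor", O_RDONLY) = 3
read(3, "0x8086\n", 8191)               = 7
close(3)                                = 0
openat(AT_FDCWD, "/sys/module/i915/agama_version", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/sys/module/i915/srcversion", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/sys/class/drm/card0/device/subsystem_vendor", O_RDONLY) = 3
'''

for path, err in failed_opens(TRACE):
    print(err, path)
```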

Those two metric types are the only ones compute-runtime supports for iGPUs, but once that segfault is fixed, I'll also try the other Sysman metrics provided by xe on some dGPU.

saik-intel commented 4 months ago

@eero-t We are looking into this and will update you when a fix is ready.

eero-t commented 4 months ago

> However, when querying engine metrics, there's a segfault:

The segfault on engine metrics query is specific to "zello_sysman" (built from the same 2024-02-09 master branch sources as the driver itself).

There's no crash with my own zesInit()-using program on Xe KMD; engine metrics just do not work: https://github.com/intel/compute-runtime/issues/707

eero-t commented 4 months ago

Tried latest Xe KMD (6.8.0-rc3) tags:

Because the latest "24.05.28454.10" release is still missing the required https://github.com/intel/compute-runtime/commit/278ced35dc2d69323a9e2bd754e648fcdab62520 commit, I again built the latest compute-runtime master.

In quick testing, the driver build seemed to work OK with the "drm-xe-next-2024-02-25" tag, except for the missing engine metrics regression (which also happens with i915) and the zello_sysman crash discussed above.

As for the "drm-xe-fixes-2024-02-29" Xe KMD, the OpenCL read/write/copy tester hung on both the TGL iGPU and Arc. When stracing the tester, it was either using 100% CPU by constantly calling sched_yield() (TGL), or nanosleeping (Arc). For now, I'm assuming the driver is not even supposed to work with that Xe KMD version...

saik-intel commented 2 months ago

With the new release it is fixed, please close.

eero-t commented 2 months ago

> With the new release it is fixed, please close.

@saik-intel I haven't yet had time to verify the latest release's functionality. I'll try to do it before the end of the week.

eero-t commented 2 months ago

Closing. In quick testing (zello_sysman + cl-mem), the latest release works with both the Xe KMD repo "drm-xe-next-2024-02-25" tag and last night's "drm-tip" HEAD kernels.