oneapi-src / level-zero

oneAPI Level Zero Specification Headers and Loader
https://spec.oneapi.com/versions/latest/elements/l0/source/index.html
MIT License

`zesMemoryGetState` only works under root user #133

Closed by notsyncing 5 months ago

notsyncing commented 7 months ago

Hello, I'm trying the free global memory query described here: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/supported/sycl_ext_intel_device_info.md, which calls `zesMemoryGetState` under the hood. As a non-root user (already added to the video and render groups), it always returns the total memory (16225243136 bytes, which is all my VRAM) as free memory, even when something is occupying some VRAM. As root, it correctly returns the free memory (10125414400 bytes).

By the way, xpu-smi also always reports 0 MB of used memory as a non-root user, while it reports the correct 6632 MB as root.

Is this behavior by design, or is it a bug? Thanks!

environment:

Fedora Silverblue 39
linux kernel 6.6.13-200.fc39.x86_64
oneapi-basekit 2024.0
oneapi-level-zero 1.15.8-1.fc39.x86_64

eero-t commented 5 months ago

Please file a bug against the L0 backend you're using. The Intel one is here: https://github.com/intel/compute-runtime/

Also list which GPU kernel module you're using (upstream i915, i915 backport, Xe), as access rights are arbitrated by your kernel, not by the user-space driver.

Upstream kernel documentation does not mention memory info being root-only: https://docs.kernel.org/gpu/driver-uapi.html#c.drm_i915_query_memory_regions

But the kernel requires the PERFMON capability for accessing some of the metrics. I don't think it should be needed for memory, but you could try whether that's enough instead of needing full root.

Are you doing this testing directly on the host, or within a container (in which case UID mapping could be a problem)?

notsyncing commented 5 months ago

> Are you doing this testing directly on the host, or within a container (in which case UID mapping could be a problem)?

This happens both on the host and in a container.

> But the kernel requires the PERFMON capability for accessing some of the metrics. I don't think it should be needed for memory, but you could try whether that's enough instead of needing full root.

After running `setcap "cap_perfmon=ep" xpu-smi`, xpu-smi reports memory info correctly as a non-root user.
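For anyone who wants to check whether a process actually holds the capability, here is a minimal Python sketch. It assumes a Linux host where `/proc/self/status` exposes a `CapEff` line, and it uses the bit numbers from `linux/capability.h` (`CAP_SYS_ADMIN` = 21, `CAP_PERFMON` = 38). The function names are made up for illustration and are not part of Level Zero or xpu-smi. Note that file capabilities set with `setcap` apply to the binary they were set on, so this only reports on the process that runs the script.

```python
# Bit numbers as defined in linux/capability.h.
CAP_SYS_ADMIN = 21
CAP_PERFMON = 38  # available since Linux 5.8


def has_capability(cap_eff_hex: str, cap_bit: int) -> bool:
    """Return True if bit `cap_bit` is set in a hexadecimal CapEff mask."""
    return bool((int(cap_eff_hex, 16) >> cap_bit) & 1)


def read_cap_eff(status_path: str = "/proc/self/status") -> str:
    """Return the CapEff hex string of the current process (Linux only)."""
    with open(status_path) as f:
        for line in f:
            if line.startswith("CapEff:"):
                return line.split()[1]
    raise RuntimeError("CapEff not found in " + status_path)


def can_read_reliable_memory_info(cap_eff_hex: str) -> bool:
    """Hypothetical helper: True if either capability mentioned in this thread is held."""
    return (has_capability(cap_eff_hex, CAP_PERFMON)
            or has_capability(cap_eff_hex, CAP_SYS_ADMIN))
```

On a Linux machine, `can_read_reliable_memory_info(read_cap_eff())` tells you whether the current process holds one of the two capabilities.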

notsyncing commented 5 months ago

> Upstream kernel documentation does not mention memory info being root-only: https://docs.kernel.org/gpu/driver-uapi.html#c.drm_i915_query_memory_regions

Interesting, I found this in the link you posted:

In `struct drm_i915_memory_region_info`:

> `unallocated_size`
>
> Estimate of memory remaining
>
> Requires CAP_PERFMON or CAP_SYS_ADMIN to get reliable accounting. Without this (or if this is an older kernel) the value here will always equal the probed_size. Note this is only currently tracked for I915_MEMORY_CLASS_DEVICE regions (for other types the value here will always equal the probed_size).

It matches my observations perfectly. So it is actually by design. Thanks for your help!
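Given that documented fallback, a caller can treat "free equals total" from an unprivileged process as "unknown" rather than "nothing is allocated". A rough Python sketch of that check follows. The function is hypothetical; its parameter names are borrowed from the kernel struct fields, and the `privileged` flag is assumed to come from the caller's own capability check.

```python
from typing import Optional


def used_bytes(probed_size: int, unallocated_size: int,
               privileged: bool) -> Optional[int]:
    """Return used device memory in bytes, or None if the figure is not trustworthy.

    Without CAP_PERFMON or CAP_SYS_ADMIN the kernel reports
    unallocated_size == probed_size, so that combination carries no information.
    """
    if not privileged and unallocated_size == probed_size:
        return None
    return probed_size - unallocated_size
```

With the numbers from this issue, the non-root reading (16225243136 free out of 16225243136) comes back as `None`, while the root reading (10125414400 free) yields a real used-memory figure.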

eero-t commented 5 months ago

A note on using the PERFMON capability in containers: while just about any kernel in currently supported distro versions is new enough to support it, some (enterprise) setups may still run a Docker version so old that it does not support PERFMON, only the older (and much broader) SYS_ADMIN capability.