intel / xpumanager

MIT License

xpumanager xpumd container fails with errors #64

vbedida79 closed this issue 1 year ago

vbedida79 commented 1 year ago

Hi, for an Intel Data Center GPU Flex 140 on OCP- with the Intel device plugins operator GPU plugin, the xpumanager daemonset with the xpumanager sidecar fails with the errors below. I used the kustomization YAML with the xpumanager master branch, the latest release v1.2.18, and v1.2.13, with the intel/xpumanager:v1.2.13 tag for the docker image.

[2023-09-12 18:50:09.947] [I] [1-1] XPUM: Init xpum library
[2023-09-12 18:50:09.947] [I] [1-1] XPU Manager:        1.2.13.20230629
[2023-09-12 18:50:09.947] [I] [1-1] Build:              aeeedfec
[2023-09-12 18:50:09.947] [I] [1-1] Level Zero: 1.9.0
[2023-09-12 18:50:09.947] [I] [1-1] xpumd core starts to initialize
[2023-09-12 18:50:09.947] [I] [1-1] initialize configuration
[2023-09-12 18:50:09.947] [I] [1-1] xpum mode: xpum
[2023-09-12 18:50:09.947] [I] [1-1] The environment variable XPUM_METRICS is detected: 0-38
[2023-09-12 18:50:09.947] [I] [1-1] initialize datalogic
[2023-09-12 18:50:09.947] [I] [1-1] initialize device manager
[2023-09-12 18:50:09.975] [E] [1-1] Failed to load msr kernel module
sh: 1: modprobe: not found
[2023-09-12 18:50:10.815] [W] [1-25] Device Intel(R) Data Center GPU Flex 1400000:3c:00.0 has no Memory Temperature capability.
[2023-09-12 18:50:10.815] [W] [1-25] Capability Memory Temperature detection returned: No temperature sensor detected
[2023-09-12 18:50:10.815] [W] [1-24] Device Intel(R) Data Center GPU Flex 1400000:37:00.0 has no Memory Temperature capability.
[2023-09-12 18:50:10.815] [W] [1-24] Capability Memory Temperature detection returned: No temperature sensor detected
[2023-09-12 18:50:10.815] [W] [1-25] Device Intel(R) Data Center GPU Flex 1400000:3c:00.0 has no Memory Bandwidth capability.
[2023-09-12 18:50:10.815] [W] [1-25] Capability Memory Bandwidth detection returned: [toGetMemoryBandwidth:1978] zesMemoryGetBandwidth-1:0x78000003
[2023-09-12 18:50:10.815] [W] [1-25] Device Intel(R) Data Center GPU Flex 1400000:3c:00.0 has no Memory Read Write Throughput capability.
[2023-09-12 18:50:10.815] [W] [1-25] Capability Memory Read Write Throughput detection returned: [toGetMemoryReadWrite:2056] zesMemoryGetBandwidth:0x78000003
[2023-09-12 18:50:10.815] [W] [1-24] Device Intel(R) Data Center GPU Flex 1400000:37:00.0 has no Memory Bandwidth capability.
[2023-09-12 18:50:10.815] [W] [1-24] Capability Memory Bandwidth detection returned: [toGetMemoryBandwidth:1978] zesMemoryGetBandwidth-1:0x78000003
[2023-09-12 18:50:10.815] [W] [1-24] Device Intel(R) Data Center GPU Flex 1400000:37:00.0 has no Memory Read Write Throughput capability.
[2023-09-12 18:50:10.815] [W] [1-24] Capability Memory Read Write Throughput detection returned: [toGetMemoryReadWrite:2056] zesMemoryGetBandwidth:0x78000003
malloc(): unaligned tcache chunk detected

Is it recommended to build a specific release image from scratch to deploy? Or are there any specific requirements that I missed in the deployment? Thank you!
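
The log above mixes prefixed xpumd lines (`[timestamp] [level] [thread] message`) with unprefixed output such as the `modprobe` and `malloc()` messages. As a triage aid, here is a minimal Python sketch that groups such a log by level and collects the unprefixed lines separately. The `summarize` helper and the `raw` bucket name are illustrative assumptions, not part of xpumanager.

```python
import re
from collections import defaultdict

# Matches lines such as:
# [2023-09-12 18:50:09.975] [E] [1-1] Failed to load msr kernel module
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<level>[A-Z])\] \[(?P<thread>[^\]]+)\] (?P<msg>.*)$"
)


def summarize(log_text):
    """Group log messages by level; lines without the prefix go under 'raw'."""
    groups = defaultdict(list)
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = LOG_LINE.match(line)
        if match:
            groups[match.group("level")].append(match.group("msg"))
        else:
            # Output that does not carry the xpumd prefix (hypothetical bucket name)
            groups["raw"].append(line)
    return dict(groups)


# A few lines copied from the log in this issue
SAMPLE = """\
[2023-09-12 18:50:09.947] [I] [1-1] initialize device manager
[2023-09-12 18:50:09.975] [E] [1-1] Failed to load msr kernel module
sh: 1: modprobe: not found
[2023-09-12 18:50:10.815] [W] [1-25] Capability Memory Temperature detection returned: No temperature sensor detected
malloc(): unaligned tcache chunk detected
"""

if __name__ == "__main__":
    summary = summarize(SAMPLE)
    # Print only errors and unprefixed lines, which are usually the first things to check
    for key in ("E", "raw"):
        for msg in summary.get(key, []):
            print(f"{key}: {msg}")
```

Separating the `[E]` and unprefixed lines from the `[W]` capability warnings makes it easier to see which messages coincide with the container exiting.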

donzh commented 1 year ago

What happens if you run only the xpumanager container on your physical machine, as described here: https://hub.docker.com/r/intel/xpumanager ?

vbedida79 commented 1 year ago

Unfortunately, since the setup is on OpenShift with a Red Hat CoreOS node, it's restrictive to deploy anything on the host node. So we deploy it via a container and schedule it on the Flex node with the GPU plugin. It also runs as a privileged container, if that helps. Is there anything else we could check via the container?

fmiao2372 commented 1 year ago

Please change SPDLOG_LEVEL to trace and provide more detailed logs. https://github.com/intel/xpumanager/blob/master/deployment/kubernetes/daemonset-intel-xpum.yaml#L31
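
To illustrate the change requested above, here is a minimal Python sketch that sets `SPDLOG_LEVEL` to `trace` on one container of an already parsed DaemonSet manifest. It assumes the variable is set as a container `env` entry in that manifest; the container name `xpumd`, the starting value `info`, and the trimmed manifest are assumptions for illustration, not copied from the linked file.

```python
import copy


def set_container_env(manifest, container_name, name, value):
    """Return a copy of a DaemonSet manifest with env var `name` set on one container."""
    patched = copy.deepcopy(manifest)
    containers = patched["spec"]["template"]["spec"]["containers"]
    for container in containers:
        if container["name"] != container_name:
            continue
        env = container.setdefault("env", [])
        for entry in env:
            if entry["name"] == name:
                # Overwrite an existing entry in place
                entry.pop("valueFrom", None)
                entry["value"] = value
                break
        else:
            # No existing entry with that name: append a new one
            env.append({"name": name, "value": value})
        return patched
    raise KeyError(f"container {container_name!r} not found")


# Hypothetical, trimmed-down manifest; the real one has many more fields
MANIFEST = {
    "kind": "DaemonSet",
    "spec": {"template": {"spec": {"containers": [
        {"name": "xpumd", "env": [{"name": "SPDLOG_LEVEL", "value": "info"}]},
    ]}}},
}

if __name__ == "__main__":
    patched = set_container_env(MANIFEST, "xpumd", "SPDLOG_LEVEL", "trace")
    print(patched["spec"]["template"]["spec"]["containers"][0]["env"])
```

The function returns a modified copy, so the original manifest stays available for comparison. Environment values in a Kubernetes manifest are strings, so the level is passed as `"trace"`.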

vbedida79 commented 1 year ago

Found the issue: it seems the permissions were not granted correctly to the pod on OpenShift. I gave it the highest permissions with privileged, and it works now. Thanks!
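
The thread does not show the exact change that was made. As a rough pre-deployment check, the following Python sketch lists containers in a parsed DaemonSet manifest that do not request `securityContext.privileged: true`. The helper name and the sample manifest (including the second container name) are hypothetical. On OpenShift, the pod's service account generally also has to be allowed to use a privileged security context constraint, which this check does not cover.

```python
def unprivileged_containers(manifest):
    """Return names of containers that do not set securityContext.privileged to true."""
    pod_spec = manifest["spec"]["template"]["spec"]
    return [
        container["name"]
        for container in pod_spec.get("containers", [])
        # `or {}` also handles a securityContext key that is present but null
        if not (container.get("securityContext") or {}).get("privileged", False)
    ]


# Hypothetical manifest with one privileged and one unprivileged container
MANIFEST = {
    "kind": "DaemonSet",
    "spec": {"template": {"spec": {"containers": [
        {"name": "xpumd", "securityContext": {"privileged": True}},
        {"name": "sidecar", "securityContext": None},
    ]}}},
}

if __name__ == "__main__":
    for name in unprivileged_containers(MANIFEST):
        print(f"container {name} does not request privileged mode")
```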