intel / intel-device-plugins-for-kubernetes

Collection of Intel device plugins for Kubernetes
Apache License 2.0

xpumanager sidecar version issues #1532

Closed vbedida79 closed 1 year ago

vbedida79 commented 1 year ago

Hi, the xpumanager sidecar kustomization at https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/deployments/xpumanager_sidecar/kustomization.yaml uses xpumanager v1.2.0_golden. Does the sidecar specifically require this version? On OCP 4.12, the xpumanager daemonset fails with createcontainererror because the command has changed in the latest intel/xpumanager image: the daemonset pulls the latest image but still uses the v1.2.0 commands. The kustomization.yaml may need a newer version, is that right? Is there a specific xpumanager version recommended for use with the sidecar? I also noticed https://github.com/intel/xpumanager/issues/64 with Intel Data Center GPU Flex 140 and the sidecar. Could this be related, or has it been seen before? Thank you!

tkatila commented 1 year ago

Currently the only use for the xpumanager sidecar is to allow GAS to schedule Pods to GPUs that are interconnected with xelinks. The Flex series doesn't benefit from that.

As for the versions: xpu-manager fixed a bug related to the xelink topology metrics in version 1.2.16, but that image hasn't been released yet. The bug is related to dynamic changes of the xelinks and is somewhat hard to reproduce, though, so 1.2.13 is ok in the interim.

For the createcontainererror, I think I've seen that and it was related to a change in the file structure in the xpu-manager container. I'll get back to this.

For https://github.com/intel/xpumanager/issues/64, I haven't seen this.

vbedida79 commented 1 year ago

Got it, thanks for clarifying. We used xpumanager v1.2.0_golden with the sidecar, GPU plugin 0.26.1, and Flex 140 to get GPU utilization metrics. To avoid the createcontainererror, we used 1.2.13 with the Docker Hub image with the same image tag, but then we see issue 64. So for Flex, can I assume just deploying the xpumanager daemonset directly should suffice. Is this correct?
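
For anyone else hitting the createcontainererror, the tag pinning described above could be sketched as a small kustomize overlay. This is only an illustration under assumptions, not a file from the repository: the overlay itself is hypothetical, `?ref=main` is an assumed ref, and the exact spelling of the 1.2.13 tag should be checked against the intel/xpumanager image on Docker Hub.

```yaml
# kustomization.yaml (hypothetical overlay, not part of the repository)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Sidecar deployment directory referenced in this issue; the ref is an assumption.
  - https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/xpumanager_sidecar?ref=main
images:
  # Pin the xpumanager image so it no longer floats to the latest tag.
  - name: intel/xpumanager
    newTag: "1.2.13"  # assumed tag spelling; verify before use
```

Note that the `images` entry only rewrites the image reference. The container command still comes from the base manifests, so the pinned tag has to be one whose file layout matches that command.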

tkatila commented 1 year ago

> So for Flex, can I assume just deploying the xpumanager daemonset directly should suffice. Is this correct?

Yes, xpumanager-sidecar doesn't add anything for the Flex series.

tkatila commented 1 year ago

@vbedida79 ok to close this?

vbedida79 commented 1 year ago

yes. thank you