intel / intel-device-plugins-for-kubernetes

Collection of Intel device plugins for Kubernetes
Apache License 2.0

Understanding and controlling multi-GPU behavior #1815

Open frenchwr opened 2 weeks ago

frenchwr commented 2 weeks ago

Describe the support request

Hi there, this is a bit of a follow-up on my previous issue (https://github.com/intel/intel-device-plugins-for-kubernetes/issues/1769).

What is the behavior of the GPU plugin on a multi-Intel-GPU system when installed with NFD, where an app requests a GPU with (assume only the i915 driver is enabled on the host):

resources:
  limits:
    gpu.intel.com/i915: 1

For example:

  • Which GPU device will be used for the first app requesting a GPU? Is there any way to control this?

  • It appears from the docs that when using sharedDevNum=N all slots will be filled on one of the GPUs before apps are scheduled on the next GPU? Is that right?


tkatila commented 2 weeks ago

Hi @frenchwr, sorry for the delay. I forgot to answer.

What is the behavior of the GPU plugin on a multi-Intel-GPU system when installed with NFD, where an app requests a GPU with (assume only the i915 driver is enabled on the host):

We don't really support multi-GPU scenarios where there are different GPU types in one node. They are all registered under the same i915 resource name, and it isn't possible to request a specific one of them.
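To illustrate (a hypothetical node status excerpt, assuming one iGPU and one dGPU on the node and the default sharedDevNum of 1): both cards are counted under the single gpu.intel.com/i915 resource, so a request for 1 can be satisfied by either of them.

status:
  allocatable:
    gpu.intel.com/i915: "2"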

  • Which GPU device will be used for the first app requesting a GPU? Is there any way to control this?

The scheduler gives us a list of possible devices, and from that list the first n devices are selected based on how many i915 resources are requested. When used with "shared-dev-num > 1", allocationPolicy changes which GPUs are filled first.

  • It appears from the docs that when using sharedDevNum=N all slots will be filled on one of the GPUs before apps are scheduled on the next GPU? Is that right?

Depends on the policy. With the "packed" policy, the first GPU is filled, then the second, then the third, and so on. With "balanced", the first container goes to gpu1, then gpu2, gpu3, gpu1, gpu2, gpu3, gpu1 etc., assuming three GPUs on a node.
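As a rough sketch of where these knobs live, assuming the plugin is deployed through the operator's GpuDevicePlugin custom resource (the values below are only examples; a DaemonSet deployment would instead pass the corresponding command line flags):

apiVersion: deviceplugin.intel.com/v1
kind: GpuDevicePlugin
metadata:
  name: gpudeviceplugin-sample
spec:
  sharedDevNum: 10                    # each GPU is exposed as 10 i915 resources
  preferredAllocationPolicy: packed   # "packed", "balanced" or "none"
  logLevel: 2
  nodeSelector:
    intel.feature.node.kubernetes.io/gpu: "true"

With "packed" as above, the first GPU's 10 slots are consumed before containers spill over to the next GPU; "balanced" spreads them round-robin as described.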

frenchwr commented 2 weeks ago

@tkatila - no worries! I appreciate the reply.

Depends on the policy. With the "packed" policy, the first GPU is filled, then the second, then the third, and so on. With "balanced", the first container goes to gpu1, then gpu2, gpu3, gpu1, gpu2, gpu3, gpu1 etc., assuming three GPUs on a node.

The situation I'm imagining is a user with an iGPU and a dGPU on a single system. I think in most scenarios the user will prefer the dGPU to be used first, but on an example system I see the dGPU listed as card1 while the iGPU is listed as card0:

# Intel® Arc™ Pro A60M Graphics
card1                    8086:56b2                         pci:vendor=8086,device=56B2,card=0
└─renderD129

# Intel® Iris® Xe Graphics (Raptor Lake)
card0                    8086:a7a0                         pci:vendor=8086,device=A7A0,card=0
└─renderD128

Does this mean the plugin would use the iGPU first?

eero-t commented 2 weeks ago

on an example system I see the dGPU listed as card1 while the iGPU is listed as card0

Device name indexes, and device file names in general, come from sysfs and devfs, i.e. from the kernel.

Container runtimes map the whole host sysfs to containers. While it would be possible to map a device file to /dev/dri/ within the container using some other name than what the device has in sysfs, that would only cause problems, because many applications also scan sysfs.

Note: not renaming the device file names also has a problem, but that's limited to legacy media APIs: https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/cmd/gpu_plugin/README.md#issues-with-media-workloads-on-multi-gpu-setups

Does this mean the plugin would use the iGPU first?

The GPU plugin just lists the devices for the k8s scheduler as extended resources. It's the k8s scheduler that then selects one of them. So yes, the iGPU may be the first selection.

GAS (with some help from the GPU plugin) can provide extra control over that: https://github.com/intel/platform-aware-scheduling/blob/master/gpu-aware-scheduling/docs/usage.md

You could use the GAS denylist container annotation to ignore card0, ask for a resource missing from iGPUs (VRAM), or just disable the iGPU in the BIOS.
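For example, a pod could be pinned away from the iGPU roughly like this (a sketch only; gas-deny is assumed here as the denylist annotation name, so please verify it against the GAS usage doc linked above, and the image name is just a placeholder):

apiVersion: v1
kind: Pod
metadata:
  name: dgpu-workload
  annotations:
    gas-deny: "card0"           # assumed GAS denylist annotation: keep this pod's containers off card0 (the iGPU in the example)
spec:
  containers:
  - name: app
    image: my-gpu-app:latest    # placeholder image
    resources:
      limits:
        gpu.intel.com/i915: 1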

PS. @tkatila, I remember you were earlier looking into a GPU plugin option for ignoring iGPUs. Did anything come out of it? I don't see it mentioned in the GPU plugin README.

tkatila commented 2 weeks ago

You could use the GAS denylist container annotation to ignore card0, ask for a resource missing from iGPUs (VRAM), or just disable the iGPU in the BIOS.

GAS assumes a homogeneous cluster. If a node has 2 GPUs and reports 16 GB of VRAM, GAS calculates that each GPU has 8 GB of VRAM, so using VRAM as a resource doesn't work. The denylist has the issue that cards can enumerate in a different order.

PS. @tkatila, I remember you were earlier looking into a GPU plugin option for ignoring iGPUs. Did anything come out of it? I don't see it mentioned in the GPU plugin README.

Yeah, I did wonder about it. But it got buried under other things. The idea was to have two methods:

In general, I don't think one should depend on card0 being the iGPU or the other way around. They can enumerate in a different order on some boots, and then the wrong GPU would be used. Also, I'm not sure the scheduler returns the device list sorted.

eero-t commented 2 weeks ago

The denylist has the issue that cards can enumerate in a different order. ... In general, I don't think one should depend on card0 being the iGPU or the other way around. They can enumerate in a different order on some boots, and then the wrong GPU would be used.

I don't think I've ever seen an (Intel) iGPU as anything else than card0, so I thought the driver probes it before the dGPUs. Have you seen it under some other name?

Note: if there are non-Intel GPUs (with their kernel drivers) present, then it's possible another GPU driver would be loaded before the Intel one, meaning that the Intel indexes would not start from 0 => the denylist is not a general solution, just a potential workaround for this particular ticket (which I thought to be about nodes having only Intel GPUs)...

tkatila commented 2 weeks ago

I don't think I've ever seen an (Intel) iGPU as anything else than card0, so I thought the driver probes it before the dGPUs. Have you seen it under some other name?

Note: if there are non-Intel GPUs (with their kernel drivers) present, then it's possible another GPU driver would be loaded before the Intel one, meaning that the Intel indexes would not start from 0 => the denylist is not a general solution, just a potential workaround for this particular ticket (which I thought to be about nodes having only Intel GPUs)...

If only Intel cards are in the host, then yes, card0 will be Intel. I've seen cases where KVM has occupied card0 and the Intel cards have then taken card1 etc.

But the point I was trying to make was that, depending on the boot, the iGPU could be card0 or card1. At least with multiple cards, the PCI addresses of the cards can vary between boots.