@amorenoz @devimc As a fallback mechanism, can we check that the value under /sys/bus/pci/devices/xxx/max_link_speed is >= 5.0 GT/s? That file seems to be always populated.
If that's fine, I can work on a PR.
@bpradipt If we do that, we will have to detect whether the device is an SR-IOV VF and probe the PF:
$ lspci
65:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
65:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
65:02.0 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
65:02.1 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
65:02.2 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
65:02.3 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
$ cat /sys/bus/pci/devices/0000\:65\:02.3/max_link_speed
Unknown speed
$ cat /sys/bus/pci/devices/0000\:65\:02.3/physfn/max_link_speed
8 GT/s
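A rough sketch of that fallback: read max_link_speed, and if a VF reports an unknown speed, fall back to its physfn. This is illustrative only, not the actual Kata code; the 5 GT/s threshold comes from the suggestion above.
```go
// Sketch of the max_link_speed fallback: treat the device as PCIe if it (or,
// for SR-IOV VFs, its physical function) reports a link speed >= 5 GT/s.
package pcicheck

import (
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// linkSpeedGTs returns the link speed in GT/s reported by sysfs, or an error
// if the value cannot be parsed (e.g. "Unknown speed" on a VF).
func linkSpeedGTs(devPath string) (float64, error) {
	raw, err := os.ReadFile(filepath.Join(devPath, "max_link_speed"))
	if err != nil {
		return 0, err
	}
	speed := strings.TrimSuffix(strings.TrimSpace(string(raw)), " GT/s")
	return strconv.ParseFloat(speed, 64)
}

func isPCIeBySpeed(bdf string) bool {
	devPath := filepath.Join("/sys/bus/pci/devices", bdf)
	if s, err := linkSpeedGTs(devPath); err == nil {
		return s >= 5.0
	}
	// VF case: fall back to the physical function, if there is one.
	if s, err := linkSpeedGTs(filepath.Join(devPath, "physfn")); err == nil {
		return s >= 5.0
	}
	return false
}
```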
I would ask a more general question: do we need to check whether it's PCIe at all? I think qemu should cope with a plain PCI device as well.
My guess is this is an optimisation to avoid creating the root port when it isn't necessary, so no, I suspect it's mostly fine to just plug into a PCIe root port on q35. I'd expect a PCIe device to have an Express Endpoint capability, but I don't know an easy way to check for it in /proc. I wonder if you also need to check whether the device has ioport addresses and take care when creating the root port.
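As for the ioport question, one possible (untested, illustrative) check is to parse the device's sysfs resource file and look for regions with the IORESOURCE_IO flag set; the helper name here is made up:
```go
// Illustrative check for I/O port BARs: parse the device's sysfs "resource"
// file (lines of "start end flags") and look for the IORESOURCE_IO flag.
package pcicheck

import (
	"bufio"
	"os"
	"strconv"
	"strings"
)

const ioResourceIO = 0x100 // IORESOURCE_IO flag bit used by the kernel

func hasIOPortBAR(bdf string) (bool, error) {
	f, err := os.Open("/sys/bus/pci/devices/" + bdf + "/resource")
	if err != nil {
		return false, err
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) != 3 {
			continue
		}
		flags, err := strconv.ParseUint(fields[2], 0, 64)
		if err != nil {
			continue
		}
		if flags&ioResourceIO != 0 {
			return true, nil
		}
	}
	return false, s.Err()
}
```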
@devimc @amshinde What is the use case for hotplugging a non-PCIe device on a q35 kata VM? AFAICS, the decision on the client side is:
@amorenoz For q35, we were hotplugging devices on a pcie-pci-bridge to avoid creating root ports. The functionality to hotplug devices on a pcie root port was added for devices that require a large BAR space and so as not to lose PCIe-specific functionality. Like you mentioned, we do bear the cost of creating a pcie-root-port there.
Seeing the recent issues with the complexities around the q35 hotplug code in Kata, and @devimc's recent proposal to remove the PCI scan, I think we can consider the following to resolve the issues and simplify PCIe hotplug in general:
1. Either go with @devimc's proposal to remove the PCI rescan code. This means we would incur the 5s delay for devices hotplugged on the pcie-to-pci-bridge, which is certainly not acceptable for container startup time. We would need to get rid of hotplugging devices on the pcie-to-pci-bridge and instead hotplug all devices on pcie root ports. This would mean we need to configure the pcie root ports for a pod, knowing the number of devices that need to be passed to all the containers in the pod. (In addition to VFIO, we hotplug block-based devices as well.)
2. Instead of getting rid of the rescan code altogether, how about we lazy-attach (hotplug after the rescan has run) all VFIO devices, not just large-BAR devices, on pcie root ports, while hotplugging virtio-blk/net devices on the pcie-to-pci bridge. This may solve the issues you are seeing around VFIO passthrough. We can get away without creating too many pcie root ports with this approach, as typically not many VFIO devices are passed through.
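For option 1, pre-allocating root ports essentially means adding one pcie-root-port per expected hotplugged device to the QEMU command line. A minimal sketch of what generating those arguments could look like (illustrative only; the id/chassis naming is made up, not what Kata actually emits):
```go
// Illustrative only: pre-allocating root ports boils down to adding one
// pcie-root-port per device we expect to hotplug; each port needs a unique
// chassis number on the q35 root bus (pcie.0).
package pcicheck

import "fmt"

func rootPortArgs(n int) []string {
	var args []string
	for i := 0; i < n; i++ {
		args = append(args, "-device",
			fmt.Sprintf("pcie-root-port,id=rp%d,bus=pcie.0,chassis=%d", i, i+1))
	}
	return args
}
```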
1. Either go with @devimc's proposal to remove the PCI rescan code. This means we would incur the 5s delay for devices hotplugged on the pcie-to-pci-bridge, which is certainly not acceptable for container startup time. We would need to get rid of hotplugging devices on the pcie-to-pci-bridge and instead hotplug all devices on pcie root ports. This would mean we need to configure the pcie root ports for a pod, knowing the number of devices that need to be passed to all the containers in the pod. (In addition to VFIO, we hotplug block-based devices as well.)
Some questions: In what situations are block devices hotplugged on the bridge? Are netdevs also hotplugged on the bridge right now?
2. Instead of getting rid of the rescan code altogether, how about we lazy-attach (hotplug after the rescan has run) all VFIO devices, not just large-BAR devices, on pcie root ports, while hotplugging virtio-blk/net devices on the pcie-to-pci bridge. This may solve the issues you are seeing around VFIO passthrough. We can get away without creating too many pcie root ports with this approach, as typically not many VFIO devices are passed through.
Note that AFAICS the race condition can also happen with other kinds of devices (I tried with e1000). So if the rescan stays, we should introduce a mechanism to make sure the race is avoided.
How much time does the lazy attachment wait? I would not like to force a client that wants to hotplug VFIO devices on a pcie-root-port to pay a double price (slow startup and pre-allocated root-ports). Any chance we can avoid the rescan (instead of waiting for it to finish) in the case of VFIO + root-ports?
cc @bpradipt @dgibson. Any comments from the power side?
@amorenoz
In what situations are block devices hotplugged on the bridge?
When devicemapper is the graph driver, the host's dm devices are hotplugged.
Are netdevs also hotplugged on the bridge right now?
AFAIK no; that's a question for @amshinde.
@amshinde
Instead of getting rid of the rescan code altogether, how about we lazy-attach (hotplug after the rescan has run) all VFIO devices, not just large-BAR devices, on pcie root ports, while hotplugging virtio-blk/net devices on the pcie-to-pci bridge. This may solve the issues you are seeing around VFIO passthrough. We can get away without creating too many pcie root ports with this approach, as typically not many VFIO devices are passed through.
This is prone to race conditions. For example, in a pod with 2 containers, a GPU is hot-added for container1, and since kata doesn't wait for the device initialization (5 seconds), kata creates container2; the PCI bus is then rescanned again, messing up the GPU that is still being initialized.
I was able to reproduce this manually (I wonder why this has not been reported yet):
$ lspci -k -v -s 01:01.0
01:01.0 Ethernet controller: Red Hat, Inc. Virtio network device (rev ff) (prog-if ff)
!!! Unknown header type 7f
Kernel driver in use: virtio-pci
This bug is basically not relevant for POWER: POWER doesn't use the shpc driver, and its para-virtualized PCI bus has a vanilla-PCI style topology rather than PCI-E, even though it can accommodate PCI-E devices. So we don't use root ports at all.
Also, @devimc @amshinde, note that I was able to reproduce the race conditions (without Kata) very easily on bridge hotplug and hot-unplug, and also on unplug from pcie-root-ports. However, in my test with Kata, the PCI rescan also broke pcie-root-port hotplugging.
Had a look at this and realized there's a significantly simpler - and more accurate - way of identifying PCIe devices: just look at the size of the 'config' file in sysfs. It will be either 256 bytes (plain PCI config space) or 4096 bytes (PCIe extended config space).
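A minimal sketch of that check (the function name is made up; the 256/4096-byte sizes are from the comment above):
```go
// Sketch of the config-space-size check: PCIe devices expose 4096 bytes of
// extended config space in sysfs, plain PCI devices only 256 bytes.
package pcicheck

import (
	"os"
	"path/filepath"
)

func isPCIeByConfigSize(bdf string) (bool, error) {
	info, err := os.Stat(filepath.Join("/sys/bus/pci/devices", bdf, "config"))
	if err != nil {
		return false, err
	}
	return info.Size() > 256, nil
}
```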
I'm working on a fix for this now.
Description of problem
A PCIe device should be pluggable into a pcie-root-port using the following combination of options:
However, doing so with devices connected to root-ports in the host fails with:
The reason is that isPCIeDevice looks at /sys/bus/pci/slots to determine whether the device is PCIe, based on the slot's reported speed. This mechanism is not generic enough and fails to recognize devices that are connected to root-ports, such as:
Expected result
Hotplug into pcie-root-port should work
Actual result
Devices are not plugged into pcie-root-ports. On q35 this means kata tries to plug them into the root bus, which does not support hot-plugging, and the following error message is shown: