kata-containers / runtime

Kata Containers version 1.x runtime (for version 2.x see https://github.com/kata-containers/kata-containers).
https://katacontainers.io/
Apache License 2.0

PcieDevice detection fails #2678

Closed amorenoz closed 4 years ago

amorenoz commented 4 years ago

Description of problem

A PCIe device should be pluggable into a pcie-root-port using the following combination of options:

hotplug_vfio_on_root_bus = true
pcie_root_port = 2

However, doing so with devices connected to root-ports in the host fails with:

Bus 'pcie.0' does not support hotplugging: OCI runtime error

The reason is that isPCIeDevice looks at /sys/bus/pci/slots and decides whether the device is PCIe based on the slot's reported speed. This mechanism is not generic enough and fails to recognize devices that are connected to root ports, such as:

$ lspci -t
[...]
 +-[0000:64]-+-00.0-[65-66]--+-00.0  Intel Corporation Ethernet Controller X710 for 10GbE SFP+                                                                                                                                                  
 |           |               \-00.1  Intel Corporation Ethernet Controller X710 for 10GbE SFP+
[...]
$ lspci -s 000:64:00.0 
64:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A (rev 04)
$ cat  /sys/bus/pci/slots/2/{address,max_bus_speed}
0000:64:00
Unknown
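
For reference, the slot-based heuristic described above boils down to something like the sketch below (hypothetical helper, not the exact kata-runtime code): the device address is matched against the entries under /sys/bus/pci/slots and the slot's max_bus_speed decides whether the device counts as PCIe, so a slot reporting Unknown, as in the output above, makes the check fail.

```go
// Sketch of the slot-speed check (illustrative names, not the exact
// kata-runtime implementation).
package main

import (
	"fmt"
	"io/ioutil"
	"path/filepath"
	"strings"
)

// isPCIeBySlotSpeed returns true only if a slot entry matching the
// device's address reports a PCIe bus speed.
func isPCIeBySlotSpeed(bdf string) bool {
	slots, _ := filepath.Glob("/sys/bus/pci/slots/*")
	for _, slot := range slots {
		addr, err := ioutil.ReadFile(filepath.Join(slot, "address"))
		if err != nil {
			continue
		}
		// The slot address omits the function, e.g. "0000:64:00".
		if !strings.HasPrefix(bdf, strings.TrimSpace(string(addr))) {
			continue
		}
		speed, err := ioutil.ReadFile(filepath.Join(slot, "max_bus_speed"))
		if err != nil {
			return false
		}
		// "Unknown", as in the output above, fails this test even for a
		// genuine PCIe device behind a root port.
		return strings.Contains(string(speed), "GT/s")
	}
	return false
}

func main() {
	fmt.Println(isPCIeBySlotSpeed("0000:65:00.0"))
}
```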

Expected result

Hotplug into pcie-root-port should work

Actual result

Devices are not plugged into pcie-root-ports. On q35 this means kata tries to plug them into the root bus, which does not support hot-plugging, and the following error is shown:

Bus 'pcie.0' does not support hotplugging: OCI runtime error

bpradipt commented 4 years ago

@amorenoz @devimc as a fallback mechanism, can we check whether the value under /sys/bus/pci/devices/xxx/max_link_speed is >= 5.0 GT/s? That file seems to always be populated.
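
A minimal sketch of that fallback (illustrative only; the path is the standard sysfs layout, the threshold is the one suggested above):

```go
// Sketch of the suggested fallback: treat the device as PCIe if
// max_link_speed parses to >= 5.0 GT/s (illustrative only).
package main

import (
	"fmt"
	"io/ioutil"
	"strconv"
	"strings"
)

func isPCIeByLinkSpeed(bdf string) bool {
	data, err := ioutil.ReadFile("/sys/bus/pci/devices/" + bdf + "/max_link_speed")
	if err != nil {
		return false
	}
	// Typical contents: "2.5 GT/s", "5 GT/s", "8 GT/s" or "Unknown speed".
	fields := strings.Fields(string(data))
	if len(fields) < 2 || fields[1] != "GT/s" {
		return false
	}
	speed, err := strconv.ParseFloat(fields[0], 64)
	return err == nil && speed >= 5.0
}

func main() {
	fmt.Println(isPCIeByLinkSpeed("0000:65:00.0"))
}
```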

If that's fine, I can work on a PR.

amorenoz commented 4 years ago

@bpradipt If we do that, we will have to detect if the device is SR-IOV and probe the PF:

$ lspci
65:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
65:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
65:02.0 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
65:02.1 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
65:02.2 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
65:02.3 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)

$ cat /sys/bus/pci/devices/0000\:65\:02.3/max_link_speed
Unknown speed
$ cat /sys/bus/pci/devices/0000\:65\:02.3/physfn/max_link_speed
8 GT/s
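
A rough sketch of how that PF fallback could look (illustrative; physfn is the standard sysfs symlink from a VF to its physical function):

```go
// Sketch of probing the PF when a VF reports "Unknown speed"
// (illustrative; physfn is the standard sysfs link from VF to PF).
package main

import (
	"fmt"
	"io/ioutil"
	"os"
	"path/filepath"
	"strings"
)

func maxLinkSpeed(bdf string) string {
	dev := filepath.Join("/sys/bus/pci/devices", bdf)
	speed, err := ioutil.ReadFile(filepath.Join(dev, "max_link_speed"))
	if err != nil {
		return ""
	}
	s := strings.TrimSpace(string(speed))
	if !strings.Contains(s, "Unknown") {
		return s
	}
	// SR-IOV VF: fall back to the physical function, if present.
	if _, err := os.Stat(filepath.Join(dev, "physfn")); err == nil {
		if pf, err := ioutil.ReadFile(filepath.Join(dev, "physfn", "max_link_speed")); err == nil {
			return strings.TrimSpace(string(pf))
		}
	}
	return s
}

func main() {
	fmt.Println(maxLinkSpeed("0000:65:02.3"))
}
```
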
amorenoz commented 4 years ago

I would raise a more general question: do we need to check whether it's PCIe at all? I think qemu should cope with a plain PCI device as well.

dagrh commented 4 years ago

My guess is this is an optimisation to avoid creating the root port when it isn't necessary, so no, I suspect it's mostly fine to plug into a PCIe root port on q35. A PCIe device should have an Express Endpoint capability, but I don't know an easy way to check for it in /proc. I wonder if you need to check whether the device has ioport addresses and take care when creating the root port.
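
For what it's worth, the Express capability can be found by walking the capability list in the device's 'config' file under sysfs rather than /proc. A rough sketch, assuming the standard config-space layout (capability-list flag in the status register at offset 0x06, list head at offset 0x34, PCI Express capability ID 0x10):

```go
// Sketch: look for the PCI Express capability (ID 0x10) in the
// capability list of the sysfs 'config' file (illustrative only).
package main

import (
	"fmt"
	"io/ioutil"
)

func hasExpressCap(bdf string) bool {
	cfg, err := ioutil.ReadFile("/sys/bus/pci/devices/" + bdf + "/config")
	if err != nil || len(cfg) < 64 {
		return false
	}
	// Bit 4 of the status register (offset 0x06) says whether a
	// capability list exists; its head pointer lives at offset 0x34.
	if cfg[0x06]&0x10 == 0 {
		return false
	}
	off := int(cfg[0x34]) &^ 0x3
	for i := 0; i < 48 && off != 0 && off+1 < len(cfg); i++ {
		if cfg[off] == 0x10 { // PCI Express capability ID
			return true
		}
		off = int(cfg[off+1]) &^ 0x3
	}
	return false
}

func main() {
	fmt.Println(hasExpressCap("0000:65:00.0"))
}
```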

amorenoz commented 4 years ago

@devimc @amshinde What is the use case for hotplugging a non-PCIe device on a q35 kata VM? AFAICS, the decision on the client side is:

amshinde commented 4 years ago

@amorenoz For q35, we were hotplugging devices on a pcie-pci-bridge to avoid creating root ports. The functionality to hotplug devices on a pcie root port was added for devices that require a large BAR space, so as not to lose PCIe-specific functionality. Like you mentioned, we do bear the cost of creating a pcie-root-port here.

Given the recent issues and the complexity of the q35 hotplug code in Kata, and @devimc's recent proposal to remove the PCI scan, I think we can consider the following to resolve them and simplify pcie hotplug in general:

1. Either go with @devimc proposal to remove the pci rescan code. This means we would incur the 5s delay for devices hotplugged on pcie-to-pci-bridge, which is certainly not acceptable for container startup time. We would need to get rid of hotplugging devices on pcie-to-pci-bridge and instead hotplug all devices on pcie root port instead. This would mean we need to configure the pcie root ports for a pod, knowing the number of devices that need to be passed to all the containers in a pod. (In addition to VFIO, we hotplug block based devices as well)

2. Instead of getting rid of the rescan code altogether, how about we lazy attach (hotplug after rescan has run) all VFIO devices, not just large bar devices, on pcie root ports while hotplugging virtio-blk/net devices on pcie-to-pci bridge. This may solve the issues that you are seeing around VFIO passthrough. We can get away with creating not too many pcie root ports with this approach as typically not many vfio devices are passed through.

amorenoz commented 4 years ago

1. Either go with @devimc proposal to remove the pci rescan code. This means we would incur the 5s delay for devices hotplugged on pcie-to-pci-bridge, which is certainly not acceptable for container startup time. We would need to get rid of hotplugging devices on pcie-to-pci-bridge and instead hotplug all devices on pcie root port instead. This would mean we need to configure the pcie root ports for a pod, knowing the number of devices that need to be passed to all the containers in a pod. (In addition to VFIO, we hotplug block based devices as well)

Some questions: In what situations are block devices hotplugged on the bridge? Are netdevs also hotplugged on the bridge right now?

2. Instead of getting rid of the rescan code altogether, how about we lazy attach(hotplug after rescan has run) all VFIO devices, not just large bar devices on pcie root ports while hotplugging virtio-blk/net devices on pcie-to-pci bridge. This may solve the issues that you are seeing around VFIO passthrough. We can get away with creating not too many pcie root ports with this approach as typically not many vfio devices are passed through.

Note that AFAICS, the race condition can also happen with other kinds of devices (tried with e1000). So if the rescan stays, we should introduce a mechanism to make sure the race is avoided.

How much time does the lazy attachment wait? I would not like to force a client that wants to hotplug vfio devices on a pcie-root-port to pay a double price (slow startup and pre-allocated root-ports). Any chance we can avoid the rescan (instead of waiting for it to finish) in the case of vfio + root-ports?
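
For context, the rescan being discussed is triggered from inside the guest via the standard kernel interface of writing 1 to /sys/bus/pci/rescan; a minimal sketch of that generic mechanism (the agent's exact code may differ):

```go
// The rescan in question: userspace asks the kernel to re-enumerate
// the PCI bus by writing "1" to /sys/bus/pci/rescan (generic kernel
// interface; the kata agent's exact code may differ).
package main

import "io/ioutil"

func rescanPCIBus() error {
	return ioutil.WriteFile("/sys/bus/pci/rescan", []byte("1"), 0200)
}

func main() {
	// Any device still initializing when this runs gets re-probed,
	// which is the race discussed above.
	_ = rescanPCIBus()
}
```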

amorenoz commented 4 years ago

cc @bpradipt @dgibson. Any comments from the power side?

devimc commented 4 years ago

@amorenoz

In what situations are block devices hotplugged on the bridge?

When devicemapper is the graphdriver, the host's dm devices are hotplugged.

Are netdevs also hotplugged on the bridge right now?

afaik no, question for @amshinde

@amshinde

Instead of getting rid of the rescan code altogether, how about we lazy attach(hotplug after rescan has run) all VFIO devices, not just large bar devices on pcie root ports while hotplugging virtio-blk/net devices on pcie-to-pci bridge. This may solve the issues that you are seeing around VFIO passthrough. We can get away with creating not too many pcie root ports with this approach as typically not many vfio devices are passed through.

This is prone to race conditions. For example, in a pod with 2 containers, a GPU is hot-added for container1, and since kata doesn't wait for the device initialization (5 seconds), kata will create container2, so the pci bus is rescanned again, messing up the GPU that is still being initialized. I was able to reproduce this manually (I wonder why this has not been reported yet):

lspci -k -v -s 01:01.0
01:01.0 Ethernet controller: Red Hat, Inc. Virtio network device (rev ff) (prog-if ff)
    !!! Unknown header type 7f
    Kernel driver in use: virtio-pci

dgibson commented 4 years ago

This bug is basically not relevant for POWER: POWER doesn't use the shpc driver, and its para-virtualized PCI bus has a vanilla-PCI style topology rather than PCI-E, even though it can accommodate PCI-E devices. So we don't use root ports at all.

amorenoz commented 4 years ago

Also @devimc @amshinde: note that I was able to reproduce the race conditions (without Kata) very easily on bridge hotplug and hot-unplug, and also on unplug from pcie-root-ports. However, in my test with Kata, the pci-rescan also broke pcie-root-port hotplugging.

dgibson commented 4 years ago

Had a look at this and realized there's a significantly simpler, and more accurate, way of identifying PCIe devices: just look at the size of the 'config' file in sysfs. That will either be 256 bytes (plain PCI config space) or 4096 bytes (PCIe extended config space).
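
A minimal sketch of that size-based check (illustrative only, not the actual fix):

```go
// Sketch of the config-size check: 4096 bytes means PCIe extended
// config space, 256 bytes means plain PCI (illustrative only).
package main

import (
	"fmt"
	"os"
)

func isPCIeByConfigSize(bdf string) (bool, error) {
	fi, err := os.Stat("/sys/bus/pci/devices/" + bdf + "/config")
	if err != nil {
		return false, err
	}
	return fi.Size() == 4096, nil
}

func main() {
	fmt.Println(isPCIeByConfigSize("0000:65:02.3"))
}
```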

I'm working on a fix for this now.