kata-containers / runtime

Kata Containers version 1.x runtime (for version 2.x see https://github.com/kata-containers/kata-containers).
https://katacontainers.io/
Apache License 2.0

virtcontainers: Implement a generic way to control PCI device address on the guest #115

Closed sboeuf closed 3 years ago

sboeuf commented 6 years ago

I've been discussing with people from the Clear Containers team about how we could improve our current codebase (virtcontainers) by providing a reliable way to know the PCI BDF we should expect on the guest for every PCI device that we cold/hot plug into the VM.

Every PCI device defined by Qemu allows for the following options: addr and bus. The addr stands for the DF from BDF, meaning Device and Function, and can be given with the following syntax: addr=02.5, where 02 is the third slot (a slot is the equivalent of a Device) and 5 is the sixth Function. The bus parameter can be used to refer to any bus that Qemu knows about. By default, it knows the main bus pci.0, but this bus is limited to 32 devices, and some slots (0 and 1) are reserved by Qemu, so using a PCI bridge as a new PCI bus increases the number of devices we will eventually be able to plug. Those PCI bridges can be named with any id we want, and a device's bus parameter can then refer to that id. Concretely, this means we can do something like this:

qemu ... \
-device pci-bridge,addr=15.0,chassis_nr=5,id=mybridgebus5 \
-device virtio-serial-pci,id=serial0,bus=mybridgebus5,addr=1f.0

From the guest, this is the list of devices we get:

# ls -la /sys/bus/pci/devices
lrwxrwxrwx 1 root root 0 Mar 22 05:51 0000:00:00.0 -> ../../../devices/pci0000:00/0000:00:00.0
lrwxrwxrwx 1 root root 0 Mar 22 05:51 0000:00:01.0 -> ../../../devices/pci0000:00/0000:00:01.0
lrwxrwxrwx 1 root root 0 Mar 22 05:51 0000:00:01.1 -> ../../../devices/pci0000:00/0000:00:01.1
lrwxrwxrwx 1 root root 0 Mar 22 05:51 0000:00:01.3 -> ../../../devices/pci0000:00/0000:00:01.3
lrwxrwxrwx 1 root root 0 Mar 22 05:51 0000:00:15.0 -> ../../../devices/pci0000:00/0000:00:15.0
lrwxrwxrwx 1 root root 0 Mar 22 05:51 0000:01:1f.0 -> ../../../devices/pci0000:00/0000:00:15.0/0000:01:1f.0

../../../devices/pci0000:00/0000:00:15.0 refers to the PCI bridge mybridgebus5 on the main bus pci.0. We can see that it has been plugged as expected, since it shows up at the right location on PCI bus 0. Now, we need one more step before we can find our virtio-serial-pci device: the bus number corresponding to the bridge. Indeed, whatever chassis number we choose, the bus number is simply the next one available. In this case, when the bridge was created, PCI bus 1 was the next available. We can find this by looking into:

# ls -la /sys/bus/pci/devices/0000:00:15.0/pci_bus/
0000:01

Based on the bus number found from the PCI bridge's address, we now know that all devices plugged through this bridge will end up on PCI bus 1. This gives us the complete BDF of our device on the guest: 0000:01:1f.0
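
To make this concrete, here is a minimal Go sketch of the guest-side lookup described above, assuming we know the bridge's BDF on the root bus and the addr we assigned to the device on the Qemu command line. The function name is illustrative, not existing virtcontainers code:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// guestDeviceBDF returns the guest BDF we expect for a device plugged on a
// PCI bridge (e.g. "0000:01:1f.0"), given the bridge's own BDF on the root
// bus (e.g. "0000:00:15.0") and the addr we chose for the device ("1f.0").
func guestDeviceBDF(bridgeBDF, devAddr string) (string, error) {
	// The bridge exposes the bus number it provides under pci_bus/, e.g.
	// /sys/bus/pci/devices/0000:00:15.0/pci_bus/0000:01
	entries, err := os.ReadDir(filepath.Join("/sys/bus/pci/devices", bridgeBDF, "pci_bus"))
	if err != nil {
		return "", err
	}
	if len(entries) != 1 {
		return "", fmt.Errorf("expected exactly one bus under %s/pci_bus", bridgeBDF)
	}
	// entries[0].Name() is "<domain>:<bus>", e.g. "0000:01".
	return fmt.Sprintf("%s:%s", entries[0].Name(), devAddr), nil
}

func main() {
	bdf, err := guestDeviceBDF("0000:00:15.0", "1f.0")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println(bdf) // 0000:01:1f.0
}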

BTW, we have received some feedback that our virtio-blk device name prediction might not work consistently (when containers are stopped and started inside the same pod), and this approach would also solve that.

WDYT? Should we go ahead and start implementing this?

sboeuf commented 6 years ago

/cc @egernst @bergwolf @laijs @WeiZhang555 @amshinde @devimc @jodh-intel @mcastelino @sameo

bergwolf commented 6 years ago

@sboeuf overall the approach sounds good to me! This looks more stable than the virtio-blk approach. I guess we can have a PCI bridge data structure in the qemu driver and use it to track the pci-bridge addr and its bus number; then each PCI device can reference the PCI bridge. The same data structure could be used by other hypervisors like libvirt as well, but we can keep it in qemu until we support other hypervisors.
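
As a rough illustration of the bookkeeping described above, here is a hedged Go sketch of such a bridge-tracking structure. The type and field names are invented for this example and are not the actual virtcontainers API:

package main

import "fmt"

// PCIBridge tracks a pci-bridge device added on the Qemu command line.
type PCIBridge struct {
	ID      string            // Qemu id, e.g. "mybridgebus5"
	Addr    string            // slot.function on the root bus, e.g. "15.0"
	Bus     int               // guest bus number assigned to the bridge, e.g. 1
	devices map[string]string // device id -> addr on this bridge
}

// AttachDevice records a device plugged on this bridge and returns the
// guest BDF we expect for it.
func (b *PCIBridge) AttachDevice(id, addr string) string {
	if b.devices == nil {
		b.devices = map[string]string{}
	}
	b.devices[id] = addr
	return fmt.Sprintf("0000:%02x:%s", b.Bus, addr)
}

func main() {
	bridge := &PCIBridge{ID: "mybridgebus5", Addr: "15.0", Bus: 1}
	fmt.Println(bridge.AttachDevice("serial0", "1f.0")) // 0000:01:1f.0
}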

jshachm commented 6 years ago

@sboeuf @bergwolf Based on the problems we have faced lately with virtio-blk device name prediction, specifying the addr and bus explicitly is needed for management and extension. Can't wait to see it.

WeiZhang555 commented 6 years ago

@sboeuf I like the design. A stupid question: will this still have a limit on the number of PCI devices (I guess 30 bridges * 32)? Or can it be extended without limit?

sboeuf commented 6 years ago

@WeiZhang555 I don't know the limit on the number of bridges (@devimc @mcastelino any idea about this?). Every bridge is limited to 32 devices, per the PCI specification.

amshinde commented 6 years ago

I have been taking a look at this. One of the things we need to know is the device node created under /dev as a result of a device attach. Right now we watch the uevents we receive from the kernel, which give us a "devpath" like "/devices/pci0000:00/0000:00:03.0/0000:00:01.0/block/vda". We were relying on the last part of the path to get the device node (which works for the block devices we support currently, but may not work for other kinds of devices whose device nodes are created under subdirectories). To find out whether a device attach results in a device node being created, I think we really need something like udev_device_get_devnode from the libudev library, which reads this information from the device database maintained by udev. This would make the kata agent depend on libudev, with the assumption that the udev daemon is used for device node management (which I think is the standard now). @bergwolf I am not sure how this will play out when the agent is used as init (PID 1); do you still run a separate udev daemon for managing device nodes?
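
To make the current assumption explicit, here is a minimal Go sketch of deriving the /dev node from the last component of the uevent devpath. The helper is illustrative only; it works for the block devices mentioned above but breaks for nodes created under subdirectories:

package main

import (
	"fmt"
	"path"
)

// devNodeFromDevpath guesses the /dev node from a uevent devpath by taking
// its last path element. This is the fragile approach being discussed.
func devNodeFromDevpath(devpath string) string {
	return "/dev/" + path.Base(devpath)
}

func main() {
	fmt.Println(devNodeFromDevpath("/devices/pci0000:00/0000:00:03.0/0000:00:01.0/block/vda"))
	// /dev/vda -- correct for block devices
	fmt.Println(devNodeFromDevpath("/devices/pci0000:00/0000:00:02.0/drm/card0"))
	// /dev/card0 -- wrong; the real node is /dev/dri/card0
}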

bergwolf commented 6 years ago

@amshinde

which works for the block devices we support currently but may not work for other kind of devices which have device nodes created under directories

Can you explain a bit more why it does not work for other devices? I thought that if we can rely on the PCI device address for virtio-blk device identification, it does not matter what device name we end up with. Did you mean there are virtio-blk devices that can create their device nodes in a different path pattern?

bergwolf commented 6 years ago

@amshinde And to answer your question about the udev daemon, we do want to avoid it for the agent-as-init case.

We chose to always use virtio-scsi in runv. I see that virtcontainers supports both of them. Is there a special use case for virtio-blk?

sboeuf commented 6 years ago

@bergwolf virtio-blk is our legacy block device support. We didn't want to remove it from the code base since there are some specific use cases that might require it for better performance.

amshinde commented 6 years ago

@bergwolf Basically I want to know what the device node under /dev is. We want to add support for passing all kinds of devices (audio, GPU, etc.) with VFIO, and we want to know the device node created under /dev for these kinds of devices. For example, a graphics card may appear as /dev/dri/card0, which is a hierarchical device name. Looking at the uevent, we would not be able to deduce the device node name; we would need to ask udev to know what the actual device created is.

amshinde commented 6 years ago

cc @sameo

bergwolf commented 6 years ago

@amshinde The kernel devtmpfs is responsible for creating the proper device nodes under /dev; udev just manages things like the by-uuid symlinks. So udev is not a must to get a working device tree. In your example, when you have /dev/dri/card0, you should also have something like /sys/devices/pci0000:00/0000:00:0f.0/drm/card0, and we can get the device PCI address -> device name mapping from there.
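
A hedged Go sketch of that sysfs-only mapping: scan a device's sysfs subtree for "dev" files (the major:minor entries devtmpfs uses) and report the kernel names found there. This is illustrative, not existing agent code, and mapping a kernel name back to its exact path under /dev (e.g. the dri/ subdirectory) may still need extra logic:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// deviceNodesUnder walks the sysfs subtree of a PCI device and returns the
// kernel names of the device nodes found there (e.g. "card0", "vda").
func deviceNodesUnder(sysPath string) ([]string, error) {
	var names []string
	err := filepath.Walk(sysPath, func(p string, info os.FileInfo, err error) error {
		if err != nil {
			return nil // skip transient sysfs errors
		}
		// A "dev" file holds the major:minor pair of a device node.
		if !info.IsDir() && info.Name() == "dev" {
			names = append(names, filepath.Base(filepath.Dir(p)))
		}
		return nil
	})
	return names, err
}

func main() {
	nodes, err := deviceNodesUnder("/sys/devices/pci0000:00/0000:00:0f.0")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println(nodes) // e.g. [card0 renderD128]
}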

sboeuf commented 6 years ago

This looks like a good approach. If we can determine the device name from sysfs, then we get what we want. But now I am wondering how this applies when udev has some rules on the system: do we still keep the original device node and udev creates a duplicate, or is the node completely replaced?

amshinde commented 6 years ago

@bergwolf Not all uevents result in device nodes being created. Also, the actual device node created can be quite different from the syspath provided in the uevent. In the example you gave, I still want to derive /dev/dri/card0 from /sys/devices/pci0000:00/0000:00:0f.0/drm/card0 in a reliable and generic way for all devices, and for that I found the udev APIs quite useful. I suppose I can use the DEVNAME field in the uevent instead (which we have not been capturing so far); I think that field is populated by the kernel when an actual device node is created.
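
For illustration, a small Go sketch of reading DEVNAME out of a uevent payload (NUL-separated KEY=VALUE pairs); when the kernel creates a device node it includes DEVNAME relative to /dev, e.g. "vda" or "dri/card0". The parsing helper is hypothetical, not the kata agent's actual code:

package main

import (
	"fmt"
	"strings"
)

// parseUevent splits a raw uevent payload into its KEY=VALUE fields.
func parseUevent(raw []byte) map[string]string {
	fields := map[string]string{}
	for _, kv := range strings.Split(string(raw), "\x00") {
		if i := strings.IndexByte(kv, '='); i > 0 {
			fields[kv[:i]] = kv[i+1:]
		}
	}
	return fields
}

func main() {
	// Abridged example payload for a GPU hotplug.
	raw := []byte("ACTION=add\x00DEVPATH=/devices/pci0000:00/0000:00:02.0/drm/card0\x00SUBSYSTEM=drm\x00DEVNAME=dri/card0\x00")
	fields := parseUevent(raw)
	if devname, ok := fields["DEVNAME"]; ok {
		fmt.Println("/dev/" + devname) // /dev/dri/card0
	}
}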

bergwolf commented 6 years ago

@sboeuf I assume we do not rely on udev and thus do not start the udev daemon.

@amshinde Are you worried about getting too many uevents? By the time the agent is up and running, the uevents we get are most likely just for the hot plugged devices. Anyway, capturing DEVNAME is also a good idea.