/cc @egernst @bergwolf @laijs @WeiZhang555 @amshinde @devimc @jodh-intel @mcastelino @sameo
@sboeuf overall the approach sounds good to me! This looks to be more stable than `virtio-blk`. I guess we can have a PCI bridge data structure in the qemu driver and use it to track the PCI bridge addr and its bus number. Then each PCI device can reference the PCI bridge. The same data structure can be used by other hypervisors like libvirt as well, but we can keep it in qemu until we support other hypervisors.
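A minimal sketch of what such a bridge-tracking structure could look like on the runtime side, assuming Go as in virtcontainers (type and field names here are illustrative, not an existing API):

```go
package qemu

import "fmt"

// Bridge tracks a PCI bridge created by the runtime so that every device
// plugged behind it gets a predictable slot. Type and field names are
// illustrative assumptions, not the actual virtcontainers types.
type Bridge struct {
	ID      string         // Qemu "id=" of the bridge
	Addr    int            // slot of the bridge itself on the main bus pci.0
	Bus     int            // guest PCI bus number assigned to the bridge
	Devices map[int]string // slot -> device ID for devices behind the bridge
}

// addDevice reserves the next free slot on the bridge (a PCI bus has 32
// slots) and returns it, so the caller can build "bus=<ID>,addr=<slot>".
func (b *Bridge) addDevice(deviceID string) (int, error) {
	if b.Devices == nil {
		b.Devices = make(map[int]string)
	}
	for slot := 1; slot < 32; slot++ {
		if _, busy := b.Devices[slot]; !busy {
			b.Devices[slot] = deviceID
			return slot, nil
		}
	}
	return 0, fmt.Errorf("no free slot left on bridge %s", b.ID)
}
```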
@sboeuf @bergwolf Based on the problems we have faced these days with `virtio-blk` device name prediction, a specific addr and bus is needed for management and extension. Can't wait to see it.
@sboeuf I like the design. A stupid question: will this still have a limitation on the number of PCI devices (I guess 30 bridges * 32)? Or can it be extended without limit?
@WeiZhang555 I don't know the limit on the number of bridges (@devimc @mcastelino any idea about this?). Every bridge is limited to 32 devices, according to PCI.
I have been taking a look at this. One of the things we need is to know the device node created under /dev as a result of a device attach. Right now we watch the uevents that we receive from the kernel, which give us the "devpath", like "/devices/pci0000:00/0000:00:03.0/0000:00:01.0/block/vda". We were relying on the last part of the path to get the device node (which works for the block devices we support currently but may not work for other kinds of devices which have device nodes created under directories).
I think what we really need, to find out whether a device attach results in a device node being created, is something like `udev_device_get_devnode()` in the libudev library, to get this information from the device database maintained by udev. This would require the kata agent to depend on the libudev library, with the assumption that the udev daemon is used for device node management (which I think is the standard now).
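For context, the current devpath heuristic mentioned above boils down to something like the following sketch (simplified, not the actual agent code):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// deviceNodeFromDevpath is the current heuristic: take the last component of
// the uevent devpath and assume the node lives directly under /dev. This
// works for "/devices/.../block/vda" -> "/dev/vda", but breaks for
// hierarchical nodes such as /dev/dri/card0.
func deviceNodeFromDevpath(devpath string) string {
	return filepath.Join("/dev", filepath.Base(devpath))
}

func main() {
	fmt.Println(deviceNodeFromDevpath(
		"/devices/pci0000:00/0000:00:03.0/0000:00:01.0/block/vda"))
	// Prints: /dev/vda
}
```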
@bergwolf I am not sure how this will play out when the agent is used as init (PID 1). Do you still run a separate udev daemon for managing device nodes?
@amshinde

> which works for the block devices we support currently but may not work for other kinds of devices which have device nodes created under directories

Can you explain a bit more why it does not work for other devices? I thought that if we can rely on the PCI device address for virtio-blk device identification, it does not matter what device name it gives us in the end. Did you mean there are virtio-blk devices that can create their device nodes in a different path pattern?
@amshinde And to answer your question about the udev daemon, we do want to avoid it for the agent-as-init case.
We chose to always use virtio-scsi in runv. I see that virtcontainers supports both of them. Is there a special use case for virtio-blk?
@bergwolf `virtio-blk` is our legacy block device support. We didn't want to remove it from the code base since there are some specific use cases that might require it for better performance.
@bergwolf Basically I want to know what the device node is under /dev. We want to add support for passing all kinds of devices (audio, GPU, etc.) with VFIO, and we want to know the device node created under /dev for these kinds of devices. For example, a graphics card may appear as /dev/dri/card0, which is a hierarchical device name. Looking at the uevent we would not be able to deduce the device node name; we would need to ask udev what the actual device created is.
cc @sameo
@amshinde The kernel devtmpfs is responsible for creating the proper device nodes under /dev. udev just manages all the `by-uuid`-style symlinks, so udev is not a must to get a working device tree. In your example, when you have /dev/dri/card0, you should also have something like /sys/devices/pci0000:00/0000:00:0f.0/drm/card0, and we can get the device PCI address -> device name mapping from there.
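A rough sketch of how the agent could derive that mapping from sysfs alone, by walking the PCI device's directory and looking for entries that expose a `dev` (major:minor) file; helper names are illustrative, not existing agent code:

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// devNodeNamesForPCIDevice walks the sysfs tree of a PCI device (for example
// /sys/bus/pci/devices/0000:00:0f.0) and collects the names of entries that
// carry a "dev" file. Each such entry corresponds to a device node created
// by devtmpfs; its "uevent" file also carries DEVNAME (e.g. DEVNAME=dri/card0)
// giving the full path under /dev.
func devNodeNamesForPCIDevice(pciSysfsPath string) ([]string, error) {
	var names []string
	err := filepath.WalkDir(pciSysfsPath, func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		if !d.IsDir() {
			return nil
		}
		if _, statErr := os.Stat(filepath.Join(path, "dev")); statErr == nil {
			names = append(names, d.Name()) // e.g. "card0" or "vda"
		}
		return nil
	})
	return names, err
}

func main() {
	names, err := devNodeNamesForPCIDevice("/sys/bus/pci/devices/0000:00:0f.0")
	if err != nil {
		fmt.Println("walk error:", err)
		return
	}
	fmt.Println(names)
}
```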
This looks like a good approach. If we can determine the device name from sysfs, then we get what we want. But now I am wondering how this applies when udev has some rules on the system: do we still keep the original device node while udev creates a duplicate, or is the device node completely replaced?
@bergwolf Not all uevents result in device nodes being created. Also, the actual device node created is quite different from the syspath provided in the uevent. In the example you have given as well, I want to derive /dev/dri/card0 from /sys/devices/pci0000:00/0000:00:0f.0/drm/card0 in a reliable and generic way for all devices, for which I found the udev APIs quite useful. I suppose I can use the DEVNAME field in the uevent (which we have not been capturing so far); I think that field is populated by the kernel when an actual device node is created.
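A small sketch of what capturing DEVNAME could look like, assuming the uevent payload is available as the usual NUL-separated KEY=VALUE list (names here are illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// devNodeFromUevent extracts DEVNAME from a kernel uevent payload made of
// NUL-separated KEY=VALUE pairs. DEVNAME is only set by the kernel when an
// actual device node exists, so its absence also answers the question of
// whether a device attach produced a node at all.
func devNodeFromUevent(payload []byte) (string, bool) {
	for _, field := range strings.Split(string(payload), "\x00") {
		if strings.HasPrefix(field, "DEVNAME=") {
			return "/dev/" + strings.TrimPrefix(field, "DEVNAME="), true
		}
	}
	return "", false
}

func main() {
	payload := []byte("ACTION=add\x00DEVNAME=dri/card0\x00SUBSYSTEM=drm\x00")
	if node, ok := devNodeFromUevent(payload); ok {
		fmt.Println(node) // /dev/dri/card0
	}
}
```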
@sboeuf I assume we do not rely on udev and thus do not start the udev daemon.
@amshinde Are you worried about getting too many uevents? By the time the agent is up and running, the uevents we get are most likely just for those hot plugged devices. Anyway, capturing DEVNAME is also a good idea.
I've been discussing with people from the Clear Containers team about the way we could improve our current codebase (virtcontainers) by providing a reliable way to know the PCI BDF that we should expect on the guest for every PCI device that we cold/hot plug into the VM.

Every PCI device defined by Qemu allows for the following options: `addr` and `bus`. The `addr` stands for the DF part of BDF, meaning Device and Function, and is given with the following syntax: `addr=02.5`, where `02` is the third slot (slot is the equivalent of Device) and `5` is the sixth function. The `bus` parameter can be used to refer to any bus that Qemu knows about. By default, it knows the main bus `pci.0`, but because this bus is limited to 32 devices, and because some of those slots (0 and 1) are reserved by Qemu, using a PCI bridge as a new PCI bus will increase the number of devices we will eventually be able to plug. Those PCI bridges can be given any `id` we want, and the `bus` parameter of a device can then refer to this `id`. Concretely, this means we can do something like this:
From the guest, this is the list of devices we get:

`../../../devices/pci0000:00/0000:00:15.0` refers to the PCI bridge `mybridgebus5` on the main bus `pci.0`. We can see that it has been plugged as expected, since it shows up at the right location on PCI bus 0. Now, we need one more step before we can find our `virtio-serial-pci` device back: we need the bus number corresponding to the bridge. Indeed, whatever number we choose for the chassis number, the bus number will be incremented according to what's available next. In this case, when the bridge was created, PCI bus 1 was the next available. We can find this by looking into:
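One way to find that bus number from inside the guest (a hedged example: the bridge's sysfs entry exposes its secondary bus through a `pci_bus/` subdirectory) is:

```sh
# The pci_bus/ directory of the bridge at 0000:00:15.0 is named after the
# bus it provides, here bus 1.
$ ls /sys/bus/pci/devices/0000:00:15.0/pci_bus/
0000:01
```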
Based on the bus number found from the address of the PCI bridge, we now know that all devices plugged through this bridge will end up on PCI bus `1`. And this leads us to find the complete BDF of our device on the guest: `0000:01:1f.0`.
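On the runtime side, once the bridge bus number and the assigned slot are tracked, composing the expected guest BDF is straightforward. A tiny sketch (hypothetical helper, not an existing virtcontainers function):

```go
package qemu

import "fmt"

// guestBDF builds the BDF we expect to see in the guest for a device plugged
// at the given slot/function behind a bridge whose bus number we tracked.
// For example, guestBDF(1, 0x1f, 0) returns "0000:01:1f.0".
func guestBDF(bus, slot, function int) string {
	return fmt.Sprintf("0000:%02x:%02x.%d", bus, slot, function)
}
```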
BTW, we have got some feedback that our `virtio-blk` name prediction might not work in a consistent way (in case of containers being stopped and started inside the same pod), and this will also be the solution to that problem.

WDYT? Should we go ahead and start implementing this?