kata-containers / agent

Kata Containers version 1.x agent (for version 2.x see https://github.com/kata-containers/kata-containers). Virtual Machine agent for hardware virtualized containers
https://katacontainers.io/
Apache License 2.0

device: Accept the PCIAddress in the BDF format for block devices #823

Closed: likebreath closed this issue 3 years ago

likebreath commented 4 years ago

The kata-agent currently only accepts a PCIAddress from the kata-runtime in the "bridgeAddr/deviceAddr" format, which is tied to the device hotplug interface in qemu. We should extend the kata-agent to also accept a PCIAddress in the standard BDF format, which is more generic and would support different hypervisors, e.g. cloud-hypervisor.
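
For concreteness, here is a minimal sketch contrasting the two address forms under discussion: the current "bridgeAddr/deviceAddr" string versus a standard BDF string, with an illustrative parser for the latter. The names below are made up for illustration and are not the agent's actual API.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseBDF splits a guest PCI address written in the standard
// "DDDD:BB:DD.F" form (e.g. "0000:01:02.0") into domain, bus,
// device (slot) and function. Illustrative only.
func parseBDF(addr string) (domain, bus, dev, fn uint64, err error) {
	parts := strings.Split(addr, ":")
	if len(parts) != 3 {
		err = fmt.Errorf("malformed BDF address %q", addr)
		return
	}
	devFn := strings.Split(parts[2], ".")
	if len(devFn) != 2 {
		err = fmt.Errorf("malformed device.function in %q", addr)
		return
	}
	if domain, err = strconv.ParseUint(parts[0], 16, 16); err != nil {
		return
	}
	if bus, err = strconv.ParseUint(parts[1], 16, 8); err != nil {
		return
	}
	if dev, err = strconv.ParseUint(devFn[0], 16, 8); err != nil {
		return
	}
	fn, err = strconv.ParseUint(devFn[1], 16, 8)
	return
}

func main() {
	// Today the runtime sends something like "02/03" (bridgeAddr/deviceAddr);
	// the proposal is to also accept the BDF form below.
	fmt.Println(parseBDF("0000:01:02.0"))
}
```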

dgibson commented 4 years ago

I don't think this is a good idea. To be useful as part of the runtime<->agent protocol, any address we use has to be meaningful to both host and guest. A (guest) BDF address is meaningful to the guest kernel, but in general the host can only guess at it. The BB part depends on the guest's enumeration of the buses. Even if we count on the guest firmware to enumerate them in a particular way, the bus numbers don't even have to remain static for the lifetime of the guest.

The domain (DDDD part) has its own set of issues. On PC-like machines it will generally be consistent, because I think it's used as a parameter in one of the controller registers (although nearly all PC-like machines only use domain 0000 anyway). However PCI domains by definition have no PCI level connection to each other - they're independent host bridges - so there can be no PCI standard way of numbering them. On other platforms (e.g. IBM POWER) the bridges are identified by other means and the domain numbers are just allocated dynamically by the guest kernel (and on POWER it's common to have many domains).

Although the "bridgeAddr/slotAddr" format may have originated with qemu, it has real meaning in PCI space which both the host and guest can interpret, regardless of hypervisor - if CLH can generate (a guess at) guest BDF addresses, it can certainly generate the bridge and slot addresses (the "D" part of the BDF will be the device slot).

I would however suggest extending the bridgeAddr/slotAddr format to a more general "PCI path" by:

  • allowing a single entry, for a device directly on the root bus (no bridge)
  • allowing more than two entries, for devices behind multiple nested bridges
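
To make the generalization concrete, here is a rough sketch of what parsing such a path could look like, assuming each element is a two-digit hex slot number; this only illustrates the idea under discussion, not an agreed format.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePCIPath parses a generalized "PCI path" such as:
//   "04"       - device in slot 04 on the root bus (no bridge)
//   "02/03"    - today's bridgeAddr/slotAddr case (one bridge)
//   "02/05/03" - device behind two nested bridges
// Each element is a hex slot number; this is illustrative only.
func parsePCIPath(path string) ([]uint8, error) {
	var slots []uint8
	for _, tok := range strings.Split(path, "/") {
		slot, err := strconv.ParseUint(tok, 16, 8)
		if err != nil || slot > 0x1f { // PCI device/slot numbers are 5 bits
			return nil, fmt.Errorf("invalid slot %q in PCI path %q", tok, path)
		}
		slots = append(slots, uint8(slot))
	}
	return slots, nil
}

func main() {
	fmt.Println(parsePCIPath("02/05/03")) // [2 5 3] <nil>
}
```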

likebreath commented 4 years ago

@dgibson Thanks a lot for the detailed comments and explanation. This is really a good discussion on defining/improving the agent-runtime protocol for locating (block) devices. Let's bring this topic to a broader audience in our community. Please feel free to add more people here.

WDYT? @kata-containers/architecture-committee @amshinde @egernst @fidencio @sboeuf @rbradford @kata-containers/agent

sboeuf commented 3 years ago

I'd keep it simple. Since bridgeAddr/slotAddr is useful for some cases, as explained by @dgibson, I would simply have the agent check which format it received. By default the agent would expect bridgeAddr/slotAddr, but VMMs with the ability to determine the full BDF could be allowed to pass a BDF format instead.
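
Purely as an illustration of that check, here is a small sketch of how the agent could tell the two formats apart based on their separators; the names here are invented for the example and are not part of the agent today.

```go
package main

import (
	"fmt"
	"strings"
)

// addrFormat classifies the PCI address string the agent receives.
type addrFormat int

const (
	formatUnknown    addrFormat = iota
	formatBridgeSlot // "02/03" (bridgeAddr/slotAddr, the current default)
	formatBDF        // "0000:01:02.0" (full domain:bus:device.function)
)

func classify(addr string) addrFormat {
	switch {
	case strings.Count(addr, ":") == 2 && strings.Contains(addr, "."):
		return formatBDF
	case strings.Contains(addr, "/"):
		return formatBridgeSlot
	}
	return formatUnknown
}

func main() {
	fmt.Println(classify("02/03") == formatBridgeSlot) // true
	fmt.Println(classify("0000:01:02.0") == formatBDF) // true
}
```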

dgibson commented 3 years ago

@sboeuf see.. I don't think that is keeping it simple. By allowing for both options, you're increasing complexity of the protocol. And increasing complexity of the protocol is a bigger deal than increasing complexity of individual implementations (e.g. having the runtime convert one or more BDFs from the VMM into bridgeaddr/slotAddr form).

My view on that is coloured by the fact that I don't think "VMMs with the ability to determine the full BDF" is really a thing. There might be VMMs that think they can predict the BDF, but there's really no way it can be robust and reliable: even if it works now for the simple cases they support, it could well break in future.

sboeuf commented 3 years ago

And increasing complexity of the protocol is a bigger deal than increasing complexity of individual implementations (e.g. having the runtime convert one or more BDFs from the VMM into bridgeaddr/slotAddr form).

I'm definitely okay with that. If this gives slightly more work to the runtime so that the agent's protocol can stay simple, it sounds like a good compromise. The part where I'm skeptical is how the agent is going to find the device in the guest based on the bridgeAddr/slotAddr. Because the agent will expect a bridge to be there, the way it finds the device might not be appropriate for the simple use case where there's no PCI-to-PCI bridge. I'll let @likebreath investigate and give his feedback on this, but I don't think the current agent implementation would work with CH.

even if it works now for the simple cases they support, it could well break in future.

Fair enough.

dgibson commented 3 years ago

Sorry, I should clarify. I think we should stick to just the "PCI path" form, of which bridgeAddr/slotAddr is a special case (and the only one supported so far). I do think we should extend that to allow a single entry (no bridge) or >2 entries (multiple bridges). That's a strict generalization of the existing format, rather than an entirely new one, though.

Also note, though, that any hotplugged PCI-E device will be under at least one (virtual) PCI to PCI bridge - that's what a root port is, and the PCI-E hotplug protocol requires a root port (or PCI-E switch, which introduces at least 2 virtual P2P bridges). The only way to not have P2P bridges at all is to restrict yourself to vanilla-PCI. Even there, the SHPC hotplug protocol requires a bridge, I believe, though I'm not sure about the ACPI hotplug protocol.

sboeuf commented 3 years ago

I think we should stick to just the "PCI path" form

What do you mean by PCI path?

Also note, though, that any hotplugged PCI-E device will be under at least one (virtual) PCI to PCI bridge - that's what a root port is, and the PCI-E hotplug protocol requires a root port (or PCI-E switch, which introduces at least 2 virtual P2P bridges). The only way to not have P2P bridges at all is to restrict yourself to vanilla-PCI. Even there, the SHPC hotplug protocol requires a bridge, I believe, though I'm not sure about the ACPI hotplug protocol.

When hotplugging through ACPI, no bridge is needed. That's what we use with CH, it's the simplest way of performing PCI hotplug.

dgibson commented 3 years ago

I think we should stick to just the "PCI path" form

What do you mean by PCI path?

I mean listing the slot+function of every bridge leading to the device, then the device itself. bridgeAddr/slotAddr is a PCI path with the restriction that there's exactly 1 bridge, and we only use function 0.
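
A rough sketch (not existing agent code) of how the guest side could resolve such a path through sysfs, assuming each element is a two-digit hex slot with function 0, the walk starts from the root bus 0000:00, and a bridge's secondary bus can be discovered via its pci_bus/ subdirectory:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// pciPathToSysfs resolves a PCI path such as "03" or "02/03" to the
// device's entry under /sys/bus/pci/devices. Illustrative sketch only.
func pciPathToSysfs(pciPath string) (string, error) {
	bus := "0000:00" // start at the root bus of domain 0
	var bdf string

	slots := strings.Split(pciPath, "/")
	for i, slot := range slots {
		// Each element is the two-digit hex slot of the next bridge/device;
		// function 0 is assumed, as in the existing bridgeAddr/slotAddr form.
		bdf = fmt.Sprintf("%s:%s.0", bus, slot)

		if i == len(slots)-1 {
			break // last element is the device itself
		}

		// Intermediate elements are P2P bridges: discover the secondary
		// bus they expose via their pci_bus/ subdirectory in sysfs.
		matches, err := filepath.Glob(filepath.Join("/sys/bus/pci/devices", bdf, "pci_bus", "*"))
		if err != nil || len(matches) != 1 {
			return "", fmt.Errorf("cannot determine secondary bus of bridge %s", bdf)
		}
		bus = filepath.Base(matches[0])
	}

	return filepath.Join("/sys/bus/pci/devices", bdf), nil
}

func main() {
	fmt.Println(pciPathToSysfs("01/02"))
}
```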

Also note, though, that any hotplugged PCI-E device will be under at least one (virtual) PCI to PCI bridge - that's what a root port is, and the PCI-E hotplug protocol requires a root port (or PCI-E switch, which introduces at least 2 virtual P2P bridges). The only way to not have P2P bridges at all is to restrict yourself to vanilla-PCI. Even there, the SHPC hotplug protocol requires a bridge, I believe, though I'm not sure about the ACPI hotplug protocol.

When hotplugging through ACPI, no bridge is needed. That's what we use with CH, it's the simplest way of performing PCI hotplug.

Ok. Does that allow PCI-E devices or only PCI? IIUC plugging a PCI-E device without a root port would mean treating it as an integrated endpoint, which would be... odd... although it might work anyway.

c3d commented 3 years ago

I like the idea of having a somewhat general PCI path. However, I have a concern that this might be trying to generalize something that is, at its core, hypervisor dependent. In other words, what would a kernel running on qemu do with a path that does not have the bridge?

So it looks to me like the agent should have some hypervisor-dependent callback that knows how to deal with the guest device format generated by that hypervisor:

  • For qemu, you ask qmp, so you always expect a bridgeAddr/slotAddr format, and the recently proposed lookup code that uses /sys would work

  • for clh, ACPI hotplug might mean no bridge, so maybe another format is required, and that would in turn require a different callback in the agent that knows how to deal with that. That specific code might be able to work without a bridge at all, and might be unable to process /br1/br2/device at all.

So what about making the runtime prefix the address with a hypervisor name, and letting the agent invoke different lookup code that may take different input? In other words, replace bridgeAddr/slotAddr with qemu:bridge/slotAddr, and later allow for clh:whateverACPIinfo with the corresponding handling code?
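
A rough sketch of the prefix dispatch being suggested; the scheme names and handler shape are purely illustrative, and a clh handler is left out since its address format is still open.

```go
package main

import (
	"fmt"
	"strings"
)

// lookupFn resolves a hypervisor-specific device address to a guest
// sysfs path. Everything here is illustrative, not an agreed protocol.
type lookupFn func(addr string) (string, error)

var lookups = map[string]lookupFn{
	"qemu": func(addr string) (string, error) {
		// would interpret today's "bridgeAddr/slotAddr" form
		return "", fmt.Errorf("bridge/slot lookup not implemented in this sketch")
	},
	// a "clh" entry with its own address format and lookup code could be
	// registered here later
}

// resolve splits "<scheme>:<address>" and dispatches to the matching
// lookup routine, defaulting to the current qemu-style handling when no
// prefix is present. A real implementation would need to be careful with
// ':' characters appearing inside the address itself.
func resolve(prefixed string) (string, error) {
	scheme, addr := "qemu", prefixed
	if parts := strings.SplitN(prefixed, ":", 2); len(parts) == 2 {
		scheme, addr = parts[0], parts[1]
	}
	fn, ok := lookups[scheme]
	if !ok {
		return "", fmt.Errorf("unknown address scheme %q", scheme)
	}
	return fn(addr)
}

func main() {
	fmt.Println(resolve("qemu:02/03"))
}
```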

In my opinion, this also leaves the door open for future changes in either the qmp output format or PCI layout in the VM. So in the future, if qemu 6 changes something major, we could have different callbacks for qemu and qemu6.

dgibson commented 3 years ago

I like the idea of having a somewhat general PCI path. However, I have a concern that this might be trying to generalize something that is, at its core, hypervisor dependent.

No, it's really not. The PCI path as I'm describing it absolutely has a well-defined meaning purely in terms of PCI defined entities, regardless of hypervisor or platform. That's exactly why I'm suggesting it.

Well... to deal with the possibility of multiple PCI roots we do need a platform-dependent (as distinct from hypervisor-dependent) extension to express which PCI root. But, one problem at a time.

In other words, what would a kernel running on qemu do with a path that does not have the bridge?

Grab the device with that slot on the root bus. With qemu, I don't think you can combine that with PCI-E, but that form is entirely plausible for either cold-plugged devices, or devices on the pc rather than q35 platform.

So it looks to me like the agent should have some hypervisor-dependent callback that knows how to deal with the guest device format generated by that hypervisor:

  • For qemu, you ask qmp, so you always expect a bridgeAddr/slotAddr format, and the recently proposed lookup code that uses /sys would work

Actually, most instances of this format don't query qemu; instead we allocate the addresses ourselves and tell qemu what to use. My VFIO code is an exception.

  • for clh, ACPI hotplug might mean no bridge, so maybe another format is required, and that would in turn require a different callback in the agent that knows how to deal with that. That specific code might be able to work without a bridge at all, and might be unable to process /br1/br2/device at all.

So what about making the runtime prefix the address with a hypervisor name, and letting the agent invoke different lookup code that may take different input? In other words, replace bridgeAddr/slotAddr with qemu:bridge/slotAddr, and later allow for clh:whateverACPIinfo with the corresponding handling code?

Prefixing the string with a "scheme" identifier of some sort isn't necessarily a bad idea (particularly to deal with multiple domains in future). But tying that to the hypervisor is broken (almost by definition - the address needs to be meaningful to the guest, which doesn't even need to be aware of what hypervisor it's running under).

In my opinion, this also leaves the door open for future changes in either the qmp output format or PCI layout in the VM. So in the future, if qemu 6 changes something major, we could have different callbacks for qemu and qemu6.

There is nothing specific to the qmp output format here.

likebreath commented 3 years ago

@dgibson Sorry for the delayed response. Thanks again for sharing your insights. Given your much deeper understanding of and experience with this area than mine, would you mind drafting a PR based on the proposal you mentioned above (to extend the current bridgeaddr/slotAddr format towards a more general PCI path format)? Maybe that's a good way to continue the discussions here? WDYT? If you have any other thoughts, please don't hesitate to share them.

Meanwhile, I can follow up on verifying whether the bridgeaddr/slotAddr format would work for CLH. As @sboeuf mentioned earlier: "Because the agent will expect a bridge to be there, the way it finds the device might not be appropriate for the simple use case where there's no PCI-to-PCI bridge." If you need anything else related to CLH, please let me know.

/cc @sboeuf @c3d @amshinde

dgibson commented 3 years ago

@likebreath sorry for the delay. I'm happy to tackle this PCI path cleanup; it's just a question of when. I had been planning to leave this until Kata 2, but it sounds like you might need it before then.

In fact, I've now encountered a very similar problem myself. My #850 PR is blocked because there's a CI failure on VFIO. AFAICT at this stage, that failure is because it's doing a VFIO plug on cloud-hypervisor, and my code only adds PCI path generation for VFIO devices on qemu. It would be great if I could use your clh expertise to help fix this up.

dgibson commented 3 years ago

@likebreath I've created #854 to track this work (the agent side, anyway)

dgibson commented 3 years ago

@likebreath now that #855 is merged, should we close this issue?

likebreath commented 3 years ago

@dgibson Agree. Closing the issue in favor of #854 and #855.