metal3-io / metal3-docs

Architecture documentation that describes the components being built under Metal³.
http://metal3.io
Apache License 2.0

Handling Special Case Compute Devices #126

Closed Cloudsleuth closed 3 years ago

Cloudsleuth commented 4 years ago

An administrator places into a BMH a PCIe card that has a fully programmable device where:

Question: How shall we classify such dependent hosts and then bring them into full manageability? Note: the key to this issue is that the host-like device is capable of being detected via BMH introspection.

dhellmann commented 4 years ago

When you say "requires uploading of an OS image" do you mean to the PCIe card itself or to the host as a whole?

Cloudsleuth commented 4 years ago

Thanks for asking for clarification. I edited the issue - I hope that removes the ambiguity.

dhellmann commented 4 years ago

I'm not that familiar with these sorts of hardware devices. How does provisioning those images work in practice? I assume you run a tool on the OS on the host? Is it possible to put that tool into a container with special privileges? For example, I could see us provisioning nodes in the usual way, then having the hardware-classification-controller recognizing that a host has a specific add-on card in it and adding a label. That label could trigger another controller to run a Job on that node, and the Job could include the instructions for writing the image to the add-on card.
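A minimal sketch of the kind of Job described here, assuming a hypothetical classification label, vendor flashing image, and tool invocation; none of these names are existing Metal³ or hardware-classification-controller APIs:

```yaml
# Illustrative sketch only: the label key, container image, and flash command
# are hypothetical placeholders, not an existing Metal3 or vendor interface.
apiVersion: batch/v1
kind: Job
metadata:
  name: flash-smart-card
spec:
  template:
    spec:
      nodeSelector:
        # Hypothetical label assumed to be added by the
        # hardware-classification-controller when the add-on card is detected.
        example.metal3.io/has-smart-card: "true"
      containers:
      - name: flash
        # Hypothetical vendor tool that writes the OS image to the add-on card.
        image: example.com/vendor/card-flasher:latest
        command: ["card-flasher", "--image", "/images/card-os.img"]
        securityContext:
          privileged: true   # needs direct device access on the host
        volumeMounts:
        - name: dev
          mountPath: /dev
      volumes:
      - name: dev
        hostPath:
          path: /dev
      restartPolicy: OnFailure
```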

zaneb commented 4 years ago

As far as programming devices connected via PCIe, I don't think that's Metal³'s problem and it can be left to something on the host.

It sounds like you're saying that the PCIe card acts a bit like a BMC once it's been programmed from the host, and that after this the device itself can be programmed over the network in a manner not totally dissimilar to how Metal³ programs servers? It's not clear to me yet whether we would or would not consider this in-scope for Metal³.

I think the challenge you were alluding to in the meeting was in modelling the dependency relationship, whereby the Host first has to be provisioned (and the PCIe card programmed) before anything can be done to the device over the network. It's possible that k8s's continuous reconciliation model may just make this moot. Alternatively I'd expect it to be handled by some sort of orchestration layer outside of the baremetal-operator that knows about the relationship between the two. So I don't actually foresee any architectural implications for Metal³ itself here.

Cloudsleuth commented 4 years ago

These are an entirely new class of device, which means we will have to wait and see how they distill into production-ready devices. It is possible that the device will be provisioned (OS installed together with its runtime software/firmware and configuration) via the BMC prior to the BMH host itself being provisioned with its OS. This means that the operational provisioning of the PCIe-connected smart device (server/switch/GPU) must be done OOB. The BMC on the BMH will enable access to the PCIe-connected device; the smart device will not be accessible to the host OS until the host has been provisioned. So that the host will boot with a status of "healthy", the PCIe smart device must be capable of responding to validation probes confirming that its interfaces are all correctly functional - and this depends on the smart device being provisioned before the host itself.

Cloudsleuth commented 4 years ago

> I'm not that familiar with these sorts of hardware devices. How does provisioning those images work in practice? I assume you run a tool on the OS on the host? Is it possible to put that tool into a container with special privileges? For example, I could see us provisioning nodes in the usual way, then having the hardware-classification-controller recognizing that a host has a specific add-on card in it and adding a label. That label could trigger another controller to run a Job on that node, and the Job could include the instructions for writing the image to the add-on card.

This proposed method would be very messy. Here is what you are in effect proposing:

a) Boot the BMH with a PXE image that can install the OS, runtime software, and base-level configs to the PCIe smart device.
b) Warm reboot the BMH and provision the BMH.

=> Messy.

Cloudsleuth commented 4 years ago

> As far as programming devices connected via PCIe, I don't think that's Metal³'s problem and it can be left to something on the host.
>
> It sounds like you're saying that the PCIe card acts a bit like a BMC once it's been programmed from the host, and that after this the device itself can be programmed over the network in a manner not totally dissimilar to how Metal³ programs servers? It's not clear to me yet whether we would or would not consider this in-scope for Metal³.
>
> I think the challenge you were alluding to in the meeting was in modelling the dependency relationship, whereby the Host first has to be provisioned (and the PCIe card programmed) before anything can be done to the device over the network. It's possible that k8s's continuous reconciliation model may just make this moot. Alternatively I'd expect it to be handled by some sort of orchestration layer outside of the baremetal-operator that knows about the relationship between the two. So I don't actually foresee any architectural implications for Metal³ itself here.

What if the PCIe-hosted device is in fact a BMH itself? What if it is accessible via the BMC of the hosting server? Now consider the case where the hosting BMH can only be brought up in a state "healthy" when the PCIe-hosted BMH is first brought up (provisioned) to its own state "healthy."
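As a rough illustration of what that might look like, a sketch of a BareMetalHost for the PCIe-hosted device follows; the management address, the idea of reaching it through the hosting server's BMC, and the `hosted-by` annotation are all assumptions, since Metal³ defines no driver or ordering semantics for such nested devices:

```yaml
# Sketch only: the address, credentials name, and hosted-by annotation are
# hypothetical; Metal3 currently has no concept of a nested/dependent BMH.
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: smart-card-0
  annotations:
    # Hypothetical marker an external orchestrator could use to know which
    # physical server the card lives in (and which host depends on it).
    example.metal3.io/hosted-by: server-42
spec:
  online: true
  bmc:
    # Hypothetical management endpoint exposed (or proxied) via the hosting
    # server's BMC once the card has been programmed.
    address: redfish://10.0.0.42:8443/redfish/v1/Systems/card0
    credentialsName: smart-card-0-bmc-secret
  image:
    url: http://images.example.com/card-os.img
    checksum: http://images.example.com/card-os.img.md5sum
```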

dhellmann commented 4 years ago

>> I'm not that familiar with these sorts of hardware devices. How does provisioning those images work in practice? I assume you run a tool on the OS on the host? Is it possible to put that tool into a container with special privileges? For example, I could see us provisioning nodes in the usual way, then having the hardware-classification-controller recognizing that a host has a specific add-on card in it and adding a label. That label could trigger another controller to run a Job on that node, and the Job could include the instructions for writing the image to the add-on card.
>
> This proposed method would be very messy. Here is what you are in effect proposing:
>
> a) Boot the BMH with a PXE image that can install the OS, runtime software, and base-level configs to the PCIe smart device.
> b) Warm reboot the BMH and provision the BMH.
>
> => Messy.

Sure, if provisioning has to happen out of band then we wouldn't want to use a pod on the host. That wasn't clear from the original problem statement. Is that definitely how provisioning works in all cases?

Cloudsleuth commented 4 years ago

>>> I'm not that familiar with these sorts of hardware devices. How does provisioning those images work in practice? I assume you run a tool on the OS on the host? Is it possible to put that tool into a container with special privileges? For example, I could see us provisioning nodes in the usual way, then having the hardware-classification-controller recognizing that a host has a specific add-on card in it and adding a label. That label could trigger another controller to run a Job on that node, and the Job could include the instructions for writing the image to the add-on card.
>>
>> This proposed method would be very messy. Here is what you are in effect proposing:
>>
>> a) Boot the BMH with a PXE image that can install the OS, runtime software, and base-level configs to the PCIe smart device.
>> b) Warm reboot the BMH and provision the BMH.
>>
>> => Messy.
>
> Sure, if provisioning has to happen out of band then we wouldn't want to use a pod on the host. That wasn't clear from the original problem statement. Is that definitely how provisioning works in all cases?

Apologies for the lack of full contextual clarity to begin with. I am looking at a number of devices sourced from various vendors. Some are considered development-only devices and may morph. But the concept of one or more BMHs being nested inside a BMH is a use case we will see emerge in large distributed information systems. I'd like to begin the process of defining how such nested BMHs can be provisioned, even if we elect NOT to implement it at this time. Anticipation in design may save us a huge headache at a later date. Thanks for listening and for the interaction.
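One way the ordering could be anticipated without changing the baremetal-operator itself, following the external-orchestration idea raised above: the hosting server's BMH carries a pairing annotation, and a layer outside the operator only treats the host as ready once the nested BMH it references has been provisioned. The annotation key and names below are purely illustrative assumptions:

```yaml
# Hypothetical pairing seen from the hosting server's side; an orchestration
# layer outside the baremetal-operator would reconcile the ordering.
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: server-42
  annotations:
    # Illustrative only: "provision smart-card-0 before declaring me healthy".
    example.metal3.io/hosts-devices: smart-card-0
spec:
  online: true
  bmc:
    address: redfish-virtualmedia://10.0.0.42/redfish/v1/Systems/1
    credentialsName: server-42-bmc-secret
```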

metal3-io-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

metal3-io-bot commented 3 years ago

Stale issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle stale.

/close

metal3-io-bot commented 3 years ago

@metal3-io-bot: Closing this issue.

In response to [this](https://github.com/metal3-io/metal3-docs/issues/126#issuecomment-730799318):

> Stale issues close after 30d of inactivity. Reopen the issue with `/reopen`. Mark the issue as fresh with `/remove-lifecycle stale`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.