siderolabs / omni

SaaS-simple deployment of Kubernetes - on your own hardware.

automatic machine provisioning #270

Open smira opened 1 month ago

smira commented 1 month ago

Provisioning

  1. Omni creates a MachineRequest to provision a Machine in some cloud with some settings (region, size, ...)
  2. Omni Cloud Provider (a plugin to Omni) picks up the MachineRequest, and provisions the machine in the cloud.
  3. Omni Cloud Provider creates a MachineRequestStatus describing the machine that will join Omni (e.g. it contains the UUID of the Machine that is going to be created and join Omni).
  4. As a provisioned Machine joins Omni, the provider matches it with the request and labels/annotates it as requested (for example, a label indicates that this Machine should be part of a MachineClass, so it will join the MachineSet).
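
Step 4 could be sketched roughly as below. This is only an illustration in Python, not Omni code: the `MachineRequestStatus` and `Machine` classes, their fields, and the `match_machine` helper are hypothetical names. The only part taken from the steps above is the idea of matching a joining machine by the UUID recorded in the status and then applying the requested labels.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MachineRequestStatus:
    # Hypothetical shape: the originating request, the UUID reported by the
    # provider, and the labels the request asked to put on the machine.
    request_id: str
    machine_uuid: str
    labels: dict[str, str] = field(default_factory=dict)


@dataclass
class Machine:
    uuid: str
    labels: dict[str, str] = field(default_factory=dict)


def match_machine(machine: Machine, statuses: list[MachineRequestStatus]) -> Optional[str]:
    """Label a joining machine as its request asked; return the matched request ID."""
    for status in statuses:
        if status.machine_uuid == machine.uuid:
            machine.labels.update(status.labels)
            return status.request_id
    return None  # the machine was not provisioned through a known request
```

A machine whose UUID matches no status is left untouched, which keeps manually joined machines out of the flow.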

De-provisioning

  1. Machine is removed in the cloud -> Omni Cloud Provider picks it up and starts removing the Machine from Omni.
  2. MachineRequest is removed -> Omni Cloud Provider removes the Machine from Omni, removes the machine in the cloud, and cleans up everything.
  3. Machine is removed from Omni (e.g. faulty hardware) -> Omni Cloud Provider cleans it up in the cloud.
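
The three cases above can be read as a small trigger-to-actions table. The sketch below is just that reading written down in Python; the `Trigger` enum and the action names are hypothetical and do not correspond to any existing Omni API.

```python
from enum import Enum


class Trigger(Enum):
    CLOUD_MACHINE_REMOVED = "machine removed in the cloud"
    REQUEST_REMOVED = "MachineRequest removed"
    OMNI_MACHINE_REMOVED = "Machine removed from Omni"


# Hypothetical action names; each list mirrors one numbered case above.
CLEANUP_ACTIONS = {
    Trigger.CLOUD_MACHINE_REMOVED: ["remove_machine_from_omni"],
    Trigger.REQUEST_REMOVED: [
        "remove_machine_from_omni",
        "remove_cloud_machine",
        "clean_up_provider_state",
    ],
    Trigger.OMNI_MACHINE_REMOVED: ["remove_cloud_machine"],
}


def cleanup_actions(trigger: Trigger) -> list[str]:
    """Return the ordered clean-up steps the provider would run for a trigger."""
    return list(CLEANUP_ACTIONS[trigger])
```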
smira commented 1 month ago

Omni Cloud Provider Deployment

  1. Might run with Omni (customer cloud access keys are given to Omni)
  2. Might run outside of Omni (authenticates to Omni using a service account; access keys stay within the external deployment)
smira commented 1 month ago

Omni Cloud Provider

  1. Connects to Omni API (some specific role, like 'Cloud Provider').
  2. Fetches cloud provider config resource from Omni (or uses config supplied via the environment). Cloud Provider should publish a schema of its expected config.
  3. Watches MachineRequest resources with specific label (targeting this provider ID).
  4. If this is a new MachineRequest that needs to be satisfied:
    • puts a finalizer on the MachineRequest
    • creates a MachineRequestStatus in a not-ready state and uses it to store internal progress status
    • fetches the SideroLinkJoinParams resource from Omni to figure out how to make the machine join Omni
    • starts provisioning a machine in the cloud; it might update MachineRequestStatus with some data so that operations can be restarted on failure
    • once the machine is ready (booting), updates MachineRequestStatus to ready and stores the UUID of the machine which is going to join Omni (or labels the machine in a unique way); Omni should be able to match a joining Machine back to the MachineRequest
  5. If this MachineRequest is being torn down:
    • destroys the infrastructure
    • destroys the MachineRequestStatus resource
    • removes the finalizer on the MachineRequest
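
Steps 4 and 5 amount to a reconcile function. A minimal in-memory sketch in Python, under loud assumptions: `FakeCloud`, the `FINALIZER` name, the resource classes, and the `reconcile` signature are all hypothetical stand-ins, and a real provider would talk to the Omni API and a cloud SDK instead of plain dicts.

```python
from dataclasses import dataclass, field
from typing import Optional

FINALIZER = "cloud-provider"  # hypothetical finalizer name


@dataclass
class MachineRequest:
    id: str
    finalizers: set[str] = field(default_factory=set)
    tearing_down: bool = False


@dataclass
class MachineRequestStatus:
    ready: bool = False
    machine_uuid: Optional[str] = None


class FakeCloud:
    """Stand-in for a real cloud API; only records which machines exist."""

    def __init__(self) -> None:
        self.machines: dict[str, str] = {}  # request ID -> machine UUID

    def provision(self, request_id: str, join_params: str) -> str:
        # join_params is unused here; a real provider would pass it to the
        # machine so that it can join Omni.
        uuid = f"uuid-for-{request_id}"
        self.machines[request_id] = uuid
        return uuid

    def destroy(self, request_id: str) -> None:
        self.machines.pop(request_id, None)


def reconcile(
    request: MachineRequest,
    statuses: dict[str, MachineRequestStatus],
    cloud: FakeCloud,
    join_params: str,
) -> None:
    if request.tearing_down:
        # Step 5: destroy infrastructure, then the status, then the finalizer.
        cloud.destroy(request.id)
        statuses.pop(request.id, None)
        request.finalizers.discard(FINALIZER)
        return

    # Step 4: finalizer first, then a not-ready status, then provisioning.
    request.finalizers.add(FINALIZER)
    status = statuses.setdefault(request.id, MachineRequestStatus())
    if status.ready:
        return  # already satisfied, nothing to do
    status.machine_uuid = cloud.provision(request.id, join_params)
    status.ready = True
```

Removing the finalizer last means the MachineRequest cannot disappear while cloud resources still exist, which is the point of the ordering in step 5.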
smira commented 1 month ago

MachineRequest Resource

metadata:
   id: machine-xyz
   type: MachineRequests
   labels:
      provider-id: <aws>
spec:
   talosVersion: v1.7.3
   schematic:
      extensions:
         - bnx2-firmware
      customization:
   provider:
      <unstructured, specific to the provider>
      <provider publishes a schema>
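
Since the `provider` section is unstructured and the provider publishes a schema, the section could be checked against that schema before the request is accepted. A rough sketch in Python: the schema format (field name to expected type) and the `region`/`size` fields are assumptions made for illustration, borrowed from the "region, size, ..." settings mentioned in the provisioning steps; no real schema format has been decided here.

```python
# Hypothetical schema a provider might publish: field name -> expected type.
PROVIDER_SCHEMA = {"region": str, "size": str}


def validate_provider_config(config: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the config is valid."""
    errors = []
    for name, expected in schema.items():
        if name not in config:
            errors.append(f"missing field: {name}")
        elif not isinstance(config[name], expected):
            errors.append(f"{name}: expected {expected.__name__}")
    for name in config:
        if name not in schema:
            errors.append(f"unknown field: {name}")
    return errors
```

For example, `validate_provider_config({"region": "us-east-1", "size": "t3.small"}, PROVIDER_SCHEMA)` returns an empty list.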
smira commented 3 weeks ago

Machine operations

metadata:
   # Omni-managed
   id: machine-xyz
   type: MachineControlSpec
spec:
   poweredOn: true/false # desired state
   rebootID: 33 # needs a better name; when it changes, a new reboot is executed
---
metadata:
   # Cloud provider-managed
   id: machine-xyz
   type: MachineControlStatus
spec:
   poweredOn: true/false # current status
   lastRebootID: 33 # if it matches rebootID, the last reboot has been executed
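
The rebootID/lastRebootID pair works like a generation counter: the provider compares the two resources and acts on any difference. A minimal sketch in Python, assuming `poweredOn` in the spec expresses the desired state; the class layout and the `pending_operations` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MachineControlSpec:  # Omni-managed
    powered_on: bool
    reboot_id: int


@dataclass
class MachineControlStatus:  # cloud provider-managed
    powered_on: bool
    last_reboot_id: int


def pending_operations(spec: MachineControlSpec, status: MachineControlStatus) -> list[str]:
    """Return the operations the provider still has to run on the machine."""
    ops = []
    if spec.powered_on != status.powered_on:
        ops.append("power_on" if spec.powered_on else "power_off")
    if spec.reboot_id != status.last_reboot_id:
        # After rebooting, the provider would copy reboot_id into last_reboot_id.
        ops.append("reboot")
    return ops
```

Comparing IDs rather than using a boolean "reboot now" flag keeps the request idempotent: re-reading the same spec never triggers a second reboot.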