The cluster creation phase for v0.1.0 Tech Preview

gianarb commented 3 years ago

I want to use this issue to describe what I have in my mind and how I think we should implement the lifecycle for v0.1.0 Tech preview of CAPT.

First TInkerbellClusterSpec has 2 lists of hardware IDs coming from Tinkerbell one for the control plane and one for the data plane.

Based on what is required the machine controller will take an Id from the right list.

The first machine that CAPI creates is always a control plane, and we can follow the same implementation we made for CAPP during its first release. The first control plane IP will become the Custer Endpoint. Just for simplicity. (we can think about a HA like solution later)

From there everything else should work almost in the same way as any other provider implementation. We have to figure out how to correctly inject UserData in the metadata server (@mmlb can you give us some tips here).

To summarize the interaction with Tinkerbell in the Machine creation:

Identify available hardware ID from the list specified in the TinkerbellClusterSpec
If it is the first control plane set the Cluster Endpoint with the hardware IP.
Edit the metadata for that selected hardware ID storing the right UserData.
Create a workflow from the specified template (templateName is part of the TinkerbellMachineSpec)
Wait until the workflow is in success state (this means that the OS is ready
From there the OS is running and the OS should be able to read the user data persisted at point 3 and everything should work!

Temporary avoid CCM

Technically if Tinkerbell replaces a cloud provider we should have a CCM for Tinkerbell, but for now, we should try to label and annotate nodes with the ProviderID during machine creation if possible, even if it is not the right way, it is faster.

mmlb commented 3 years ago

Hmm well this is actually a tough question as to get user data you need to stuff it into hardware.metadata.instance.userdata. Which is one of those hidden packet ways things that is going on that I want to get rid of but I think needs to be done with a proper "Instance" object hence the instance proposal. Because as it is right now there's no well-defined way of doing it, on purpose.

As things stand today, this is in the "up to the operator how to serve and how to fetch it via workflow" area, at least w/o going through with the proposal. Hegel has an envvar tunable where the operator can configure where to grab the data from to serve both /metadata and /userdata endpoints. BUT no such var exists for the EC2 style metadata requests cloud-init does. So that means a code change to Hegel instead of just an envvar change.

IMO its just easier/better for everyone involved if we define some kind of OS+unique-per-installa-info like we do in EquinixMetal land.

And I think the hardware.metadata.instance.userdata (hegel's default today) should then become hardware.instance.userdata if the proposal is accepted (or w/e name we come up with) which means it'll impact CAPT.

gianarb commented 3 years ago

Thanks, @mmlb for the context you shared with us. Now, I am sure the instance proposal will be accepted and in any case the unstructured metadata object as it is today won't stay forever. No matter what I presume it means a BC break. A "not that small one".

So, this should not stop us to get CAPT v0.1.0 Tech Preview out, because Tinkerbell is not v1.0.0 and we expect things to change today.

So how do we do it? If we pass a user data today as hardware.metadata.instance.userdata will Ubuntu (not the packet one, but the official one) pick up the user data in the right way? or not?

Even if we accept the instance proposal today I presume the actual implementation will be a 2021 kind of thing, right? CAPT v0.1.0 has to move its first steps in 2020.

IMO its just easier/better for everyone involved if we define some kind of OS+unique-per-installa-info like we do in EquinixMetal land.

Can you elaborate because I don't know what we do at EquinixMetal land.

mmlb commented 3 years ago

Thanks, @mmlb for the context you shared with us. Now, I am sure the instance proposal will be accepted and in any case the unstructured metadata object as it is today won't stay forever. No matter what I presume it means a BC break. A "not that small one".

Very much not a small one :D.

So, this should not stop us to get CAPT v0.1.0 Tech Preview out, because Tinkerbell is not v1.0.0 and we expect things to change today.

So how do we do it? If we pass a user data today as hardware.metadata.instance.userdata will Ubuntu (not the packet one, but the official one) pick up the user data in the right way? or not?

It should work but you'd need to go in and configure cloud-init after laying down the image. We need tell cloud-init where and in what format it can fetch the cloud-config data. Here's how we do it in EM production today: https://github.com/tinkerbell/osie/blob/master/docker/scripts/osie.sh#L349-L355

Even if we accept the instance proposal today I presume the actual implementation will be a 2021 kind of thing, right? CAPT v0.1.0 has to move its first steps in 2020.

Yep

IMO its just easier/better for everyone involved if we define some kind of OS+unique-per-installa-info like we do in EquinixMetal land.

Can you elaborate because I don't know what we do at EquinixMetal land.

Lets have this part over at the proposal instead.

detiber commented 3 years ago

Closing, as this has already been completed for the initial tech preview.

tinkerbell / cluster-api-provider-tinkerbell

The cluster creation phase for v0.1.0 Tech Preview #3

Temporary avoid CCM