vmware-tanzu / community-edition

VMware Tanzu Community Edition is no longer an actively maintained project. Code is available for historical purposes only.
https://tanzucommunityedition.io/
Apache License 2.0

Add dynamic TKr resolution based on a compatibility file #3538

Closed jpmcb closed 2 years ago

jpmcb commented 2 years ago

Feature Request

Currently, when we ship unmanaged-cluster, we hardcode the TKr into the source code.

We should support inspecting a compatibility file to pull the correct TKr. This way, if we push a release with a broken TKr, we can update the TKr and compatibility file to then reference a working TKr. This would follow a similar model to how management-cluster works.

joshrosso commented 2 years ago

Some key details around this functionality:

  1. With the way production harbor is set up, we cannot rely on a tag like :latest to resolve the newest option.
     a. To solve this, we should rely on an incrementing tag in a consistent location. I recommend: project.registry.vmware.com/tce/compatibility. The files (bundles) uploaded here should be tagged :v1, :v2, :v3, etc.
     b. Our unmanaged-cluster should start by querying this location for available tags and choose the largest number.
     c. Based on that, we should store the compatibility file locally, named after the tag, so we know whether it needs to be downloaded. (See the sketch after this list.)

  2. We should add the default user-managed package repo into the TKr. We can do this by updating the tkr package's definition of a TKR. This will enable us to change the default user-managed package repo across releases without recompiling unmanaged-cluster.
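
A minimal sketch of the tag-resolution flow from item 1, assuming the go-containerregistry crane package for listing registry tags; the repository constant, package name, and cache layout are illustrative assumptions, not settled decisions:

package compat

import (
    "fmt"
    "os"
    "path/filepath"
    "strconv"
    "strings"

    "github.com/google/go-containerregistry/pkg/crane"
)

// Illustrative location only; the final repository path is not decided here.
const compatRepo = "projects.registry.vmware.com/tce/compatibility"

// latestCompatibilityTag lists the tags on the compatibility repo and
// returns the highest v<N> tag (v1, v2, v3, ...).
func latestCompatibilityTag() (string, error) {
    tags, err := crane.ListTags(compatRepo)
    if err != nil {
        return "", err
    }
    best := -1
    for _, t := range tags {
        n, err := strconv.Atoi(strings.TrimPrefix(t, "v"))
        if err != nil {
            continue // ignore tags that are not v<number>
        }
        if n > best {
            best = n
        }
    }
    if best < 0 {
        return "", fmt.Errorf("no v<N> tags found in %s", compatRepo)
    }
    return fmt.Sprintf("v%d", best), nil
}

// compatibilityFilePath caches the file under a name derived from the tag,
// so an already-downloaded tag does not need to be fetched again.
func compatibilityFilePath(cacheDir, tag string) (string, bool) {
    p := filepath.Join(cacheDir, "compatibility-"+tag+".yaml")
    _, err := os.Stat(p)
    return p, err == nil
}

Choosing the numerically largest v<N> tag sidesteps the :latest limitation while naming the cached file after the tag lets the CLI skip downloads it already has.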

Initially, we'll give this one to @jpmcb and reassess as we get farther.

jorgemoralespou commented 2 years ago

Is incremental tagging (:v1, :v2, :v3) a good approach, or would it be better to use the BOM version with a semver suffix (e.g. :v0.11.0+b1)?

joshrosso commented 2 years ago

The reason an incrementing tag for compatibility files is compelling:

  1. We always want to capture the newest compatibility file.
     a. If we stay consistent with TKG, compatibility files are there to map the version of the CLI to compatible TKr(s).

This means that CLI version 0.11.0 would pull the same compatibility file (say, v9) as a different CLI version, say 0.15.0.

Within v9, there would be mappings to compatible BOMs for each CLI version.
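
To make the v9 example concrete, here is a minimal sketch of a lookup against such a compatibility file; the struct fields and YAML keys are assumptions for illustration, not the actual TKG compatibility format:

package compat

import "fmt"

// Compatibility is an assumed shape: one entry per CLI version, each listing
// the TKr(s) that version is known to work with.
type Compatibility struct {
    Versions []struct {
        CLIVersion string   `yaml:"version"`
        TKrs       []string `yaml:"supportedTkrs"` // field names are illustrative
    } `yaml:"cliVersions"`
}

// TKrsFor returns the compatible TKr references for a given CLI version.
// CLI 0.11.0 and CLI 0.15.0 would read the same file (say, the one tagged v9)
// but land on different entries within it.
func (c Compatibility) TKrsFor(cliVersion string) ([]string, error) {
    for _, v := range c.Versions {
        if v.CLIVersion == cliVersion {
            return v.TKrs, nil
        }
    }
    return nil, fmt.Errorf("no compatible TKr listed for CLI %s", cliVersion)
}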

jorgemoralespou commented 2 years ago

Gotcha! I just saw that the version was for the compatibility file and not for the TKR, as I initially thought. My bad! Understood, and it makes sense. I'll go and dig into these files now!

jorgemoralespou commented 2 years ago

@jpmcb Not sure if your PR solves this issue, but:

When creating a cluster with a config file, you need to specify the image to be used for the node; otherwise it'll use kindest/node:v1.23.4. The image comes from the TKR, and a user would need to know how to extract that information from the TKR manually to use it in the kind config. Also, if the TKR is incremented but the user still uses a config file, there are two options (depending on the implementation): either the code looks for a newer TKR even when one is pinned to a specific version, in which case the node image in the config will never be updated, or the update check is bypassed so that the user-specified TKR, and with it the node image, is preserved.

Example config that will always use the default kind image:

ClusterName: test
KubeconfigPath: ""
NodeImage: ""
Provider: kind
ProviderConfiguration:
  rawKindConfig: |
    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    nodes:
    - role: control-plane
      extraPortMappings:
      - containerPort: 80
        hostPort: 80
        listenAddress: "127.0.0.1"
        protocol: TCP
      - containerPort: 443
        hostPort: 443
        listenAddress: "127.0.0.1"
        protocol: TCP
Cni: calico
CniConfiguration: {}
PodCidr: 10.244.0.0/16
ServiceCidr: 10.96.0.0/16
TkrLocation: projects.registry.vmware.com/tce/tkr:v0.17.0
SkipPreflight: false
ControlPlaneNodeCount: "1"
WorkerNodeCount: "0"
ExistingClusterKubeconfig: ""

Example config that will use the image extracted from the TKR:

ClusterName: test
KubeconfigPath: ""
NodeImage: ""
Provider: kind
ProviderConfiguration:
  rawKindConfig: |
    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    nodes:
    - role: control-plane
      image: projects.registry.vmware.com/tce/kind:v1.22.4
      extraPortMappings:
      - containerPort: 80
        hostPort: 80
        listenAddress: "127.0.0.1"
        protocol: TCP
      - containerPort: 443
        hostPort: 443
        listenAddress: "127.0.0.1"
        protocol: TCP
Cni: calico
CniConfiguration: {}
PodCidr: 10.244.0.0/16
ServiceCidr: 10.96.0.0/16
TkrLocation: projects.registry.vmware.com/tce/tkr:v0.17.0
SkipPreflight: false
ControlPlaneNodeCount: "1"
WorkerNodeCount: "0"
ExistingClusterKubeconfig: ""

Also note that in the first case, the image printed in the logs when creating the cluster is the TCE one, although the one actually used is the kind one.

$ tanzu uc create -f config.yml

๐Ÿ“ Created cluster directory
   Reading ProviderConfiguration from config file. All other provider specific configs may be ignored.

๐Ÿ”ง Resolving Tanzu Kubernetes Release (TKR)
   projects.registry.vmware.com/tce/tkr:v0.17.0
   TKR exists at /Users/jomorales/.config/tanzu/tkg/unmanaged/bom/projects.registry.vmware.com_tce_tkr_v0.17.0
   Rendered Config: /Users/jomorales/.config/tanzu/tkg/unmanaged/test/config.yaml
   Bootstrap Logs: /Users/jomorales/.config/tanzu/tkg/unmanaged/test/bootstrap.log

๐Ÿ”ง Processing Tanzu Kubernetes Release

๐ŸŽจ Selected base image
   projects.registry.vmware.com/tce/kind:v1.22.4

๐Ÿ“ฆ Selected core package repository
   projects.registry.vmware.com/tce/repo-10:0.10.0

๐Ÿ“ฆ Selected additional package repositories
   projects.registry.vmware.com/tce/main:v0.11.0

๐Ÿ“ฆ Selected kapp-controller image bundle
   projects.registry.vmware.com/tce/kapp-controller-multi-pkg:v0.30.1

๐Ÿš€ Creating cluster tce-dd
   Cluster creation using kind!
   โค๏ธ  Checkout this awesome project at https://kind.sigs.k8s.io
   Base image downloaded
   Cluster created
   To troubleshoot, use:

   kubectl ${COMMAND} --kubeconfig /Users/jomorales/.config/tanzu/tkg/unmanaged/test/kube.conf

๐Ÿ“ง Installing kapp-controller
   kapp-controller status: Running

๐Ÿ“ง Installing package repositories
   Core package repo status: Reconcile succeeded

๐ŸŒ Installing CNI
   calico.community.tanzu.vmware.com:3.22.1

โœ… Cluster created

๐ŸŽฎ kubectl context set to tce-dd

View available packages:
   tanzu package available list
View running pods:
   kubectl get po -A
Delete this cluster:
   tanzu unmanaged delete test

$ docker ps
8ed5180d437d   kindest/node:v1.23.4                    "/usr/local/bin/entrโ€ฆ"   2 minutes ago   Up 2 minutes   127.0.0.1:80->80/tcp, 127.0.0.1:443->443/tcp, 127.0.0.1:59191->6443/tcp   test-control-plane
joshrosso commented 2 years ago

When creating a cluster with a config file, you need to specify the image to be used for the node; otherwise it'll use kindest/node:v1.23.4. The image comes from the TKR, and a user would need to know how to extract that information from the TKR manually to use it in the kind config. Also, if the TKR is incremented but the user still uses a config file, there are two options (depending on the implementation): either the code looks for a newer TKR even when one is pinned to a specific version, in which case the node image in the config will never be updated, or the update check is bypassed so that the user-specified TKR, and with it the node image, is preserved.

This PR would not solve this.

The reason is that you have rawKindConfig specified. When the kind provider sees rawKindConfig specified, it feeds that into kind directly without modifying it via any other settings. So, since image is not specified in your node configs, it is using kind's default.

A case could be made that there should be a "merge" of settings here, but we'd need a separate issue to track and assess that.
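
For reference, a minimal sketch of the pass-through behaviour described above; the package, type, and function names are illustrative, not the actual unmanaged-cluster kind provider code:

package provider

// ClusterConfig mirrors only the fields relevant here; names are illustrative.
type ClusterConfig struct {
    NodeImage     string // image resolved from the TKR
    RawKindConfig string // user-supplied kind config
}

// kindConfig reflects the current behaviour: when RawKindConfig is set it is
// handed to kind verbatim, so a TKR-resolved NodeImage is never injected and
// kind falls back to its default node image (kindest/node:...). A "merge"
// approach would instead parse the raw YAML and fill in image on any node
// entry that omits it.
func kindConfig(c ClusterConfig) string {
    if c.RawKindConfig != "" {
        return c.RawKindConfig // other settings, including NodeImage, are ignored
    }
    return "kind: Cluster\n" +
        "apiVersion: kind.x-k8s.io/v1alpha4\n" +
        "nodes:\n" +
        "- role: control-plane\n" +
        "  image: " + c.NodeImage + "\n"
}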

Thank you!

jorgemoralespou commented 2 years ago

@joshrosso should I open that issue?

jorgemoralespou commented 2 years ago

https://github.com/vmware-tanzu/community-edition/issues/3755