csantanapr commented 3 months ago

GitOps Bridge Enhancement Proposal

Background

The GitOps Bridge is a project that collects examples and patterns demonstrating how to use ArgoCD Application Sets, particularly when working with Cloud Provider Managed Kubernetes. Based on feedback from field engagements, several improvements have been identified for the next major version.

Objectives

Simplify addon management
Improve version control and testing
Enhance compatibility and reduce maintenance overhead
Streamline single cluster deployments

Proposed Improvements

1. Implement Addon Stacks

Introduce the concept of "stacks" for deploying addons
A stack defines tested versions of addons that work together for specific Kubernetes versions
Users can select which addons to install from a stack
Stacks define versions, not which addons get installed

2. Version Promotion via Stacks

Promote addon versions as part of stacks
Clusters adopt a stack version from the platform team
Single addon updates require testing and approval within a new stack version

3. Centralized Version Management

Consolidate addon/helm chart versions into a single file
Benefits:
- Easier automation of addon upgrades
- Simplified creation of new stacks as helm charts are released
- Streamlined process for pushing all addons (helm and images) to private/airgapped OCI or HTTP helm repositories

4. Reduce ApplicationSet Duplication

Create a helm chart to:
1. Select addons to install (enable/[true|false])
2. Improve maintenance of ApplicationSets
Reduces the need to maintain multiple similar ApplicationSet files

5. CNOE Compatibility

Ensure compatibility with the CNOE IDPbuilder tool and project

6. Single Cluster Deployment Improvements

Address issues with deploying all ApplicationSets in single cluster scenarios:
- Reduce noise when there are no spokes for optional addon deployment
- Remove cluster selector using the label enable_$addon_name
- Control addon deployment by installing or not installing the ApplicationSet

7. Applications from ApplicationSet Dependency (order/sync-wave)

Address issues with an application depending on another application from a different application set
Implement annotation on Application depends_on: "otherapp1,otherapp2"
- validation webhook will check that dependencies are healthy before allowing to continue

8. Deploying resources before and after the addon

Address issues of deploying resources after addon is deployed, like kyverno policies or karpenter nodeclass or nodepool

9. Complete Airgapped environments

Show how to configure ArgoCD to fetch helm charts and corresponding images from private OCI repository (ie. ECR)
Show how to build a CICD pipeline that replicates addons from public to private OCI registry

10. Cross Accounts configurations (AWS Specififc)

Show examples on how to use EKS Pod Identity when using ArgoCD from Central Account to manage Clusters located in different accounts

11. Additional examples for other IaC tools

Add additional IaC tools like CAPI/Crossplane/ACK examples, currently we only have examples for two Terraform and Pulumi

Feedback and Discussion

Please comment on this issue with your thoughts, suggestions, or concerns regarding these proposed improvements. We welcome all feedback to ensure the GitOps Bridge v2 meets the needs of our users and aligns with best practices in GitOps and Kubernetes management.

csantanapr commented 3 months ago

Initial POC here https://github.com/csantanapr/cnoe-examples/tree/main/examples/gitops-bridge

blakeromano commented 2 months ago

Based on POC I think there's some definite wins with the improvements made here but I have maybe a few foods for thought or concerns that I think we should think about...

Extensibility of Addons

When dealing with external charts there is usually a need I have found to potentially add additional resources to your chart, for example things like HTTPRoutes if you are using the Gateway API (which a lot of helm charts haven't added support for that or specifying additionalResources like Argo charts allow.

I am struggling here in this current implementation to see how I could point to a locally managed helm chart and say "add this additional helm chart to my addon". Which I have found to be a common need where I may not want to add an additional X number of applications just to create an HTTPRoute resource.

Other General Comments/Questions

Control addon deployment by installing or not installing the ApplicationSet

I think this makes sense to single cluster deployments I also just want to be cognizant of maybe the overhead of another file to manage to enable even initial install of an addon to a cluster can be.

Address issues with an application depending on another application from a different application set

I wonder why we can't just solve the dependency problem by just making addons that rely on each other apart of the same application then managing using ArgoCD sync waves. For example Karpenter, you have helm chart go first then you can add a source for your provisioner that has an annotation to go on a later sync wave.

Final Note

To me I think the biggest problem with the proposal is we still have a very 1:1 mapping of applicationset to a single helm source where I wonder why we wouldn't want to consolidate the number of applications we ultimately make by saying "I want to deploy a helm chart for Karpenter, and my provisioner helm chart, and my CNI all in one Applicationset". And then we have the Application be a "Stack" that can be redeployed with slightly different config to X number of clusters.

csantanapr commented 2 months ago

Thanks for the feedback @blakeromano

Yes I'm looking into adding additional resources to an addon like kubernetes nodeclass.

I intend to add this under "8. Deploying resources before and after the addon"

nabuskey commented 2 months ago

Implement annotation on Application depends_on: "otherapp1,otherapp2"

validation webhook will check that dependencies are healthy before allowing to continue

Is this something you are implementing for yourself or part of Argo project? For my use cases, I was going to use sync-wave with app of application sets patterns but it seemed a bit limiting. I try to use established tools for dependency management stuff because it can get messy quick. Is there a problem with the app of application set pattern here?

Complete Airgapped environments

I'd love to see this for sure!

I am struggling here in this current implementation to see how I could point to a locally managed helm chart and say "add this additional helm chart to my addon".

I tend to agree with this. Something simple like a small script to push to a repo might be ok for the first implementation. But something a bit more guided / frame worked approach might be needed in the future.

csantanapr commented 2 months ago

@nabuskey I want to implement an experimental validation webhook to delay the entry of an app until dependencies are met.

ArgoCD upstream is working on a ondependsOn feature and here is the PR https://github.com/argoproj/argo-cd/pull/15280

From @crenshaw-dev review https://github.com/argoproj/argo-cd/pull/15280#pullrequestreview-1642430573

I'm generally uncomfortable adding another ordering mechanism to Argo CD. I think that we risk moving folks away from "eventual consistency" models of deployment in k8s back to an imperative CI-type model, just built in Argo CD instead of in Jenkins. I think if we push forward with this feature, we should write documentation outlining use cases where this tool makes sense as well as a list of alternatives which could provide a better experience (controllers that enforce ordering, validating webhooks that block deployment until prerequisites are met, readiness checks which prevent apps from spinning up too soon, etc.).

I want to implement the validating webhooks that block deployment until prerequisites are met If the feature covers my the gitops-bridge corner cases where we must define a dependency, then I can replace my experimental feature and replace with upstream solution.

nabuskey commented 2 months ago

What Michael said makes a lot of sense to me. This is something Crossplane project has struggled with for a long time as well. Often, the most significant use case for defined order is resource deletion. Creations are usually fine because of the reconciliation loop. For this project, can you not get away with eventual consistency?

Because I think adding dependency management stuff can become very messy, I must ask. Is the validation webhook correct place for this kind of thing? Is the pattern of having all orders in ArgoCD something we want to promote? You can swap out dependency management implementation, but once you introduce something like this, it's difficult to take back.

csantanapr commented 2 months ago

I agree with @nabuskey that ensuring eventual consistency and configuring ArgoCD apps to retry should be the best practice. This means that using the sync wave should be avoided at all costs. The goal of the GitOps bridge is to help users learn about GitOps best practices while working with a specific list of popular addons.

The framework will not come with a dependency graph out of the box, but it will provide a way to implement dependencies using different strategies. This will be especially useful in situations where it's impossible to use eventual consistency when bootstrapping a new cluster.

We will document and provide examples of these strategies. I agree with you that we don't want to promote using ordering in ArgoCD. However, we can't ignore that users face problems, and they might think that the only solution is ordering. We need to document and provide examples of these corner cases and explore if they can be solved using eventual consistency or if one of the strategies needs to be used.

dtzar commented 2 months ago

One major complaint I have about GitOps bridge repo structure is that there is no workflow accounted for provisioning infrastructure itself using a Kubernetes control plane (e.g. Crossplane or Cluster API). It predominantly assumes you will manually provision all clusters with Terraform which doesn't really scale well. It also makes the folder structure confusing. For instance "clusters" is used to not provision new clusters for a team, but what gets installed on specific clusters overriding environment.

csantanapr commented 2 months ago

Good point @dtzar I agree the project is missing the IaC examples using other tools other than terraform.

I agree we should add this as part of the enhancement proposal, I created the folder structure as place holders thinking on building them.

Which IaC would you prioritized?

I’m thinking on CAPI/CAPA to implement first

dtzar commented 2 months ago

Our implementation went public yesterday: https://github.com/Azure-Samples/aks-platform-engineering

It might help to have a live chat/sync on this subject

csantanapr commented 2 months ago

Nice @dtzar ! yes lets connect on Slack

BENMALEKarim commented 2 months ago

The enhancement proposal makes sense to me, particularly the idea of having an addon stack related to the Kubernetes version and a single Helm chart to facilitate addon management.

I have started working on a single ApplicationSet Helm chart. Here is a brief summary of the concept I want to establish:

Single ApplicationSet Template: A unified ApplicationSet template that will generate all ApplicationSets for all enabled addons.
Versioned Chart for Kubernetes Versions: Each chart version corresponds to a specific Kubernetes version and sets specific versions for the addons. For example:
- gitops-addons chart v1.30 will use aws_load_balancer_controller v1.8.1.
- gitops-addons chart v1.29 will use aws_load_balancer_controller v1.7.1.
Override Values: These values, and more, can be overridden per cluster or per environment using values files (per tenant/cluster/environment).
Direct Chart Usage: Users of the gitops-addons chart will use it directly from the chart repository without needing to clone and integrate it into their own repository.
Values File Format for Addons: The addons chart will include a values file formatted as follows:

addons:
  aws_load_balancer_controller:
    enabled: false
    releaseName: aws-load-balancer-controller
    chart: aws-load-balancer-controller
    repoUrl: https://aws.github.io/eks-charts
    targetRevision: "1.8.1"
    namespace: kube-system
    values:
      vpcId: '{{.metadata.annotations.aws_vpc_id}}'
      clusterName: '{{.metadata.annotations.aws_cluster_name}}'
      serviceAccount:
        name: '{{.metadata.annotations.aws_load_balancer_controller_service_account}}'
        annotations:
          eks.amazonaws.com/role-arn: '{{.metadata.annotations.aws_load_balancer_controller_iam_role_arn}}'
  external_secrets:
    enabled: false
    releaseName: external-secrets
    namespace: external-secrets
    chart: external-secrets
    repoUrl: https://charts.external-secrets.io
    targetRevision: "0.10.0"
    values:
      serviceAccount:
        name: '{{.metadata.annotations.external_secrets_service_account}}'
        annotations:
          eks.amazonaws.com/role-arn: '{{.metadata.annotations.external_secrets_iam_role_arn}}'

  aws_ingress_nginx:
    enabled: false
    releaseName: ingress-nginx
    namespace: ingress-nginx
    chart: ingress-nginx
    repoUrl: https://kubernetes.github.io/ingress-nginx
    targetRevision: "4.11.1"
    values:
      controller:
        service:
          type: LoadBalancer
          annotations:
            service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
            service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
          loadBalancerClass: service.k8s.aws/nlb

Template ApplicationSet Format: The template ApplicationSet will be structured similarly:

{{- range $name, $addon := .Values.addons }}
{{- if $addon.enabled }}
{{- $nameNormalize := printf "%s" $name | replace "_" "-" | trunc 63 | trimSuffix "-"  -}}
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: {{ $nameNormalize }}
  namespace: argocd
spec:
  goTemplate: true
  goTemplateOptions: ["missingkey=error"]
  generators:
  - clusters:
      selector:
        matchExpressions:
          - key: enable_{{ $name }}
            operator: In
            values: ['true']
  template:
    metadata:
      name: addon-{{ $nameNormalize }}
      namespace: argocd
    spec:
      destination:
        server: "https://kubernetes.default.svc" 
        namespace: {{ $addon.namespace }}
      project: default
      source:
        repoURL: {{ $addon.repoUrl }}
        chart: {{ $addon.chart }}
        targetRevision: {{ $addon.targetRevision }}
        helm:
          releaseName: {{ $addon.releaseName }}
          {{- with $addon.values }}
          valuesObject:
            {{- toYaml . | nindent 12 }}
          {{- end }}
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
        - CreateNamespace=true
---
{{- end }}
{{- end }}

Global ApplicationSet Format: The global ApplicationSet that will call our addons chart will be formatted as:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: gitops-addons-appset
  namespace: argocd
spec:
  goTemplate: true
  goTemplateOptions: ["missingkey=error"]
  generators:
  - clusters:
      selector:
        matchExpressions:
          - key: cluster_name
            operator: In
            values: [in-cluster]
  template:
    metadata:
      name: gitops-addons
    spec:
      destination:
        server: "https://kubernetes.default.svc" 
        namespace: argocd
      project: default
      source:
        repoURL: 'https://benmalekarim.github.io/helm-scenarios-charts/' 
        chart: 'gitops-addons'
        targetRevision: '1.30.5'
        helm:
          releaseName: gitops-addons
          valuesObject:
            addons:
              aws_load_balancer_controller:
                enabled: '{{ .metadata.labels.enable_aws_load_balancer_controller }}'
              external_secrets:
                enabled: '{{ .metadata.labels.enable_external_secrets }}'
              aws_ingress_nginx:
                enabled: '{{ .metadata.labels.enable_ingress_nginx }}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
        - CreateNamespace=true

By following these specifications, we will address the first four points of improvements effectively. I have a small example that reproduces these specifications. I am open to discussing and contributing further.

Please let me know your thoughts and any suggestions for additional scenarios or improvements.

csantanapr commented 1 month ago

@BENMALEKarim thanks for the proposal.

I think we can do better than having different chart versions per cluster version.

Here is a gitops-bridge helm chart that addresses the versions via overrides, its POC as we are testing in a workshop environment https://github.com/aws-samples/fleet-management-on-amazon-eks-workshop/tree/riv24/gitops/addons/charts/gitops-bridge

gitops-bridge-dev / gitops-bridge