pipe-cd / pipecd


When deploying the helm chart with kustomize, sometimes the helm chart version cannot be updated #5124

Open ffjlabo opened 1 month ago

ffjlabo commented 1 month ago

What happened:

In a PipeCD Application that uses Helm charts, an issue occurred where the deployed Helm chart version was not updated even though the version had been changed and deployed via PipeCD.

What you expected to happen:

How to reproduce it:

Environment:

ffjlabo commented 1 month ago

This incident was caused by a part of PipeCD that does not take kustomize's behavior into account. It mainly occurs when deploying a Helm chart using kustomize (< v5.3.0) with enable-helm. Since the phenomenon is flaky, it does not always reproduce.

Workaround

ffjlabo commented 1 month ago

Investigation

Overview

The premise is that during drift detection, the Git repository is cloned once and then pulled and reused. Additionally, a common manifest cache is shared by drift detection, plan preview, plan, and deploy, so if a wrong manifest is cached at some point, it can affect the other processes.

When you run kustomize build --enable-helm, a charts directory is created directly under the directory where the command is run, and the Helm chart is downloaded into it once. The build result is based on this local chart, but with kustomize before v5.3.0, once the charts directory exists it is not refreshed even if you change the Helm chart version in kustomization.yaml.

Because drift detection reuses the repository, it ends up loading manifests built from the old chart version. The loaded manifests are cached by commit hash, so if Plan or Deploy runs against the same commit hash, the wrong manifests are applied.
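To make the caching part concrete, here is a minimal, purely illustrative Go sketch of the flow described above. None of these names (manifestCache, loadManifests, buildWithKustomize) exist in piped; they only mirror the explanation: the reused working copy builds from the stale charts dir, and the result is then cached under the commit hash that Plan/Deploy will later look up.

package main

import (
	"fmt"
)

// manifestCache maps a commit hash to the manifests built for it.
// Hypothetical illustration only, not the actual piped cache.
var manifestCache = map[string][]string{}

// buildWithKustomize stands in for running `kustomize build --enable-helm`
// inside the reused drift-detection working copy. With kustomize < v5.3.0 the
// existing charts/ dir is reused, so the result reflects the OLD chart version.
func buildWithKustomize(workdir string) []string {
	return []string{fmt.Sprintf("manifests built in %s from charts/ (old chart 0.29.0)", workdir)}
}

// loadManifests returns cached manifests for a commit, building them once.
func loadManifests(commit, workdir string) []string {
	if m, ok := manifestCache[commit]; ok {
		return m // plan preview / plan / deploy all hit this entry
	}
	m := buildWithKustomize(workdir)
	manifestCache[commit] = m // the stale result is now cached by commit hash
	return m
}

func main() {
	// Drift detection runs first on the reused clone and poisons the cache.
	fmt.Println(loadManifests("abc123", "/path/to/reused/clone"))
	// A later Plan/Deploy for the same commit reuses the stale manifests.
	fmt.Println(loadManifests("abc123", "/path/to/deployment/workspace"))
}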

The situation to repro

We can reproduce it with the situation below.

Note: If executed in the order Plan -> drift detection, the problem does not occur.

Example

I reproduced it by using opentelemetry-operator. https://github.com/open-telemetry/opentelemetry-helm-charts/tree/main/charts/opentelemetry-operator

Helm chart version before updating ↓

# kustomization.yaml
~/works/ffjlabo/ffjlabo-dev/kubernetes/opentelemetry-operator [fujiwo]
% cat kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
labels:
  - pairs:
      app.kubernetes.io/owner: test
helmCharts:
  - name: opentelemetry-operator
    repo: https://open-telemetry.github.io/opentelemetry-helm-charts
    version: v0.29.0
    releaseName: opentelemetry-operator
    namespace: opentelemetry-operator-system
    includeCRDs: true

# manifest built for Deployment
~/works/ffjlabo/ffjlabo-dev/kubernetes/opentelemetry-operator [fujiwo]
% kustomize build --enable-helm . | grep -A 10 "kind: Deployment"
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: controller-manager
    app.kubernetes.io/instance: opentelemetry-operator
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: opentelemetry-operator
    app.kubernetes.io/owner: test
    app.kubernetes.io/version: 0.76.1
    helm.sh/chart: opentelemetry-operator-0.29.0
  name: opentelemetry-operator

# charts dir
~/works/ffjlabo/ffjlabo-dev/kubernetes/opentelemetry-operator [fujiwo]
% tree -L 2 .
.
├── app.pipecd.yaml
├── charts
│   ├── opentelemetry-operator
│   └── opentelemetry-operator-0.29.0.tgz
└── kustomization.yaml

4 directories, 2 files

After updating the Helm chart version in kustomization.yaml to v0.64.4, build again in the same directory ↓

# kustomization.yaml
~/works/ffjlabo/ffjlabo-dev/kubernetes/opentelemetry-operator [fujiwo]
% cat kustomization.yaml                                                
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
labels:
  - pairs:
      app.kubernetes.io/owner: ffjlabo
helmCharts:
  - name: opentelemetry-operator
    repo: https://open-telemetry.github.io/opentelemetry-helm-charts
    version: v0.64.4
    releaseName: opentelemetry-operator
    namespace: opentelemetry-operator-system
    includeCRDs: true
    valuesFile: ./values.yaml

# manifest built for Deployment
~/works/ffjlabo/ffjlabo-dev/kubernetes/opentelemetry-operator [fujiwo]
% kustomize build --enable-helm . | grep -A 10 "kind: Deployment"       
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: controller-manager
    app.kubernetes.io/instance: opentelemetry-operator
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: opentelemetry-operator
    app.kubernetes.io/owner: ffjlabo
    app.kubernetes.io/version: 0.76.1
    helm.sh/chart: opentelemetry-operator-0.29.0
  name: opentelemetry-operator

# charts dir
~/works/ffjlabo/ffjlabo-dev/kubernetes/opentelemetry-operator [fujiwo]
% tree -L 2 .                                                           
.
├── app.pipecd.yaml
├── charts
│   ├── opentelemetry-operator
│   └── opentelemetry-operator-0.29.0.tgz
├── kustomization.yaml
└── values.yaml

4 directories, 3 files

This is the behavior of kustomize < v5.3.0. The behavior has since changed: even if an older chart version exists in the charts dir, the build now uses the updated Helm chart. ref: https://github.com/kubernetes-sigs/kustomize/pull/5293
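As a manual check of the same point (and of what the fix discussed in the next comment automates), removing the charts dir before building forces kustomize to download the chart version currently declared in kustomization.yaml; the Deployment's helm.sh/chart label should then show opentelemetry-operator-0.64.4. These are sketched commands, not a captured session:

% rm -rf ./charts
% kustomize build --enable-helm . | grep "helm.sh/chart"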

ffjlabo commented 1 month ago

I think there are several possible solutions, mainly a fix to the drift detection process:

  1. clone the repo from the git provider for every drift detection
  2. select a kustomize version of v5.3.0 or later in app.pipecd.yaml
  3. remove the charts dir after every drift detection

Option 1 is not ideal because drift detection runs every minute. Option 2 is also not ideal because it is just a workaround; users who use an older kustomize version still cannot solve the problem.
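For completeness, option 2 as a user-side workaround would look roughly like the following in app.pipecd.yaml. The kustomizeVersion and kustomizeOptions fields reflect my understanding of the Kubernetes app input and the exact spelling may differ by piped version, so please check the docs before relying on this:

# app.pipecd.yaml (sketch, not verified)
apiVersion: pipecd.dev/v1beta1
kind: KubernetesApp
spec:
  input:
    kustomizeVersion: 5.3.0
    kustomizeOptions:
      enable-helm: ""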

So I will try solution 3.
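As a rough sketch of what solution 3 could look like on the piped side (the helper name and call site are hypothetical, not the actual implementation): after each drift-detection build, delete any charts directory under the application directory so the next kustomize build --enable-helm re-downloads the declared chart version.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// removeChartsDirs deletes every "charts" directory under appDir so that the
// next `kustomize build --enable-helm` re-downloads the chart version declared
// in kustomization.yaml. Hypothetical helper, not the actual piped code.
func removeChartsDirs(appDir string) error {
	return filepath.Walk(appDir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if info.IsDir() && info.Name() == "charts" {
			if err := os.RemoveAll(path); err != nil {
				return err
			}
			return filepath.SkipDir // the directory is gone, do not descend into it
		}
		return nil
	})
}

func main() {
	// Example: clean up after a drift-detection run for one application dir.
	if err := removeChartsDirs("/path/to/repo/kubernetes/opentelemetry-operator"); err != nil {
		fmt.Fprintln(os.Stderr, "failed to remove charts dirs:", err)
	}
}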