feat: add ability to define deployment strategy

ericgraf commented 9 months ago

What type of PR is this?

Feature

Which issue does this PR fix:

What does this PR do / Why do we need it:

This PR introduces the ability to specify the upgrade strategy of the Zot deployment.

Strategy recreate needs to be set when trying to upgrade a single replica deployment with a PV defined.

In ReadWriteOnce accessMode pvcs the new pod will be stuck in creating until the pod with the pvc is manually deleted. In ReadWriteMany accessMode pvcs the new pod will go into a crash loop with the error operation timeout: boltdb file is already in use.

If an issue # is not available please add repro steps and logs showing the issue:

Run the below commands to repoduce the failure.

#!/usr/bin/env bash

kind create cluster
helm repo add project-zot http://zotregistry.dev/helm-charts
helm repo update

# Install with an older zot image so it can be upgraded, also create a pvc.
helm install zot project-zot/zot \
 --set persistence=true \
 --set pvc.create=true \
 --set image.tag=v2.0.1-rc2

#Wait for pod/deployment to be ready
kubectl rollout status deployment zot
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=zot --timeout=120s

#Upgrade zot image to a new image to roll the k8s deployment.
helm upgrade zot project-zot/zot \
    --reuse-values \
    --set image.tag=v2.0.1

#New pod will get stuck in `CrashLoopBackOff ` state.
kubectl get pod

#Wait for pod to be ready
kubectl rollout status deployment zot
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=zot --timeout=120s

#cleanup
#kind delete cluster

New Pod is in a bad state

[egraf@gs-vm-1 zot (⎈|kind-kind:N/A)]$ k get pods
kind-kind
NAME                   READY   STATUS             RESTARTS      AGE
zot-64f666cfcb-d4w7k   1/1     Running            0             5m30s
zot-7977df978b-hjqdw   0/1     CrashLoopBackOff   3 (27s ago)   4m58s
[egraf@gs-vm-1 zot (⎈|kind-kind:N/A)]$ k get rs
kind-kind
NAME             DESIRED   CURRENT   READY   AGE
zot-64f666cfcb   1         1         1       5m33s
zot-7977df978b   1         1         0       5m1s

Crash looping pod error logs:

egraf@gs-vm-1:~$ k logs zot-7977df978b-thgfr
kind-kind
{"level":"info","url":"ghcr.io/aquasecurity/trivy-db","component":"config","goroutine":1,"caller":"zotregistry.io/zot/pkg/cli/server/root.go:585","time":"2024-01-30T20:10:42.450916242Z","message":"using default trivy-db download URL."}
{"level":"info","url":"ghcr.io/aquasecurity/trivy-java-db","component":"config","goroutine":1,"caller":"zotregistry.io/zot/pkg/cli/server/root.go:592","time":"2024-01-30T20:10:42.450951586Z","message":"using default trivy-java-db download URL."}
{"level":"warn","goroutine":1,"caller":"zotregistry.io/zot/pkg/cli/server/root.go:318","time":"2024-01-30T20:10:42.450977361Z","message":"mgmt extensions configuration option has been made redundant and will be ignored."}
{"level":"info","params":{"distSpecVersion":"1.1.0-dev","GoVersion":"go1.20.13","Commit":"v2.0.1-9def35f3b8b38f5a438146a09087edcec00537c7","ReleaseTag":"v2.0.1","BinaryType":"-imagetrust-lint-metrics-mgmt-profile-scrub-search-sync-ui-userprefs","Storage":{"RootDirectory":"/var/lib/registry","Dedupe":true,"RemoteCache":false,"GC":true,"Commit":false,"GCDelay":3600000000000,"GCInterval":3600000000000,"Retention":{"DryRun":false,"Delay":0,"Policies":null},"StorageDriver":null,"CacheDriver":null,"SubPaths":null},"HTTP":{"Address":"0.0.0.0","ExternalURL":"","Port":"5000","AllowOrigin":"","TLS":null,"Auth":{"FailDelay":0,"HTPasswd":{"Path":""},"LDAP":null,"Bearer":null,"OpenID":null,"APIKey":false},"AccessControl":null,"Realm":"","Ratelimit":null},"Log":{"Level":"debug","Output":"","Audit":""},"Extensions":{"Search":{"Enable":true,"CVE":{"UpdateInterval":7200000000000,"Trivy":{"DBRepository":"ghcr.io/aquasecurity/trivy-db","JavaDBRepository":"ghcr.io/aquasecurity/trivy-java-db"}}},"Sync":null,"Metrics":null,"Scrub":null,"Lint":null,"UI":{"Enable":true},"Mgmt":{"Enable":true},"APIKey":null,"Trust":null},"scheduler":null},"goroutine":1,"caller":"zotregistry.io/zot/pkg/api/controller.go:221","time":"2024-01-30T20:10:42.451353936Z","message":"configuration settings"}
{"level":"info","cpus":16,"max. open files":1048576,"listen backlog":"4096","max. inotify watches":"524288","goroutine":1,"caller":"zotregistry.io/zot/pkg/api/controller.go:90","time":"2024-01-30T20:10:42.451463363Z","message":"runtime params"}
{"level":"error","error":"operation timeout: boltdb file is already in use, path '/var/lib/registry/cache.db'","dbPath":"/var/lib/registry/cache.db","goroutine":1,"caller":"zotregistry.io/zot/pkg/storage/cache/boltdb.go:58","time":"2024-01-30T20:10:52.404668952Z","message":"failed to create cache db"}
{"level":"error","error":"operation timeout: boltdb file is already in use, path '/var/lib/registry/cache.db'","goroutine":1,"caller":"zotregistry.io/zot/pkg/cli/server/root.go:74","time":"2024-01-30T20:10:52.404773952Z","message":"failed to init controller"}
Error: operation timeout: boltdb file is already in use, path '/var/lib/registry/cache.db'
Usage:
  zot serve <config> [flags]

Aliases:
  serve, serve

Flags:
  -h, --help   help for serve

Testing done on this change:

The below commands were run with code from this PR. The deployment strategy is set to recreate in the commands which allowed the deployment to successfully scale down the old replicaset, then spins up the new replicaset.

#!/usr/bin/env bash

kind create cluster

# Install with an older zot image , pvc create , and recreate strategy set.
# this is run from the charts/zot directory
 helm install zot ./ \
 --set persistence=true \
 --set pvc.create=true \
 --set strategy.type=Recreate \
 --set strategy.rollingUpdate=null \
 --set image.tag=v2.0.1-rc2

#Wait for pod to be ready
kubectl rollout status deployment zot
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=zot --timeout=120s

#Upgrade zot image to a new image to roll the k8s deployment.
helm upgrade zot ./ \
    --reuse-values \
    --set image.tag=v2.0.1

kubectl get pod

#Wait for pods to be ready
kubectl rollout status deployment zot
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=zot --timeout=120s

# cleanup
#kind delete cluster

Final state after doing the upgrade test.

[egraf@gs-vm-1 zot (⎈|kind-kind:N/A)]$ k get pods
kind-kind
NAME                   READY   STATUS    RESTARTS   AGE
zot-7977df978b-nhltl   1/1     Running   0          16s
[egraf@gs-vm-1 zot (⎈|kind-kind:N/A)]$ k get rs
kind-kind
NAME             DESIRED   CURRENT   READY   AGE
zot-64f666cfcb   0         0         0       4m36s
zot-7977df978b   1         1         1       19s

Automation added to e2e:

Two new tests were added:

Upgrade tests of the helm chart this was added in the github actions workflow.
Deployment strategy test using the rolling option.

Will this break upgrades or downgrades? No. This change sets the values in values.yaml to the deployment strategy values in the values,yaml file.

Does this PR introduce any user-facing change?:

Add ability to specify the deployment upgrade strategy.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

ericgraf commented 9 months ago

For this change there are still two things I'm still not sure about.

Where is it best to documentation about this change and how to use it?
I didn't add any e2e helm unittest. Do you have an example of how I could that for this change?

rchincha commented 9 months ago

@ericgraf Thanks for the PR.

Where is it best to documentation about this change and how to use it?

Our documentation is generated from here: https://github.com/project-zot/project-zot.github.io Perhaps best to add an article here.

I didn't add any e2e helm unittest. Do you have an example of how I could that for this change?

Maybe take two versions, start with the first one and upgrade to the next one. However, note that upgrades are enforced to happen within the same major version. For example, 2.0.1 -> 3.0.5 is risky!

ericgraf commented 9 months ago

Thank you @rchincha . I moved this PR back to draft until I have those two things done.

ericgraf commented 9 months ago

@rchincha I added two tests in this PR. One that follows the current ct install workflow and another one that introduces a ct install --upgrade step in the github actions workflow. The second one is a bit different, so let me know what you think about doing it this way.

rchincha commented 9 months ago

@Andreea-Lupu pls review this as well.

@ericgraf thanks for updating the documentation as well. https://github.com/project-zot/project-zot.github.io/pull/158

project-zot / helm-charts

feat: add ability to define deployment strategy #38