Closed hairmare closed 7 months ago
Hello @hairmare! Thank you for your proposal. We will look into this and come back to you when we can provide more information on this topic. Greetings, Moritz
I have since noticed that a bunch of other resources (mainly secrets) also have a helm-hook annotation and would like to also propose removing the annotation from those resources.
The main reason (in addition to the WaitForFirstConsumer issue) is that I'm deploying the chart using Argo CD (aka OpenShift GitOps), and Argo CD tries to remove hook resources from the previous sync before it starts the next sync. This makes it non-trivial to use the chart with Argo CD to roll out changes to an already deployed instance of central-services.
Hey @mtesseract, were you able to get any feedback from the team? Argo CD still wants to remove important parts of the deployment with every update, and I'd be happy to provide a fix. Thanks, Lucas
Hi Lucas,
sorry that I wasn't able to get back to you earlier. So, I've had a first look at this and would like to describe my thoughts around this. Please correct me if I got something wrong (please note that I haven't used Argo CD or Azure AKS before).
I have the feeling that we are dealing with two issues here -- is this right?
First would be the WaitForFirstConsumer setting and how the chart can be installed when WaitForFirstConsumer is activated for the involved PVC.
Second would be the problem with deleting Helm hooks during an Argo CD sync.
Right now I have only thought about the second.
So, I checked the Argo CD documentation in order to learn something about how Argo CD deals with hooks.
I found https://argo-cd.readthedocs.io/en/stable/user-guide/resource_hooks/#hook-deletion-policies and according to this page BeforeHookCreation is used as the deletion policy by default.
When looking at https://argo-cd.readthedocs.io/en/stable/user-guide/helm/#helm-hooks it seems to me that the hook-delete-policy simply carries over from the Helm hook to an Argo CD hook -- and if it is not present, it defaults to BeforeHookCreation, which would explain the behaviour you are observing. Correct so far?
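The carry-over behaviour described in those docs can be sketched as a tiny Python model (a hypothetical illustration of the documented default, not Argo CD's actual code):

```python
def effective_deletion_policy(annotations: dict) -> str:
    """Model of how Argo CD appears to resolve a hook's deletion policy:
    a Helm hook-delete-policy annotation carries over; if none is present,
    BeforeHookCreation is assumed (per the Argo CD docs).
    Hypothetical sketch, not Argo CD's actual implementation."""
    policy = annotations.get("helm.sh/hook-delete-policy")
    if policy is None:
        return "BeforeHookCreation"
    return policy

# A PVC annotated only as a hook, with no deletion policy:
print(effective_deletion_policy({"helm.sh/hook": "pre-install,pre-upgrade"}))
# -> BeforeHookCreation

# The stackrox chart sets an explicit policy:
print(effective_deletion_policy({"helm.sh/hook": "pre-install,pre-upgrade",
                                 "helm.sh/hook-delete-policy": "never"}))
# -> never
```

Under this model, any hook resource without an explicit deletion policy gets deleted before the next sync creates it anew, which matches the reported behaviour.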
I checked if we use hook-deletion-policies currently and the only such annotations I could find are of the form https://github.com/stackrox/stackrox/blob/master/image/templates/helm/stackrox-central/templates/01-central-11-pvc.yaml#L27, which is not converted correctly into Argo CD world.
In general: we must be very careful about changing the semantics of the hook resources, as current users might depend on the current behaviour. For example, a PV is expected not to be destroyed when a Helm chart is uninstalled. Secrets are another example of resources which we do not want to be deleted implicitly.
Now, I wonder what would happen if we keep the hook nature of the current hook resources and add a deletion policy of hook-failed -- would this prevent these resources from being deleted by Argo CD on a sync?
Thanks Moritz
@hairmare can you please comment on @mtesseract's response?
I'd like to follow-up on this one.
Regarding WaitForFirstConsumer: I failed to reproduce this in my experiments -- I wonder if this problem has been solved outside of StackRox somehow?
I am connected to an AKS cluster and I have this:
```
❯ kc get sc | rg '(^NAME|default)'
NAME                PROVISIONER          RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
default (default)   disk.csi.azure.com   Delete          WaitForFirstConsumer   true
```
My helm CLI is version v3.13.3.
I tried to install the stackrox-central-services Helm chart with helm install --wait. It took a while (until all pods were up and healthy), but it worked. A PVC got created:
```
❯ kc get pvc -A
NAMESPACE   NAME         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
stackrox    central-db   Bound    pvc-922dfbca-3273-4342-894b-5cf17d648eea   100Gi      RWO            default        6m
```
Just to verify, the PVC does have the Helm hooks associated:
```
❯ kc get -n stackrox pvc central-db -o yaml | yq .metadata.annotations | rg hook
helm.sh/hook: pre-install,pre-upgrade
helm.sh/hook-delete-policy: never
```
Can you confirm this or am I missing something here?
Thanks Moritz
Hi Moritz
Thanks for circling back and trying this again!
I tried again on a vanilla AKS and didn't get any further...
```
$ kubectl get sc | grep -P '(^NAME|default)'
NAME                PROVISIONER          RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
default (default)   disk.csi.azure.com   Delete          WaitForFirstConsumer   true                   3d13h

$ kubectl get pvc -A | grep -P '(^NAME|stackrox)'
NAMESPACE             NAME          STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
stackrox-issue-2482   stackrox-db   Pending                                      default        8m30s

$ kubectl get pvc stackrox-db -oyaml | yq .metadata.annotations | grep -P hook
helm.sh/hook: pre-install,pre-upgrade
helm.sh/hook-delete-policy: never

# in fact:
$ kubectl get-all -n stackrox-issue-2482
NAME                                NAMESPACE             AGE
configmap/kube-root-ca.crt          stackrox-issue-2482   21m
persistentvolumeclaim/stackrox-db   stackrox-issue-2482   12m
serviceaccount/default              stackrox-issue-2482   21m
secret/central-htpasswd             stackrox-issue-2482   12m
secret/central-tls                  stackrox-issue-2482   12m
secret/scanner-db-password          stackrox-issue-2482   12m
secret/scanner-db-tls               stackrox-issue-2482   12m
secret/scanner-tls                  stackrox-issue-2482   12m
secret/stackrox-generated-letxqh    stackrox-issue-2482   12m

$ kubectl events pvc stackrox-db
LAST SEEN              TYPE     REASON                 OBJECT                              MESSAGE
2m50s (x82 over 23m)   Normal   WaitForFirstConsumer   PersistentVolumeClaim/stackrox-db   waiting for first consumer to be created before binding
```
I don't think I'll be able to dig much deeper today but will investigate further. Some notes for now:
I'm still using Argo CD for the deployment (not plain Helm). Argo CD doesn't just call the helm binary, so it might be contributing to the issue. The difference in our PVC naming leads me to think we might want to compare values and versions... below is a dump of the test I did to get the above results.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-stackrox
  namespace: infra-argocd
spec:
  destination:
    namespace: stackrox-issue-2482
    server: https://kubernetes.default.svc
  project: test
  source:
    chart: stackrox-central-services
    helm:
      releaseName: stackrox-central-services
      values: |-
        allowNonstandardNamespace: true
        system:
          enablePodSecurityPolicies: false
    repoURL: https://raw.githubusercontent.com/stackrox/helm-charts/main/opensource/
    targetRevision: 74.9.0
```
As you can see, it's quite minimal. It also fails pretty early, so pods don't even get scheduled and images don't get pulled.
When I initially reported this issue there were some bugs in Argo CD that compounded its effects. Those have since been fixed, leading to a somewhat clearer(ish) picture.
As far as I can tell, using Helm on the CLI applies the resources in the order mandated by the hooks, and Helm waits for the API server to accept them before progressing to the next stage. OTOH Argo CD, being a fully fledged meta-operator, waits for the resources to become available before progressing to the next stage. In the case of WaitForFirstConsumer this never happens, so Argo CD won't ever progress past the hook stage.
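The difference between the two sync strategies can be sketched as a small Python model (names and states are illustrative only, not real Helm or Argo CD code):

```python
def helm_style_sync(hook_resources, workload_resources):
    """Helm-like behaviour: apply hook resources, wait only for the API
    server to ACCEPT them, then immediately proceed to the workloads."""
    applied = []
    for r in hook_resources:
        applied.append(r)          # accepted by the API server -> move on
    applied.extend(workload_resources)
    return applied

def argo_style_sync(hook_resources, workload_resources, is_healthy):
    """Argo CD-like behaviour: wait for each hook resource to become
    HEALTHY before proceeding.  A PVC bound via WaitForFirstConsumer never
    becomes Bound until a consumer pod exists -- but the consumer is only
    created in a later phase, so the sync never gets past the hooks."""
    applied = []
    for r in hook_resources:
        applied.append(r)
        if not is_healthy(r):
            return applied          # stuck: never reaches the workloads
    applied.extend(workload_resources)
    return applied

# The PVC stays Pending forever (no consumer pod exists yet):
pvc_is_bound = lambda r: r != "pvc/central-db"
hooks = ["pvc/central-db"]
workloads = ["deployment/central"]

print(helm_style_sync(hooks, workloads))                # both get applied
print(argo_style_sync(hooks, workloads, pvc_is_bound))  # only the PVC
```

In the model, the Helm-style sync applies both resources while the Argo-style sync deadlocks on the Pending PVC, mirroring the behaviour described above.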
I'm not sure this can be fundamentally addressed in Argo CD without breaking the promises it makes to users about how hooks work, at least when it comes to subtleties like this one (which isn't as well defined in Helm as would be preferable).
I'm also failing to understand what purpose the hooks actually serve. One of the core tasks of Kubernetes's scheduler is to assign workloads to a node, using the CSI API to figure out where and when a workload can run. The hooks currently preempt the scheduler from reconciling this as it happens, making it wait for a resolution that never arrives.
That said, although I can't think of a scenario where the hook annotation isn't a subtle bug, and removing it seems like the lowest-risk change, I can understand that your risk posture probably differs from mine. Since external DB support was released, being able to deploy StackRox this way using Argo CD has basically been relegated to a lab requirement.
Anyway, our application is progressing well, and since this issue is very far from my critical path I tend towards considering it fixed, in the sense that there are simple and established workarounds :)
Cheers, Lucas
Btw, found an issue about this for ArgoCD: https://github.com/argoproj/argo-cd/issues/12840
Hi @hairmare, I have a few comments and questions, some of which might help, but they might also be merely related, depending on your use case. Feel free to ignore them as you see fit.
Argo CD is trying to remove hook resources from the previous sync before it starts the next sync
It seems to me that ArgoCD is not obeying the hook-delete-policy: never annotation? Perhaps this is a bug in ArgoCD?
targetRevision: 74.9.0 suggests that you're deploying an (almost?) end-of-life version of StackRox, i.e. 3.74.9. Perhaps it would make sense to look into upgrading to the 4.x version? Although I'm pretty sure this particular issue is also present there, so this is more of a general remark.

As I understand it, the issue for @hairmare is somewhat solved, since he wrote:
Anyways, our application is progressing well and the fact that this issue is very far from my critical path, leads me to tend towards considering it fixed in a sense of there being simple and established workarounds :)
But I was able to verify that the combination still breaks: AKS + ArgoCD + StackRox Helm Chart => deadlocked.
One can probably use the mentioned workaround, or even plug kustomize into ArgoCD; I've seen that somewhere -- here it is: https://github.com/argoproj/argocd-example-apps/blob/master/plugins/kustomized-helm/README.md
Furthermore, we do have some evidence for this issue not being present when using the stackrox operator in ArgoCD -- this is what we are doing internally.
@porridge, not sure what you are referring to here:
Funny (or sad) how the bug which @mclasmeier found also mentions stackrox.
I mean that the person who filed the bug you found also encountered it when trying to deploy stackrox.
Ah, I didn't notice stackrox being mentioned in the attached screenshot.
By the way, there is even another issue for this here: https://github.com/stackrox/stackrox/issues/5191
@mclasmeier thanks for all the pointers, i have some reading to do
The issue isn't solved as such, but it has a low priority for me because I can circumvent it by using an external DB or by using the workaround from this comment. I would appreciate it working without either, but my case foresees using a DBaaS anyway.
@hairmare, I might have good news for you.
A change was merged to master which allows users of the Helm charts to overwrite these annotations. For example:

```yaml
customize:
  other:
    persistentvolumeclaim/central-db:
      annotations:
        helm.sh/hook: null
        helm.sh/hook-delete-policy: null
        helm.sh/resource-policy: null
```
This causes the Helm annotations for the central-db PVC to be deleted. Caveat: when using this kind of flexibility to modify the manifests during rendering, you had better know exactly what you are doing -- particularly in terms of lifecycle management. This is a powerful mechanism, but I'm sure it also allows shooting yourself in the foot.
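The null-removes-key semantics of this customization can be sketched in Python (a hypothetical model of the documented behaviour, not the chart's actual template code):

```python
def apply_customization(annotations: dict, overrides: dict) -> dict:
    """Model of the chart's `customize` annotation merge: an override
    value of None (YAML null) removes the key, any other value sets it.
    Sketch of the documented behaviour, not the real template logic."""
    result = dict(annotations)
    for key, value in overrides.items():
        if value is None:
            result.pop(key, None)   # null -> delete the annotation
        else:
            result[key] = value     # anything else -> overwrite/add
    return result

pvc = {
    "helm.sh/hook": "pre-install,pre-upgrade",
    "helm.sh/hook-delete-policy": "never",
    "helm.sh/resource-policy": "keep",
    "owner": "stackrox",
}
overrides = {
    "helm.sh/hook": None,
    "helm.sh/hook-delete-policy": None,
    "helm.sh/resource-policy": None,
}
print(apply_customization(pvc, overrides))  # only 'owner' remains
```

Applied to the central-db PVC annotations, the three hook-related keys disappear while unrelated annotations survive.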
I have tried this with argocd on an AKS cluster (whose default StorageClass uses WaitForFirstConsumer) and it seems to work.
In case you need some help for trying it out, feel free to ping me.
The same mechanism can also be used for changing the secret annotations.
That being said, I think we can close this issue, do you agree?
Thank you!
Sorry to chime in, but as I am hit by the same issue I would like to use this.
Does this just go into the values.yaml? Is the customize.other path still correct? It is mentioned nowhere in the values.yaml, and it does not seem to work properly in my tests.
Thanks in advance!
OK, the problem seems to be that the annotations are using quoted keys.
Using the following snippet ...
```yaml
customize:
  other:
    persistentvolumeclaim/central-db:
      annotations:
        helm.sh/hook: XXX
        helm.sh/hook-delete-policy: XXX
        helm.sh/resource-policy: XXX
```
when running a local helm template I see that my annotations are added in addition to the ones already present:
```yaml
# Source: stackrox-central-services/charts/stackrox-central-services/templates/01-central-11-db-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: central-db
  namespace: stackrox
  labels:
    app.kubernetes.io/component: central
    app.kubernetes.io/instance: stackrox-central-services
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: stackrox
    app.kubernetes.io/part-of: stackrox-central-services
    app.kubernetes.io/version: 4.4.4
    helm.sh/chart: stackrox-central-services-400.4.4
  annotations:
    email: support@stackrox.com
    helm.sh/hook: XXX
    helm.sh/hook-delete-policy: XXX
    helm.sh/resource-policy: XXX
    meta.helm.sh/release-name: stackrox-central-services
    meta.helm.sh/release-namespace: stackrox
    owner: stackrox
    "helm.sh/hook": "pre-install,pre-upgrade"
    "helm.sh/resource-policy": keep
    "helm.sh/hook-delete-policy": never
```
Notice that both helm.sh/hook and "helm.sh/hook" are set.
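The suspected mismatch can be illustrated with a tiny Python sketch (illustrative only, not actual chart or ArgoCD code): if an override key keeps its literal YAML quotes somewhere along the pipeline, it no longer matches the unquoted key the template emits, so the override is added instead of replacing the original annotation.

```python
def merge_annotations(existing: dict, overrides: dict) -> dict:
    """Plain string-keyed merge: keys only collide if they match exactly."""
    merged = dict(existing)
    merged.update(overrides)
    return merged

existing = {"helm.sh/hook": "pre-install,pre-upgrade"}

# Override key carrying literal quote characters -- no collision, so both
# annotations end up present (as in the rendered output above):
broken = merge_annotations(existing, {'"helm.sh/hook"': "XXX"})
print(len(broken))   # 2 keys

# Stripping the quotes restores the expected single, overwritten key:
fixed = merge_annotations(existing, {'"helm.sh/hook"'.strip('"'): "XXX"})
print(len(fixed))    # 1 key
```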
I'll try to overwrite the quoted annotations, my first tries did not work...
@kastl-ars Just to verify: which version of the Helm chart are you using?
400.4.4
The feature you need for this has only landed in the 4.5.x series.
OK, running helm template with 400.5.0 I no longer see any PVC being created at all. I'll try and report back...
No idea what went wrong, but now I see the PVC. And it still has the "wrong" annotations that I would like to get rid of...
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: central-db
  namespace: stackrox
  labels:
    app.kubernetes.io/component: central
    app.kubernetes.io/instance: stackrox-central-services
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: stackrox
    app.kubernetes.io/part-of: stackrox-central-services
    app.kubernetes.io/version: 4.5.0
    helm.sh/chart: stackrox-central-services-400.5.0
  annotations:
    email: support@stackrox.com
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: never
    helm.sh/resource-policy: keep
    meta.helm.sh/release-name: stackrox-central-services
    meta.helm.sh/release-namespace: stackrox
    owner: stackrox
```
Please provide the argoCD Application.
I cannot, as it is being created by a rather complex ApplicationSet due to historic reasons.
We have a Chart.yaml that has the stackrox-central-services chart as a dependency (and does nothing else).
```yaml
apiVersion: v2
name: some-name
version: 1.0.1
dependencies:
  - name: stackrox-central-services
    repository: https://raw.githubusercontent.com/stackrox/helm-charts/main/opensource/
    version: 400.5.0
```
And the corresponding values.yaml has all stackrox-central-services-related settings beneath a top-level key of the same name, so they apply to the dependency chart, not the actual (empty) chart:
```yaml
stackrox-central-services:
  allowNonstandardReleaseName: true
  customize:
    other:
      persistentvolumeclaim/central-db:
        annotations:
          "helm.sh/hook": null
          "helm.sh/hook-delete-policy": null
          "helm.sh/resource-policy": null
  env:
    openshift: 4
    istio: false
    platform: "default"
    offlineMode: true
  system:
    enablePodSecurityPolicies: false
  meta:
    useLookup: false
  central:
    telemetry:
      enabled: false
    persistence:
      none: true
    adminPassword:
      htpasswd: "htpasswd-bcrypt-hash-goes-here"
    exposure:
      route:
        enabled: true
    resources:
      requests:
        memory: 1Gi
        cpu: 1
      limits:
        memory: 4Gi
        cpu: 1
    db:
      resources:
        requests:
          memory: 1Gi
          cpu: 500m
        limits:
          memory: 4Gi
          cpu: 1
  scanner:
    autoscaling:
      disable: true
    replicas: 1
    resources:
      requests:
        memory: 500Mi
        cpu: 500m
      limits:
        memory: 2500Mi
        cpu: 2000m
```
Complicated, but used pretty often in ArgoCD setups.
I hope this helps.
(I have since removed the meta.useLookup setting that I tried before, but no change.)
OK, I just tested a different approach: I created a (trimmed-down) values.yaml file and called helm locally, i.e. using the chart directly (not as a dependency) and without ArgoCD.
```yaml
---
customize:
  other:
    persistentvolumeclaim/central-db:
      annotations:
        "helm.sh/hook": null
        "helm.sh/hook-delete-policy": null
        "helm.sh/resource-policy": null
env:
  openshift: 4
  istio: false
  platform: "default"
  offlineMode: true
system:
  enablePodSecurityPolicies: false
central:
  telemetry:
    enabled: false
  persistence:
    none: true
  adminPassword:
    htpasswd: 'bcrypt_hash'
```

```
helm template stackrox-central-services stackrox/stackrox-central-services -n stackrox -f values.yaml | tee helm-template.yml
```
The resulting file helm-template.yml does NOT contain the helm.sh/hook annotations, so overwriting/removing the annotations works there.
```yaml
# Source: stackrox-central-services/templates/01-central-11-db-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: central-db
  namespace: stackrox
  labels:
    app.kubernetes.io/component: central
    app.kubernetes.io/instance: stackrox-central-services
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: stackrox
    app.kubernetes.io/part-of: stackrox-central-services
    app.kubernetes.io/version: 4.5.0
    helm.sh/chart: stackrox-central-services-400.5.0
  annotations:
    email: support@stackrox.com
    meta.helm.sh/release-name: stackrox-central-services
    meta.helm.sh/release-namespace: stackrox
    owner: stackrox
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
```
So I guess this is a combination of ArgoCD removing the quotes or some other issue in the ArgoCD process.
P.S. Since helm no longer recommends/requires quoting these annotations, it might be a good idea to remove the quotes. But that might break existing installations, hooray... ;-(
For the time being I manually created the PVC and told the chart to use it:
```yaml
central:
  db:
    persistence:
      persistentVolumeClaim:
        claimName: my-own-pvc
        createClaim: false
```
I started a discussion on the ArgoCD repository: https://github.com/argoproj/argo-cd/discussions/19254
Ok, so I was able to conduct some experiments. I have created an OpenShift 4.16.3 cluster. I installed the Red Hat OpenShift GitOps v1.13.0 operator on it. I created an ArgoCD instance. I applied the following manifest:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: stackrox-central-services
  namespace: openshift-gitops
  labels:
    name: stackrox-central-services
spec:
  project: default
  source:
    repoURL: "https://github.com/stackrox/helm-charts.git"
    targetRevision: HEAD
    path: "4.5.0/central-services"
    helm:
      version: v3
      valuesObject:
        allowNonstandardNamespace: true
        monitoring:
          openshift:
            enabled: false
        system:
          enablePodSecurityPolicies: false
        imagePullSecrets:
          allowNone: true
        central:
          persistence:
            none: true
        customize:
          other:
            persistentvolumeclaim/central-db:
              annotations:
                "helm.sh/hook": null
                "helm.sh/hook-delete-policy": null
                "helm.sh/resource-policy": null
  destination:
    server: "https://kubernetes.default.svc"
    namespace: stackrox
```
I triggered a sync using:
```
$ argocd app sync openshift-gitops/stackrox-central-services
```
(To make this work properly I also had to execute oc label ns stackrox "argocd.argoproj.io/managed-by=openshift-gitops" as documented.)
This caused StackRox to be deployed on the cluster, and I verified that the Helm hook annotations on the central-db PVC are gone.
Furthermore, I was able to verify that it doesn't matter whether I quote the annotation keys in the provided values or not: in both cases the hook annotations are absent. And if I remove the customization entirely, the hook annotations are present.
The ArgoCD that has been deployed on the cluster is v2.11.3+3f344d5.
So, I cannot reproduce the issue in this specific environment. :-/
Thanks for checking. Then I think this might be yet another issue where helm does things differently depending on whether values are for a chart or for a sub-chart/dependency... :-(
As I was able to work around this by using an existing PVC, I would say we can stop right here. In general it works: the chart provides a way to remove the annotations, and it works both locally with helm and directly in ArgoCD. That should be good enough...
Thanks for your help, @mclasmeier !
Alright! Thanks for your input on these matters. Would be nice to have a smooth user experience on ArgoCD.
The stackrox/stackrox-central-services:70.1.0 Helm chart creates a PVC with a helm.sh/hook: pre-install,pre-upgrade annotation. On clusters that have a StorageClass configured with WaitForFirstConsumer as the volume binding mode, the hook will never finish, because the PV that underpins the PVC is only created once a Deployment tries to mount it.

One example where this happens is when using the default StorageClass on an Azure AKS cluster.

My proposal to fix this would be to remove the helm hook annotation on the PVC. Let me know if that sounds OK and I'll prepare a PR to that effect.