stackrox / stackrox

The StackRox Kubernetes Security Platform performs a risk analysis of the container environment, delivers visibility and runtime alerts, and provides recommendations to proactively improve security by hardening the environment.
Apache License 2.0

PVC waiting for first consumer to be created before binding #2482

Closed · hairmare closed 7 months ago

hairmare commented 2 years ago

The stackrox/stackrox-central-services:70.1.0 Helm chart creates a PVC with a helm.sh/hook: pre-install,pre-upgrade annotation. On clusters that have a StorageClass configured with WaitForFirstConsumer as its volume binding mode, the hook will never finish, because the PV that backs the PVC is only created once a Deployment tries to mount it.

One example where this is happening is when using the default StorageClass on an Azure AKS cluster.
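
For illustration, such a StorageClass looks roughly like this (name and provisioner taken from the AKS default shown further down in this thread):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: default
provisioner: disk.csi.azure.com
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer  # the PV is provisioned only once a pod mounts the PVC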

My proposal to fix this would be to remove the helm hook annotation on the PVC. Let me know if that sounds OK and I'll prepare a PR to that effect.

mtesseract commented 2 years ago

Hello @hairmare! Thank you for your proposal. We will look into this and get back to you when we can provide more information on this topic. Greetings, Moritz

hairmare commented 2 years ago

I have since noticed that a bunch of other resources (mainly secrets) also carry a Helm hook annotation, and I would like to propose removing the annotation from those resources as well.

The main reason (in addition to the WaitForFirstConsumer issue) is that I'm deploying the chart using Argo CD (aka OpenShift GitOps), and Argo CD tries to remove hook resources from the previous sync before it starts the next sync. This makes it non-trivial to use the chart with Argo CD to roll out changes to an already deployed instance of central-services.

hairmare commented 2 years ago

Hey @mtesseract, were you able to get any feedback from the team? Argo CD still wants to remove important parts of the deployment with every update, and I'd be happy to provide a fix. Thanks, Lucas

mtesseract commented 2 years ago

Hi Lucas,

sorry that I wasn't able to get back to you earlier. I've had a first look at this and would like to describe my thoughts. Please correct me if I got something wrong (note that I haven't used Argo CD or Azure AKS before).

I have the feeling that we are dealing with two issues here -- is this right? The first is about the WaitForFirstConsumer setting and how the chart can be installed when WaitForFirstConsumer is active for the involved PVC.

The second is the problem of hook resources being deleted during an Argo CD sync.

Right now I have only thought about the second.

So, I checked the Argo CD documentation in order to learn how Argo CD deals with hooks. I found https://argo-cd.readthedocs.io/en/stable/user-guide/resource_hooks/#hook-deletion-policies and, according to that page, BeforeHookCreation is used as the deletion policy by default.

When looking at https://argo-cd.readthedocs.io/en/stable/user-guide/helm/#helm-hooks it seems to me that the hook-delete-policy simply carries over from the Helm hook to an Argo CD hook -- and if it is not present, then this would default to BeforeHookCreation, which would explain the behaviour you are observing. Correct so far?
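
For reference, per the linked documentation the Argo CD-native equivalent would be annotated roughly like this (Helm's pre-install maps to Argo CD's PreSync):

metadata:
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation  # Argo CD's default when no policy is set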

I checked whether we currently use hook-deletion-policies, and the only such annotations I could find are of the form https://github.com/stackrox/stackrox/blob/master/image/templates/helm/stackrox-central/templates/01-central-11-pvc.yaml#L27, which apparently is not translated correctly into the Argo CD world.

In general: we must be very careful about changing the semantics of the hook resources, as current users might depend on the current behaviour. For example, a PV is expected not to be destroyed when a Helm chart is uninstalled. Secrets are another example of resources which we do not want to be deleted implicitly.

Now, I wonder what would happen if we keep the Hook nature of the current Hook resources and add a deletion policy of hook-failed -- would this prevent these resources from being deleted by Argo CD on a sync?
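
Concretely, that would mean annotating the resources along these lines (a sketch; hook-failed is one of Helm's documented deletion policies):

metadata:
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: hook-failed  # delete the resource only if the hook failed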

Thanks Moritz

porridge commented 1 year ago

@hairmare can you please comment on @mtesseract 's response?

mclasmeier commented 8 months ago

I'd like to follow up on this one.

Regarding WaitForFirstConsumer: I failed to reproduce this in my experiments -- I wonder if this problem has been solved outside of StackRox somehow?

I am connected to an AKS cluster and I have this:

❯ kc get sc | rg '(^NAME|default)'
NAME                    PROVISIONER          RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
default (default)       disk.csi.azure.com   Delete          WaitForFirstConsumer   true  

My helm CLI is version v3.13.3.

I tried to install the stackrox-central-services Helm chart with helm install --wait. It took a while (until all pods were up and healthy), but it worked.
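
The invocation was of the form (release name, namespace and repo alias are illustrative):

❯ helm install stackrox-central-services stackrox/stackrox-central-services \
    -n stackrox --create-namespace --wait

A PVC got created: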

❯ kc get pvc -A
NAMESPACE   NAME         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
stackrox    central-db   Bound    pvc-922dfbca-3273-4342-894b-5cf17d648eea   100Gi      RWO            default        6m

Just to verify, the PVC does have the Helm hooks associated:

❯ kc get -n stackrox pvc central-db -o yaml | yq .metadata.annotations | rg hook
helm.sh/hook: pre-install,pre-upgrade
helm.sh/hook-delete-policy: never

Can you confirm this or am I missing something here?

Thanks Moritz

hairmare commented 8 months ago

Hi Moritz

Thanks for circling back and trying this again!

I tried again on a vanilla AKS and didn't get any further...

$ kubectl get sc | grep -P '(^NAME|default)'
NAME                    PROVISIONER          RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
default (default)       disk.csi.azure.com   Delete          WaitForFirstConsumer   true                   3d13h

$ kubectl get pvc -A | grep -P '(^NAME|stackrox)'
NAMESPACE             NAME                                                                             STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
stackrox-issue-2482   stackrox-db                                                                      Pending                                                                        default        8m30s

$ kubectl get pvc stackrox-db -oyaml | yq .metadata.annotations | grep -P hook 
helm.sh/hook: pre-install,pre-upgrade
helm.sh/hook-delete-policy: never

# in fact:
$ kubectl get-all -n stackrox-issue-2482
NAME                               NAMESPACE            AGE
configmap/kube-root-ca.crt         stackrox-issue-2482  21m  
persistentvolumeclaim/stackrox-db  stackrox-issue-2482  12m  
serviceaccount/default             stackrox-issue-2482  21m  
secret/central-htpasswd            stackrox-issue-2482  12m  
secret/central-tls                 stackrox-issue-2482  12m  
secret/scanner-db-password         stackrox-issue-2482  12m  
secret/scanner-db-tls              stackrox-issue-2482  12m  
secret/scanner-tls                 stackrox-issue-2482  12m  
secret/stackrox-generated-letxqh   stackrox-issue-2482  12m  

$ kubectl events pvc stackrox-db
LAST SEEN              TYPE     REASON                 OBJECT                              MESSAGE
2m50s (x82 over 23m)   Normal   WaitForFirstConsumer   PersistentVolumeClaim/stackrox-db   waiting for first consumer to be created before binding

I don't think I'll be able to dig much deeper today but will investigate further. Some notes for now:

I'm still using Argo CD for the deployment (not plain Helm). Argo CD doesn't just call the helm binary, so it might be contributing to the issue. The difference in our PVC naming leads me to think we might want to compare values and versions... below is a dump of the test I did to get the above results.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-stackrox
  namespace: infra-argocd
spec:
  destination:
    namespace: stackrox-issue-2482
    server: https://kubernetes.default.svc
  project: test
  source:
    chart: stackrox-central-services
    helm:
      releaseName: stackrox-central-services
      values: |-
        allowNonstandardNamespace: true
        system:
          enablePodSecurityPolicies: false
    repoURL: https://raw.githubusercontent.com/stackrox/helm-charts/main/opensource/
    targetRevision: 74.9.0

As you can see, it's quite minimal. It also fails pretty early, so I don't even get to the point where pods are scheduled and images are pulled.

When I initially reported this issue there were some bugs in Argo CD that compounded its effects. Those have since been fixed, leading to a somewhat clearer(ish) picture.

As far as I can tell, Helm on the CLI applies the resources in the order mandated by the hooks and waits for the API server to accept them before progressing to the next stage. OTOH Argo CD, being a fully fledged meta-operator, waits for the resources to become available before progressing to the next stage. In the case of WaitForFirstConsumer this never happens, so Argo CD won't ever progress past the hook stage.

I'm not sure this can be fundamentally addressed in Argo CD without breaking the promises it makes to users regarding how hooks work, at least for subtleties like this one (which isn't defined in Helm as well as would be preferable).

I'm also failing to understand what purpose the hooks actually serve. One of the core tasks of the Kubernetes scheduler is to place workloads on a node, using the CSI API to figure out where, and when, a workload can run. The hooks currently preempt the scheduler from reconciling this as it happens, making it wait for a resolution strategy that never becomes available.

That said, although I can't think of a scenario where the hook annotation isn't a subtle bug and where removing it isn't the lowest-risk change, I can understand that your risk posture probably differs from mine. Since external DB support was released, being able to deploy StackRox this way using Argo CD has basically been relegated to a lab requirement.

Anyway, our application is progressing well, and the fact that this issue is very far from my critical path leads me to tend towards considering it fixed, in the sense that there are simple and established workarounds :)

Cheers, Lucas

mclasmeier commented 8 months ago

Btw, found an issue about this for ArgoCD: https://github.com/argoproj/argo-cd/issues/12840

porridge commented 8 months ago

Hi @hairmare , I have a few comments and questions, some of which might help, but might be just related, depending on your use case. Feel free to ignore them as you see fit.

  1. Argo CD is trying to remove hook resources from the previous sync before it starts the next sync

    It seems to me that ArgoCD is not obeying the hook-delete-policy: never annotation? Perhaps this is a bug in ArgoCD?

  2. Your mention of targetRevision: 74.9.0 suggests that you're deploying an (almost?) end-of-life version of stackrox, i.e. 3.74.9. Perhaps it would make sense to look into upgrading to the 4.x version? Although I'm pretty sure this particular issue is also present there, so this is more of a general remark.
  3. Also on a more general note, but this might actually help with this issue: since you mentioned using OpenShift GitOps, have you considered using the ACS operator that we ship for OpenShift? I'd be very interested in any feedback you might have regarding the feasibility of moving from a helm-based stackrox installation to an operator-based one (see the sketch after this list). In case the problem is that there is no operator support for non-OpenShift platforms, would it affect your decision if we started supporting other platforms too?
  4. Funny (or sad) how the bug which @mclasmeier found also mentions stackrox. Note that this comment has a workaround for your late PVC binding issue, @hairmare. Do you think it could solve it for you?
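
(Regarding point 3: for reference, an operator-based install boils down to a single Central custom resource. A minimal sketch, assuming the operator's platform.stackrox.io/v1alpha1 API and leaving all defaults in place:)

apiVersion: platform.stackrox.io/v1alpha1
kind: Central
metadata:
  name: stackrox-central-services
  namespace: stackrox
spec: {}  # left empty here; the operator applies its defaults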

mclasmeier commented 8 months ago

As I understand it, the issue for @hairmare is somewhat solved, since he wrote:

Anyway, our application is progressing well, and the fact that this issue is very far from my critical path leads me to tend towards considering it fixed, in the sense that there are simple and established workarounds :)

But I was able to verify that the combination still breaks: AKS + ArgoCD + StackRox Helm Chart => deadlocked.

One can probably use the mentioned workaround, or even plug kustomize into ArgoCD -- I've seen that somewhere; here it is: https://github.com/argoproj/argocd-example-apps/blob/master/plugins/kustomized-helm/README.md

Furthermore, we do have some evidence for this issue not being present when using the stackrox operator in ArgoCD -- this is what we are doing internally.

@porridge, not sure what you are referring to here:

Funny (or sad) how the bug which @mclasmeier found also mentions stackrox.

porridge commented 8 months ago

@porridge, not sure what you are referring to here:

Funny (or sad) how the bug which @mclasmeier found also mentions stackrox.

I mean that the person who filed the bug you found also encountered it when trying to deploy stackrox.

mclasmeier commented 8 months ago

Ah, I didn't notice stackrox being mentioned in the attached screenshot.

By the way, there is even another issue for this here: https://github.com/stackrox/stackrox/issues/5191

hairmare commented 8 months ago

@mclasmeier thanks for all the pointers, I have some reading to do.

The issue isn't solved as such, but it has a low priority because I can circumvent it by using an external DB or by using the workaround from this comment. I would appreciate it working without either of these, but my use case foresees using a DBaaS anyway.

mclasmeier commented 8 months ago

@hairmare, I might have good news for you.

A change was merged to master which allows users of the Helm charts to overwrite these annotations. For example:

customize:
  other:
    persistentvolumeclaim/central-db:
      annotations:
        helm.sh/hook: null
        helm.sh/hook-delete-policy: null
        helm.sh/resource-policy: null

This would cause the Helm annotations for the central-db PVC to be deleted. Caveat: when using this kind of flexibility to modify the manifests during rendering, you had better know exactly what you are doing -- particularly in terms of lifecycle management. This is a powerful mechanism, but I'm sure it also allows shooting yourself in the foot.

I have tried this with argocd on an AKS cluster (whose default StorageClass uses WaitForFirstConsumer) and it seems to work.

In case you need some help trying it out, feel free to ping me.

The same mechanism can also be used for changing the secret annotations.
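
For example, a sketch for one of the secrets listed earlier in this thread (assuming the resource-key syntax works the same way for secrets):

customize:
  other:
    secret/central-htpasswd:
      annotations:
        helm.sh/hook: null
        helm.sh/hook-delete-policy: null
        helm.sh/resource-policy: null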

That being said, I think we can close this issue -- do you agree?

Thank you!

kastl-ars commented 3 months ago

@hairmare, I might have good news for you.

A change was merged to master which allows users of the Helm charts to overwrite these annotations. For example:

customize:
  other:
    persistentvolumeclaim/central-db:
      annotations:
        helm.sh/hook: null
        helm.sh/hook-delete-policy: null
        helm.sh/resource-policy: null

Sorry to chime in, but as I am hit by the same issue I would like to use this.

Does this just go into the values.yaml? Is the customize.other path still correct? It is mentioned nowhere in the values.yaml, and it does not seem to work properly in my tests.

Thanks in advance!

kastl-ars commented 3 months ago

OK, the problem seems to be that the annotations are using quoted keys.

Using the following snippet ...

customize:
  other:
    persistentvolumeclaim/central-db:
      annotations:
        helm.sh/hook: XXX
        helm.sh/hook-delete-policy: XXX
        helm.sh/resource-policy: XXX

when running a local helm template, I see that my annotations are added in addition to the ones already present:

# Source: stackrox-central-services/charts/stackrox-central-services/templates/01-central-11-db-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: central-db
  namespace: stackrox
  labels:
    app.kubernetes.io/component: central
    app.kubernetes.io/instance: stackrox-central-services
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: stackrox
    app.kubernetes.io/part-of: stackrox-central-services
    app.kubernetes.io/version: 4.4.4
    helm.sh/chart: stackrox-central-services-400.4.4
  annotations:
    email: support@stackrox.com
    helm.sh/hook: XXX
    helm.sh/hook-delete-policy: XXX
    helm.sh/resource-policy: XXX
    meta.helm.sh/release-name: stackrox-central-services
    meta.helm.sh/release-namespace: stackrox
    owner: stackrox
    "helm.sh/hook": "pre-install,pre-upgrade"
    "helm.sh/resource-policy": keep
    "helm.sh/hook-delete-policy": never

Notice both helm.sh/hook and "helm.sh/hook" are set.
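
(Note: in YAML, quoted and unquoted keys are the same string, so the rendered manifest above effectively contains duplicate keys; which value wins is parser-dependent. A minimal illustration:)

annotations:
  helm.sh/hook: XXX                          # key is helm.sh/hook
  "helm.sh/hook": "pre-install,pre-upgrade"  # the very same key again -- a duplicate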

I'll try to overwrite the quoted annotations; my first tries did not work...

mclasmeier commented 3 months ago

@kastl-ars Just to verify: which version of the Helm chart are you using?

kastl-ars commented 3 months ago

@kastl-ars Just to verify: which version of the Helm chart are you using?

400.4.4

mclasmeier commented 3 months ago

The feature you need for this has only landed in the 4.5.x series.

kastl-ars commented 3 months ago

The feature you need for this has only landed in the 4.5.x series.

OK, running helm template with 400.5.0, I no longer see any PVC being created. I'll try and report back...

No idea what went wrong, but now I see the PVC. And it still has the "wrong" annotations that I would like to get rid of...

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: central-db
  namespace: stackrox
  labels:
    app.kubernetes.io/component: central
    app.kubernetes.io/instance: stackrox-central-services
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: stackrox
    app.kubernetes.io/part-of: stackrox-central-services
    app.kubernetes.io/version: 4.5.0
    helm.sh/chart: stackrox-central-services-400.5.0
  annotations:
    email: support@stackrox.com
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: never
    helm.sh/resource-policy: keep
    meta.helm.sh/release-name: stackrox-central-services
    meta.helm.sh/release-namespace: stackrox
    owner: stackrox

mclasmeier commented 3 months ago

Please provide the argoCD Application.

kastl-ars commented 3 months ago

Please provide the argoCD Application.

I cannot, as it is created by a rather complex ApplicationSet, for historical reasons.

We have a Chart.yaml that has the stackrox-central-services chart as a dependency (and does nothing else).

apiVersion: v2
name: some-name
version: 1.0.1
dependencies:
  - name: stackrox-central-services
    repository: https://raw.githubusercontent.com/stackrox/helm-charts/main/opensource/
    version: 400.5.0

And the corresponding values.yaml has all stackrox-central-services-related settings beneath a top-level key of the same name, so those are valid for the dependency chart, not the actual (empty) chart:

stackrox-central-services:
  allowNonstandardReleaseName: true
  customize:
    other:
      persistentvolumeclaim/central-db:
        annotations:
          "helm.sh/hook": null
          "helm.sh/hook-delete-policy": null
          "helm.sh/resource-policy": null
  env:
    openshift: 4
    istio: false
    platform: "default"
    offlineMode: true
  system:
    enablePodSecurityPolicies: false
  meta:
    useLookup: false
  central:
    telemetry:
      enabled: false
    persistence:
      none: true
    adminPassword:
      htpasswd: "htpasswd-bcrypt-hash-goes-here"
    exposure:
      route:
        enabled: true
    resources:
      requests:
        memory: 1Gi
        cpu: 1
      limits:
        memory: 4Gi
        cpu: 1
    db:
      resources:
        requests:
          memory: 1Gi
          cpu: 500m
        limits:
          memory: 4Gi
          cpu: 1
  scanner:
    autoscaling:
      disable: true
    replicas: 1
    resources:
      requests:
        memory: 500Mi
        cpu: 500m
      limits:
        memory: 2500Mi
        cpu: 2000m

Complicated, but used pretty often in ArgoCD setups.

I hope this helps.

kastl-ars commented 3 months ago

(I have since removed the meta.useLookup that I tried before, but no change).

kastl-ars commented 3 months ago

OK, I just tested a different approach: I created a (trimmed-down) values.yaml file and called helm locally. So this uses the chart directly (not as a dependency) and does not use ArgoCD.

---
customize:
  other:
    persistentvolumeclaim/central-db:
      annotations:
        "helm.sh/hook": null
        "helm.sh/hook-delete-policy": null
        "helm.sh/resource-policy": null
env:
  openshift: 4
  istio: false
  platform: "default"
  offlineMode: true
system:
  enablePodSecurityPolicies: false
central:
  telemetry:
    enabled: false
  persistence:
    none: true
  adminPassword:
    htpasswd: 'bcrypt_hash'

$ helm template stackrox-central-services stackrox/stackrox-central-services -n stackrox -f values.yaml | tee helm-template.yml

The resulting helm-template.yml does NOT contain the helm.sh/hook annotations, so overwriting/removing the annotations works there.

# Source: stackrox-central-services/templates/01-central-11-db-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: central-db
  namespace: stackrox
  labels:
    app.kubernetes.io/component: central
    app.kubernetes.io/instance: stackrox-central-services
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: stackrox
    app.kubernetes.io/part-of: stackrox-central-services
    app.kubernetes.io/version: 4.5.0
    helm.sh/chart: stackrox-central-services-400.5.0
  annotations:
    email: support@stackrox.com
    meta.helm.sh/release-name: stackrox-central-services
    meta.helm.sh/release-namespace: stackrox
    owner: stackrox
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi

So I guess this is either ArgoCD removing the quotes, or some other issue in the ArgoCD process.

P.S. As helm no longer recommends/requires quoting these annotations, it might be a good idea to remove the quotes. But that might break existing installations, hooray... ;-(

kastl-ars commented 3 months ago

For the time being I manually created the PVC and told the chart to use it:

central:
  db:
    persistence: 
      persistentVolumeClaim:
        claimName: my-own-pvc
        createClaim: false
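
For reference, the pre-created PVC is just a plain manifest along these lines (a sketch; the name matches the claimName above, size and access mode match the chart defaults seen earlier in this thread):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-own-pvc
  namespace: stackrox
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi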

I started a discussion on the ArgoCD repository: https://github.com/argoproj/argo-cd/discussions/19254

mclasmeier commented 3 months ago

OK, so I was able to conduct some experiments: I created an OpenShift 4.16.3 cluster, installed the Red Hat OpenShift GitOps v1.13.0 operator on it, created an ArgoCD instance, and applied the following manifest:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: stackrox-central-services
  namespace: openshift-gitops
  labels:
    name: stackrox-central-services
spec:
  project: default
  source:
    repoURL: "https://github.com/stackrox/helm-charts.git"
    targetRevision: HEAD
    path: "4.5.0/central-services"
    helm:
      version: v3
      valuesObject:
        allowNonstandardNamespace: true
        monitoring:
          openshift:
            enabled: false
        system:
          enablePodSecurityPolicies: false
        imagePullSecrets:
          allowNone: true
        central:
          persistence:
            none: true
        customize:
          other:
            persistentvolumeclaim/central-db:
              annotations:
                "helm.sh/hook": null
                "helm.sh/hook-delete-policy": null
                "helm.sh/resource-policy": null

  destination:
    server: "https://kubernetes.default.svc"
    namespace: stackrox

I triggered a sync using:

$ argocd app sync openshift-gitops/stackrox-central-services

(To make this work properly I also had to execute oc label ns stackrox "argocd.argoproj.io/managed-by=openshift-gitops" as documented.)

This caused StackRox to be deployed on the cluster and I verified that the Helm hook annotations on the central-db PVC are gone.

Furthermore, I was able to verify that it doesn't matter whether I quote the annotation keys in the provided values or not: in both cases the hook annotations are absent. And if I remove the customization entirely, the hook annotations are present.

The ArgoCD that has been deployed on the cluster is v2.11.3+3f344d5.

So, I cannot reproduce the issue in this specific environment. :-/

kastl-ars commented 3 months ago

Thanks for checking. Then I think this might be yet another issue where helm handles values differently depending on whether they are for a chart or a sub-chart/dependency... :-(

As I was able to work around this by using an existing PVC, I would say we can stop right here. In general it works: the chart provides a way to remove the annotations, and it works both locally with helm and directly in ArgoCD. That should be good enough...

Thanks for your help, @mclasmeier !

mclasmeier commented 3 months ago

Alright! Thanks for your input on these matters. Would be nice to have a smooth user experience on ArgoCD.