rancher / fleet

Deploy workloads from Git to large fleets of Kubernetes clusters
https://fleet.rancher.io/
Apache License 2.0

[BUG] gitRepo resources aren't applying to clusters when changing fleet workspace #1845

Closed: slickwarren closed this issue 12 months ago

slickwarren commented 1 year ago

Rancher Server Setup

Information about the Cluster

aiyengar2 commented 1 year ago

The root cause of this issue is an error seen in the Job resource created by the GitJob resource created by the GitRepo resource in any user-created Fleet workspace.

Error creating: pods "ctw-test1-44c94-bkdzs" is forbidden: violates PodSecurity "restricted:latest": allowPrivilegeEscalation != false (containers "gitcloner-initializer", "fleet" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers "gitcloner-initializer", "fleet" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or containers "gitcloner-initializer", "fleet" must set securityContext.runAsNonRoot=true), seccompProfile (pod or containers "gitcloner-initializer", "fleet" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

This is because the base cluster used when filing this issue was a hardened 1.25+ (1.27) cluster, which means that it enforces a restrictive set of Pod Security Standards.

Since the new namespace is not "allow-listed" to deploy privileged pods, the initContainer of the Job tied to the GitJob tied to the GitRepo is not able to progress, which results in the GitRepo creating no new Bundle resources. From there, fleet-agent is responding as expected by not creating any resources.

I suspect this might be a general issue that was not tested in previous releases, so here are some new steps for reproduction:

To Reproduce

Expected Bad Result: The cluster sees the same symptoms as we see here. The Job in that namespace, tied to the GitJob tied to the GitRepo, should be prevented from running due to PSA.
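
For anyone trying to reproduce this without a fully hardened cluster, here is a rough client-go sketch (not from this thread; the namespace name is only an example) that labels a single workspace namespace for the restricted Pod Security Standard, which should trigger the same rejection of the GitJob-spawned Job. Hardened clusters enforce this cluster-wide through an AdmissionConfiguration instead, but the namespace labels reproduce the same admission behavior:

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Build a client from the default kubeconfig location.
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        client := kubernetes.NewForConfigOrDie(cfg)

        // "fleet-custom-workspace" is an example name for the namespace backing a
        // user-created Fleet workspace; replace it with the real workspace namespace.
        patch := []byte(`{"metadata":{"labels":{"pod-security.kubernetes.io/enforce":"restricted","pod-security.kubernetes.io/enforce-version":"latest"}}}`)

        if _, err := client.CoreV1().Namespaces().Patch(context.TODO(), "fleet-custom-workspace",
            types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
            panic(err)
        }
        fmt.Println("namespace now enforces restricted:latest; the Job spawned for the GitRepo should be rejected")
    }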

@slickwarren can you explicitly test this on an older Rancher version?

cc: @manno @olblak, this is a Fleet issue, so we should ideally transfer this over.

slickwarren commented 1 year ago

Tested on a released version of Rancher, 2.7.8, with a hardened local cluster, and I am experiencing the same error there:

 7 warnings.go:70] would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (containers "working-dir-initializer", "place-tools", "step-git-source", "fleet" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers "working-dir-initializer", "place-tools", "step-git-source", "fleet" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or containers "working-dir-initializer", "place-tools", "step-git-source", "fleet" must set securityContext.runAsNonRoot=true), seccompProfile (pod or containers "working-dir-initializer", "place-tools", "step-git-source", "fleet" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

I'm actually seeing similar behavior on non-hardened clusters (on 2.8-head).

Error creating: pods "ctwtest-44c94-nb7c2" is forbidden: violates PodSecurity "restricted:latest": allowPrivilegeEscalation != false (containers "gitcloner-initializer", "fleet" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers "gitcloner-initializer", "fleet" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or containers "gitcloner-initializer", "fleet" must set securityContext.runAsNonRoot=true), seccompProfile (pod or containers "gitcloner-initializer", "fleet" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
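
If a supposedly non-hardened cluster rejects these pods, one thing worth checking is whether the workspace namespace itself carries PodSecurity admission labels. A quick client-go sketch (the namespace name is an example, not from this thread) to print them:

    package main

    import (
        "context"
        "fmt"
        "strings"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        client := kubernetes.NewForConfigOrDie(cfg)

        // Example namespace backing the user-created Fleet workspace.
        ns, err := client.CoreV1().Namespaces().Get(context.TODO(), "fleet-custom-workspace", metav1.GetOptions{})
        if err != nil {
            panic(err)
        }

        // If any of these labels are present, PodSecurity admission applies to this
        // namespace even when the cluster was not hardened via an AdmissionConfiguration.
        for k, v := range ns.Labels {
            if strings.HasPrefix(k, "pod-security.kubernetes.io/") {
                fmt.Printf("%s=%s\n", k, v)
            }
        }
    }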

jiaqiluo commented 1 year ago

Here are my 2 cents: if it is true that Fleet components must run with those mentioned permissions (allowPrivilegeEscalation, runAsNonRoot, etc.), we have to either modify the PSS level on that namespace or whitelist that namespace.

kkaempf commented 1 year ago

@sbulage - can you please verify this bug?

sbulage commented 1 year ago

Hello @slickwarren, I am hitting this issue while installing Rancher from 2.8-head.

Can you please tell me how to install Rancher on a k8s version > 1.27.0-0?

Thanks in advance.

aiyengar2 commented 1 year ago

> Here are my 2 cents: if it is true that Fleet components must run with those mentioned permissions (allowPrivilegeEscalation, runAsNonRoot, etc.), we have to either modify the PSS level on that namespace or whitelist that namespace.

The issue here is that these workspaces are user-created resources, so there may be security implications with any automatic process that would attempt to whitelist such a namespace. cc: @macedogm @pjbgf, this may fall in your wheelhouse

macedogm commented 1 year ago

We should not add new namespaces to the PSAC exempt list already provided by Rancher, because such namespaces would be user controlled (outside of Rancher's trust boundaries). Additionally, doing such allow lists dynamically is risky.

The recommendation is to fix Fleet's deployment to make sure that all components run with an unprivileged securityContext, as Jiaqi and Arvind mentioned.

I haven't dug deep, but if I saw correctly, wouldn't it be a case of fixing GitJob's deployment to match Fleet's and Fleet-Agent's deployments, respectively?

https://github.com/rancher/fleet/blob/a1dd929a7f9a8df6ee451dad86174ccd303a60af/charts/fleet/templates/deployment.yaml#L62-L68

https://github.com/rancher/fleet/blob/a1dd929a7f9a8df6ee451dad86174ccd303a60af/charts/fleet-agent/templates/deployment.yaml#L29-L35

Note: @aiyengar2 thanks for pinging us on this.

raulcabello commented 1 year ago

> We should not add new namespaces to the PSAC exempt list already provided by Rancher, because such namespaces would be user controlled (outside of Rancher's trust boundaries). Additionally, doing such allow lists dynamically is risky.
>
> The recommendation is to fix Fleet's deployment to make sure that all components run with an unprivileged securityContext, as Jiaqi and Arvind mentioned.
>
> I haven't dig deep, but if I saw correctly, wouldn't be the case of fixing GitJob's deployment to match Fleet's and Fleet-Agent's deployment, respectively?
>
> https://github.com/rancher/fleet/blob/a1dd929a7f9a8df6ee451dad86174ccd303a60af/charts/fleet/templates/deployment.yaml#L62-L68
>
> https://github.com/rancher/fleet/blob/a1dd929a7f9a8df6ee451dad86174ccd303a60af/charts/fleet-agent/templates/deployment.yaml#L29-L35
>
> Note: @aiyengar2 thanks for pinging us on this.

This is the GitJob deployment, not the k8s Job that is created when a GitRepo is created or modified. I think we should also add the securityContext here and here to fix this issue.

I can't transfer a Cluster to a different workspace, but I don't see anything in the logs.

Rancher version: v2.8.0-alpha2
Installation option (Docker install/Helm Chart): rke2 v1.27.6+rke2r1
Downstream cluster: k3s v1.27.6+k3s1

Steps:

Then the Cluster stays in the Wait Check-In state and is never moved to the new workspace I created. However, I don't see anything in the gitjob or fleet-controller logs, and the k8s Job that should run fleet apply is never created. Where should I look for the error?

aiyengar2 commented 1 year ago

@raulcabello the relationship between the GitRepo and Job is through the GitJob. When the GitRepo is modified, the related GitJob will also be changed:

https://github.com/rancher/fleet/blob/a1dd929a7f9a8df6ee451dad86174ccd303a60af/internal/cmd/controller/controllers/git/git.go#L433

That’s why the GitJob spawns a new Job.

So it is indeed tied to the Job created when a GitRepo is modified; that is the k8s Job that runs fleet apply.
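
To make that chain concrete, a small client-go sketch (the namespace name is an example, not from this thread) that lists the Jobs in a workspace namespace along with their events, which is where the "forbidden: violates PodSecurity" message above shows up when the Job controller cannot create the pod:

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        client := kubernetes.NewForConfigOrDie(cfg)

        // Example namespace backing a user-created Fleet workspace.
        ns := "fleet-custom-workspace"

        jobs, err := client.BatchV1().Jobs(ns).List(context.TODO(), metav1.ListOptions{})
        if err != nil {
            panic(err)
        }
        for _, job := range jobs.Items {
            owner := "unknown owner"
            if len(job.OwnerReferences) > 0 {
                owner = job.OwnerReferences[0].Kind + "/" + job.OwnerReferences[0].Name
            }
            fmt.Printf("Job %s (owned by %s)\n", job.Name, owner)

            // The PodSecurity rejection surfaces as a FailedCreate event on the Job,
            // because the Job controller is not allowed to create the pod at all.
            events, err := client.CoreV1().Events(ns).List(context.TODO(), metav1.ListOptions{
                FieldSelector: "involvedObject.kind=Job,involvedObject.name=" + job.Name,
            })
            if err != nil {
                panic(err)
            }
            for _, ev := range events.Items {
                fmt.Printf("  %s: %s\n", ev.Reason, ev.Message)
            }
        }
    }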

aiyengar2 commented 1 year ago

Transferring the Fleet cluster is gated by a feature flag in Rancher; once you enable it in the UI (or by modifying the feature resource in the management cluster), you should be able to execute a transfer.
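
For the non-UI route, a hedged sketch of what "modifying the feature resource" could look like with a dynamic client. The flag name below is a hypothetical placeholder; the actual flag is the one referenced in the dashboard/fleet issue links in the next comment:

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/dynamic"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Client against the Rancher management (local) cluster.
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        dyn, err := dynamic.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }

        // Rancher feature flags are cluster-scoped features.management.cattle.io resources.
        gvr := schema.GroupVersionResource{Group: "management.cattle.io", Version: "v3", Resource: "features"}

        // Hypothetical placeholder name; substitute the real feature flag.
        const featureName = "example-fleet-workspace-flag"

        // Setting spec.value to true enables the flag.
        patch := []byte(`{"spec":{"value":true}}`)
        if _, err := dyn.Resource(gvr).Patch(context.TODO(), featureName,
            types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
            panic(err)
        }
        fmt.Println("feature flag enabled")
    }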

sbulage commented 1 year ago

After enabling the feature flag mentioned in https://github.com/rancher/dashboard/issues/9730#issuecomment-1749796841 and https://github.com/rancher/fleet/issues/1845#issuecomment-1757405438, it is working as expected.

As @raulcabello and I can see, the cluster can be moved from the fleet-default workspace to the newly created workspace (the GitRepo is already present).

Cluster details: it is a non-hardened cluster.

Also, I tried to create a new GitRepo in the newly created workspace, and it is working as expected.

No _error or warning_ traces found in gitjob pods.

K8s job is created and completed with no errors.

We will try to reproduce it on a hardened cluster.

raulcabello commented 1 year ago

I think https://github.com/rancher/fleet/pull/1852 and https://github.com/rancher/gitjob/pull/331 should fix this issue. However, we can't test it as we are still not able to reproduce the issue.

@sbulage will try tomorrow to reproduce it in a hardened cluster

kkaempf commented 1 year ago

It needs to be

Also:

One would not get messages about "violates PodSecurity" on non-hardened clusters (a restricted AdmissionConfiguration needs to be set at the cluster level; see https://kubernetes.io/docs/tutorials/security/cluster-level-pss/ for more context).

raulcabello commented 1 year ago

> We should not add new namespaces to the PSAC exempt list already provided by Rancher, because such namespaces would be user controlled (outside of Rancher's trust boundaries). Additionally, doing such allow lists dynamically is risky.
>
> The recommendation is to fix Fleet's deployment to make sure that all components run with an unprivileged securityContext, as Jiaqi and Arvind mentioned.
>
> I haven't dig deep, but if I saw correctly, wouldn't be the case of fixing GitJob's deployment to match Fleet's and Fleet-Agent's deployment, respectively?
>
> https://github.com/rancher/fleet/blob/a1dd929a7f9a8df6ee451dad86174ccd303a60af/charts/fleet/templates/deployment.yaml#L62-L68
>
> https://github.com/rancher/fleet/blob/a1dd929a7f9a8df6ee451dad86174ccd303a60af/charts/fleet-agent/templates/deployment.yaml#L29-L35
>
> Note: @aiyengar2 thanks for pinging us on this.

@macedogm is it ok if we don't add readOnlyRootFilesystem: true to the securityContext? It causes problems when cloning git repos with go-git, a third-party library, which creates temporary files that trigger the failure.

The following securityContext, applied to both containers in the job created by gitjob, is working fine in hardened clusters in our test env.

    SecurityContext: &corev1.SecurityContext{
        AllowPrivilegeEscalation: &[]bool{false}[0],
        Privileged:               &[]bool{false}[0],
        RunAsNonRoot:             &[]bool{true}[0],
        SeccompProfile: &corev1.SeccompProfile{
            Type: corev1.SeccompProfileTypeRuntimeDefault,
        },
        Capabilities: &corev1.Capabilities{Drop: []corev1.Capability{"ALL"}},
    },

Is this securityContext enough? See https://github.com/rancher/fleet/pull/1860 and https://github.com/rancher/gitjob/pull/331.

pjbgf commented 1 year ago

> This is causing problems when cloning git repos with go-git. go-git is creating temporary files that are causing the problem. go-git is a third party library.

@raulcabello You can mount an emptyDir to that specific path (/tmp) to bypass this issue.
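
A rough corev1 sketch of that suggestion (the helper name and variables are illustrative, not from the Fleet code): keep the root filesystem read-only and give the container a writable /tmp through an emptyDir volume.

    package sketch

    import corev1 "k8s.io/api/core/v1"

    // addWritableTmp mounts an emptyDir at /tmp so go-git can write its temporary
    // files while the container keeps readOnlyRootFilesystem enabled.
    func addWritableTmp(pod *corev1.PodSpec, container *corev1.Container) {
        pod.Volumes = append(pod.Volumes, corev1.Volume{
            Name:         "tmp",
            VolumeSource: corev1.VolumeSource{EmptyDir: &corev1.EmptyDirVolumeSource{}},
        })
        container.VolumeMounts = append(container.VolumeMounts, corev1.VolumeMount{
            Name:      "tmp",
            MountPath: "/tmp",
        })
        if container.SecurityContext == nil {
            container.SecurityContext = &corev1.SecurityContext{}
        }
        readOnly := true
        container.SecurityContext.ReadOnlyRootFilesystem = &readOnly
    }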

macedogm commented 1 year ago

@raulcabello I agree with Paulo's suggestion above (in case it's feasible).

sbulage commented 12 months ago

I have tested Raul's fix with different images and found that the fleet-agent pods are not re-created with the newer images. Until Raul's PR gets merged, he uploaded images and gave them to me for testing.

It throws the error below:

Pods "fleet-agent-79fc9f8d57-8k6kw" is forbidden: violates PodSecurity "restricted:latest": unrestricted capabilities (container "fleet-agent" must set securityContext.capabilities.drop=["ALL"]; container "fleet-agent" must not include "ALL" in securityContext.capabilities.add), runAsNonRoot != true (container "fleet-agent" must not set securityContext.runAsNonRoot=false), seccompProfile (pod or container "fleet-agent" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost"):Updated: 0/1

I looked closely at the PSA exempted namespace list and saw that the cattle-fleet-local-system namespace, which is managed by the fleet-controller, is absent. In order to test that the patch is working, I followed the steps below:

  1. I updated the PSA file by adding the cattle-fleet-local-system namespace to the exempted namespace list.
  2. Restarted the RKE2 server.
  3. Updated the fleet-agent image and observed that it gets updated without any error.

@pjbgf @macedogm What should I do? Should I create a new issue for it? If yes, then where? Please let me know, thanks 😄

Ignore this comment as I see cattle-fleet-local-system is already exempted. (Here)

manno commented 12 months ago

/backport v2.8.0 release/v0.9

slickwarren commented 12 months ago

@manno or @sbulage it doesn't look like rancher has picked up the new RC. Is this something your team can update so that I can run another round of testing? https://github.com/rancher/charts/pull/3138

sbulage commented 12 months ago

@slickwarren There was an issue with CI in rancher/charts which seems to be fixed now. It will be available once rancher/charts#3138 is merged (hopefully in the next hour or so :crossed_fingers:).

raulcabello commented 12 months ago

It is available now

sbulage commented 12 months ago

All the test scenarios below were performed on an RKE2 cluster, both with and without hardening.

Environment Details:

QA TEST PLAN

Scenarios

| Scenario | Test Case |
| --- | --- |
| 1 | Test that a GitRepo deploys an application in the newly created workspace when the cluster is moved from the fleet-default workspace to the newly created workspace in a hardened cluster. |
| 2 | Test that a GitRepo deploys an application in the newly created workspace when the cluster is moved from the fleet-default workspace to the newly created workspace in a non-hardened cluster. |
| 3 | Test that a GitRepo deploys a Helm application in the newly created workspace when the cluster is moved from the fleet-default workspace to the newly created workspace in a hardened cluster. |
| 4 | Test that a private GitRepo deploys an application in the newly created workspace when the cluster is moved from the fleet-default workspace to the newly created workspace in a hardened cluster. |
| 5 | Test that a private GitRepo deploys an application in the newly created workspace when the cluster is moved from the fleet-default workspace to the newly created workspace in a non-hardened cluster. |

sbulage commented 12 months ago

TEST RESULT

The RKE2 hardened cluster is created by following the documentation.

| Scenario | Test Case | Result |
| --- | --- | --- |
| 1 | Test that a GitRepo deploys an application in the newly created workspace when the cluster is moved from the fleet-default workspace to the newly created workspace in a hardened cluster. | :heavy_check_mark: |
| 2 | Test that a GitRepo deploys an application in the newly created workspace when the cluster is moved from the fleet-default workspace to the newly created workspace in a non-hardened cluster. | :heavy_check_mark: |
| 3 | Test that a GitRepo deploys a Helm application in the newly created workspace when the cluster is moved from the fleet-default workspace to the newly created workspace in a hardened cluster. | :heavy_check_mark: |
| 4 | Test that a private GitRepo deploys an application in the newly created workspace when the cluster is moved from the fleet-default workspace to the newly created workspace in a hardened cluster. | :heavy_check_mark: |
| 5 | Test that a private GitRepo deploys an application in the newly created workspace when the cluster is moved from the fleet-default workspace to the newly created workspace in a non-hardened cluster. | :heavy_check_mark: |

REPRO STEPS

Scenario 1 - RKE2 hardened cluster

  1. Create a new workspace new-workspace1.
  2. Create a GitRepo which deploys nginx.
  3. Go to the Continuous Delivery --> Clusters.
  4. Change the workspace of the imported-cluster from fleet-default to new-workspace1.
  5. Verified that the nginx app deployed on the cluster which moved to the new workspace without any issue.

Scenario 2 - RKE2 non-hardened cluster

  1. Create a new workspace new-workspace2.
  2. Create a GitRepo which deploys nginx.
  3. Go to the Continuous Delivery --> Clusters.
  4. Change the workspace of the imported-cluster from fleet-default to new-workspace2.
  5. Verified that the nginx app deployed on the cluster which moved to the new workspace without any issue.

Scenario 3 - RKE2 hardened cluster

  1. Create a new workspace new-workspace3.
  2. Create a GitRepo which deploys the grafana helm application.
  3. Go to the Continuous Delivery --> Clusters.
  4. Change the workspace of the imported-cluster from fleet-default to new-workspace3.
  5. Verified that the grafana app deployed on the cluster which moved to the new workspace without any issue.

Scenario 4 - RKE2 hardened cluster

  1. Create a new workspace new-workspace4.
  2. Create a GitRepo which deploys nginx from a private GitHub repository.
  3. Go to the Continuous Delivery --> Clusters.
  4. Change the workspace of the imported-cluster from fleet-default to new-workspace4.
  5. Verified that the nginx app deployed on the cluster which moved to the new workspace without any issue.

Scenario 5 - RKE2 non-hardened cluster

  1. Create a new workspace new-workspace5.
  2. Create a GitRepo which deploys grafana from a private GitHub repository.
  3. Go to the Continuous Delivery --> Clusters.
  4. Change the workspace of the imported-cluster from fleet-default to new-workspace5.
  5. Verified that the grafana app deployed on the cluster which moved to the new workspace without any issue.

kkaempf commented 12 months ago

@slickwarren - please give it a try 😉

slickwarren commented 12 months ago

my tests through rancher using rc5 are working well, and I've closed the rancher-side issues. Not sure of fleet's process for closing these, but rancher-qa signs off on this 👍🏼

sbulage commented 12 months ago

Thanks @slickwarren I will close this issue now. :+1: