redhat-openshift-ecosystem / openshift-preflight

Operator and container preflight certification tests
Apache License 2.0
58 stars 65 forks source link

CatalogSource creation fails because of PSA #867

Open betoredhat opened 1 year ago

betoredhat commented 1 year ago

Bug Description

During operators check "DeployableByOLM" a namespace is created to deploy a catalog Source. Starting 4.12, the catalog source pods cannot be deployed because the NS is not properly labeled to comply with the new Pod Security Admission.

Version and Command Invocation

podman run --rm --privileged --security-opt=label=disable -e PFLT_JUNIT=true -e PFLT_ARTIFACTS=/artifacts -e PFLT_LOGFILE=/artifacts/preflight.log -e PFLT_LOGLEVEL=trace -e DOCKER_CONFIG=/opt -e PFLT_DOCKERCONFIG=/opt/config.json -e KUBECONFIG=/kubeconfig -e PFLT_INDEXIMAGE=my-registry:4443/disconnected-catalog@sha256:72ea908e6244691c23aaed1ac59316bd933d35a8451c7579ddadeca935f885a2 -e PFLT_NAMESPACE=preflight-testing -e PFLT_SERVICEACCOUNT=default -e PFLT_SCORECARD_IMAGE=quay.io/operator-framework/scorecard-test@sha256:b939bdf25de65d0c4256ac799b02a58335b705e24cbdd66d61e344982eb6156d -v /tmp/kubeconfig:/kubeconfig -v /tmp/:/artifacts -v /tmp/config.json:/opt/config.json quay.io/opdev/preflight:1.4.2 check operator my-registry/simple-demo-operator-bundle@sha256:6cfbca9b14a51143cfc5d0d56494e7f26ad1cd3e662eedd2bcbebf207af59c86

$ oc get catalogsource simple-demo-operator -o yaml|tail  status: message: 'couldn''t ensure registry server - error ensuring pod: : error creating new pod: simple-demo-operator-: pods "simple-demo-operator-4gsd4" is forbidden: violates PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "registry-server" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "registry-server" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "registry-server" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "registry-server" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")' reason: RegistryServerError

(The output of preflight --version) 1.4.2

Steps to Reproduce:

(How can we reproduce this?) 1) Run a check operator on top of an OCP 4.12 (or newer) cluster 2) Review the catalogsource status

Expected Result

The catalog source pods should be created and the packages provided listed properly

Actual Result

(What actually happened) container "registry-server" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "registry-server" must set securityContext.runAsNonRoot=true

Additional Context

Understanding and managing pod security admission

komish commented 1 year ago

@betoredhat This seems to happen when using sqlite-based index images because those require writing to the filesystem, and I believe the default Dockerfile generated by tools like opm set the working directory to be /.

Can you try using a File-Based Catalog?

We'll look at implementing the workaround as you describe it until sqlite-based catalogs are fully deprecated.

betoredhat commented 1 year ago

Hello @komish, thanks for checking this.

The disconnected catalog image that we are using is built using FBC. The issue is more with the OLM registry-server pods that required privileged permission. Here we are defining a custom namespace for its deployment, that NS should be tagged accordingly.

The ideal situation should be that the registry-server image does not require PrivilegeEscalation permissions. But I think that at this moment that needs to be fixed by the OLM operator project.

As a workaround in our environment, we are creating in advance the namespace for the operator to be checked. That allows the preflight tasks to complete successfully.

- name: "Create a namespace for {{ operator.name }}"
  community. kubernetes.k8s:
    definition:
      apiVersion: v1
      kind: Namespace
      metadata:
        name: "{{ operator.name }}"
        labels:
          security.openshift.io/scc.podSecurityLabelSync: "false"
          pod-security.kubernetes.io/enforce: privileged
          pod-security.kubernetes.io/enforce-version: latest
    wait: yes

Cheers

komish commented 1 year ago

@betoredhat Can you let us know what version of OCP you're testing against? Specifically, what z-patch version of 4.12. We've tested in two so far (4.12.0 and 4.12.0-ec.5) and saw drastically different behaviors.

betoredhat commented 1 year ago

Actually, the failure was in 4.13. It was supposed that PSA was going to be enforced in 4.12 that seems to be delayed. I incorrectly assumed 4.12 was also affected.

Thanks

komish commented 1 year ago

I wanted to capture a quick recap of what we discovered while troubleshooting this.

File-Based Catalogs/Index Images do not appear to require additional privileges to run within OpenShift. Legacy SQLite-based catalogs/index images do, typically, because they need to write a database file to the filesystem on start up when executing opm registry serve, and their working directory is typically a location that requires elevated privileges. A workaround might be to create these registry images working in a temporary/writeable directory, but most of these images, I would say, are created using opm as a tool, which helps generate a Dockerfile.

Regardless, with the enforcement of PSA, this can cause issues with the older sqlite-based catalogs which seems to require the pod-security.kubernetes.io/enforce: privileged as @betoredhat points out.

The problem is that Preflight, as well as most users, are utilizing the CatalogSource CRD to create their catalogs in their cluster. CatalogSource CRDs generate their pod with the required spec and metadata. Newer "versions" of the CatalogSource CRD have a new spec item SecurityContextConfig which allows for the user to inform the required pod spec elements needed to run a sqlite-based catalog. https://github.com/operator-framework/api/blob/master/pkg/operators/v1alpha1/catalogsource_types.go#L123-L138

Older versions of the CatalogSource CRD do not seem to have this key, and so it stands to reason the corresponding controller would likely deploy the pod one way, always and forever.

We also observed that the default behavior for this SecurityContextConfig key was "restricted", which would mean that a user that does not set this field would run with in compliance with the Restricted pod security configuration. SQlite based catalogs would fail. At some point, however, this default was flipped so that "legacy" was the default, and so File-Based catalogs run with the elevated permissions (I assume) should a user forget to set this to restricted, and the default behavior is backwards compatible.

All of that to say that at some point, it's likely Preflight will need to account for this change given that the expectations are that it runs on a few versions older than 4.12. In addition, it seems like PSA enforcement doesn't seem to apply in the latest build of 4.12. We'll see if it lands at release, but at this moment, we don't need to make any changes to account for this in Preflight - but will need to review it in the near-ish future.

Thanks @betoredhat for catching it. Feel free to correct anything that I misunderstood. I'll edit this post should there be anything documented incorrectly.

acornett21 commented 1 year ago

@komish Great writeup, I'd just like to add one other thing I noticed.

If tekton operator is in the cluster, it updates all namespaces to be pod-security.kubernetes.io/enforce: baseline and with this Preflight and untimely the SQLite or File-Based Catalog catalog spin up properly. This is really strange to me on two accounts:

Nothing actionable on this for now, but things to consider when we know that the new PSA is going into an OpenShift version.

komish commented 1 year ago

I went ahead and pushed some of the code I was hacking through for the time being. Not to be used, but just as a reference point. https://github.com/komish/openshift-preflight/commit/a9d0b88b9d09c4e580ab2c5859a28124da533b9c