litmuschaos / litmus

Litmus helps SREs and developers practice chaos engineering in a Cloud-native way. Chaos experiments are published at the ChaosHub (https://hub.litmuschaos.io). Community notes are at https://hackmd.io/a4Zu_sH4TZGeih-xCimi3Q
https://litmuschaos.io
Apache License 2.0

Support OpenShift Installation (SCC compatibility) #3882

Open mtcolman opened 1 year ago

mtcolman commented 1 year ago

Hi, I'd like the Litmus install to be compatible with OpenShift SCCs. I have taken https://litmuschaos.github.io/litmus/3.0.0-beta2/litmus-3.0.0-beta2.yaml and modified it to be compatible and to follow good security practices; my new version is attached below. I've written up an explanation; hopefully it all makes sense.

Note: I think an even better way to make this OpenShift-compatible would be to not require the containers & initContainers to run as a specific UID/GID, so that they could work with the restricted SCC. That would involve making changes to the container images themselves (i.e. chown/chgrp directory & file permissions to ensure compatibility, etc.); happy to discuss that too if required.
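For reference, the usual pattern for making an image work under the restricted SCC's arbitrary UIDs is to make the needed paths writable by the root group at image-build time. A minimal sketch (the directory path and UID are placeholders, not the actual Litmus image layout):

```dockerfile
# Sketch only: common OpenShift pattern for supporting arbitrary UIDs.
# "/app-data" is a placeholder path, not the actual Litmus image layout.
# Files become group-owned by the root group (GID 0) and group-writable,
# so any UID the restricted SCC assigns (always with GID 0) can use them.
RUN chgrp -R 0 /app-data && \
    chmod -R g=u /app-data
# A numeric USER lets the runAsNonRoot check verify the image without /etc/passwd.
USER 1001
```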

I initially tried installing via oc apply -f https://litmuschaos.github.io/litmus/3.0.0-beta2/litmus-3.0.0-beta2.yaml and found that none of the Litmus pods are able to start up; only mongo does. Upon further investigation, this is due to the configuration's incompatibility with the default security context constraint, restricted.

This appears to be caused by the fact that the runAsUser securityContext is specified for the containers. I therefore need to grant the pods' service accounts permission to use the nonroot SCC. At this point I found that only the litmus-server has an SA created for it; the frontend and auth-server would default to using the default SA in the namespace, and adding permissions to that would be bad security practice. I've therefore created a new SA for each of the frontend and auth-server deployments.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: litmusportal-auth-server
  namespace: litmus
  ...
spec:
  ...
  template:
    ...
    spec:
      automountServiceAccountToken: false
      serviceAccountName: litmus-auth-server-account
      ...

and

apiVersion: apps/v1
kind: Deployment
metadata:
  name: litmusportal-frontend
  namespace: litmus
  ...
spec:
  ...
  template:
    ...
    spec:
      automountServiceAccountToken: false
      serviceAccountName: litmus-frontend-account
      ...

and

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: litmus-auth-server-account
  namespace: litmus
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: litmus-frontend-account
  namespace: litmus

I've then granted the serviceAccounts permission to use the nonroot SCC via:

oc adm policy add-scc-to-user nonroot -z litmus-auth-server-account
oc adm policy add-scc-to-user nonroot -z litmus-frontend-account
oc adm policy add-scc-to-user nonroot -z litmus-server-account

Having removed the installation (3 x deployments, 1 x statefulSet (mongo)) and reinstalled, I still hit problems with the litmus-server and litmus-auth-server pods starting up. I then discovered this was due to the litmuschaos/curl:3.0.0-beta2 image being used as an initContainer within both. The error given is: Error: container has runAsNonRoot and image has non-numeric user (curl_user), cannot verify user is non-root.
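For context, that error comes from the kubelet's runAsNonRoot check: with no explicit runAsUser set, it can only verify "non-root" when the image's USER is numeric, and here USER is the name curl_user. A minimal sketch of that decision (an illustration of the rule, not actual kubelet code):

```shell
# The image's USER directive, as in the litmuschaos/curl image.
image_user="curl_user"

# If the user is numeric, the kubelet can compare it against 0 directly;
# a named user would require the image's /etc/passwd, which it won't consult.
if printf '%s' "$image_user" | grep -Eq '^[0-9]+$'; then
  echo "numeric user: verifiable as non-root"
else
  echo "non-numeric user (${image_user}): cannot verify user is non-root"
fi
```

Setting a numeric runAsUser in the pod spec, as done below, sidesteps this check entirely.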

To overcome this I've had a look in the container to find the user and group assigned to the curl_user:

$ podman run -it --entrypoint=/bin/sh litmuschaos/curl:3.0-beta1
/ $ id
uid=100(curl_user) gid=101(curl_group) groups=101(curl_group)
/ $ cat /etc/passwd | grep curl
curl_user:x:100:101:Linux User,,,:/home/curl_user:/sbin/nologin

Using this, I have then updated the initContainer specifications in these deployments to include the appropriate securityContext:

    spec:
      initContainers:
        - name: wait-for-mongodb
          image: litmuschaos/curl:3.0.0-beta2
          securityContext:
            runAsUser: 100
            runAsGroup: 101
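A more fully hardened variant of this securityContext could look like the following sketch; the UID/GID come from the curl image above, while the additional fields are untested good-practice assumptions on my part, not something verified against this chart:

```yaml
# Sketch: hardened initContainer securityContext, still compatible with the
# nonroot SCC. UID/GID come from the curl image; the remaining fields are
# untested good-practice additions.
spec:
  initContainers:
    - name: wait-for-mongodb
      image: litmuschaos/curl:3.0.0-beta2
      securityContext:
        runAsUser: 100
        runAsGroup: 101
        runAsNonRoot: true
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
        seccompProfile:
          type: RuntimeDefault
```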

I am then able to deploy litmuschaos in a manner that all pods start up:

$ oc get pod
NAME                                       READY   STATUS    RESTARTS   AGE
litmusportal-auth-server-8b7494dcc-hj8qq   1/1     Running   0          77m
litmusportal-frontend-564bc999d6-zjvfw     1/1     Running   0          77m
litmusportal-server-ddcdbfb86-576zf        1/1     Running   0          77m
mongo-0                                    1/1     Running   0          77m

$ oc get pod -oyaml | grep "serviceAccount\:"
    serviceAccount: litmus-auth-server-account
    serviceAccount: litmus-frontend-account
    serviceAccount: litmus-server-account
    serviceAccount: default

$ oc get pod -oyaml | grep scc
      openshift.io/scc: nonroot
      openshift.io/scc: nonroot
      openshift.io/scc: nonroot
      openshift.io/scc: restricted

I've noted that mongo is using the default service account and the restricted SCC. For completeness, I've therefore created and assigned a new service account for it, which will still use the restricted SCC:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongo
  namespace: litmus
  ...
spec:
  ...
  template:
    ...
    spec:
      automountServiceAccountToken: false
      serviceAccountName: litmus-mongo-account
      ...
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: litmus-mongo-account
  namespace: litmus

Following a deployment, I then have:

$ oc get pod
NAME                                       READY   STATUS    RESTARTS   AGE
litmusportal-auth-server-8b7494dcc-69lqz   1/1     Running   0          116s
litmusportal-frontend-564bc999d6-pzrcx     1/1     Running   0          118s
litmusportal-server-ddcdbfb86-l89bl        1/1     Running   0          117s
mongo-0                                    1/1     Running   0          115s

$ oc get pod -oyaml | grep "serviceAccount\:"
    serviceAccount: litmus-auth-server-account
    serviceAccount: litmus-frontend-account
    serviceAccount: litmus-server-account
    serviceAccount: litmus-mongo-account

$ oc get pod -oyaml | grep scc
      openshift.io/scc: nonroot
      openshift.io/scc: nonroot
      openshift.io/scc: nonroot
      openshift.io/scc: restricted

With all this in place, I've been able to log in (screenshot attached).

I've not yet had a chance to test in further depth.

Here is the yaml file I have with my changes in it (had to convert to .txt to upload it to GitHub): litmus-3.0.0-beta2_MC.yaml.txt

Also: I've seen that the deployments which subsequently get created get stuck as well (see READY 0/1):

$ oc get deploy
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
chaos-exporter             0/1     0            0           35m
chaos-operator-ce          0/1     0            0           35m
event-tracker              0/1     0            0           35m
litmusportal-auth-server   1/1     1            1           38m
litmusportal-frontend      1/1     1            1           39m
litmusportal-server        1/1     1            1           38m
subscriber                 0/1     0            0           35m
workflow-controller        0/1     0            0           35m

To resolve this, I needed to grant the nonroot SCC to the following service accounts as well:

oc adm policy add-scc-to-user nonroot -z litmus-cluster-scope
oc adm policy add-scc-to-user nonroot -z litmus-namespace-scope
oc adm policy add-scc-to-user nonroot -z event-tracker-sa
oc adm policy add-scc-to-user nonroot -z litmus
oc adm policy add-scc-to-user nonroot -z argo
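The same grants can be generated in one pass. This sketch only prints the commands for review rather than running them, and assumes the litmus namespace and the SA names listed above:

```shell
# Generate (not execute) the nonroot SCC grants for the chaos-delegate SAs.
# Pipe the output to sh, or run each line, once reviewed.
for sa in litmus-cluster-scope litmus-namespace-scope event-tracker-sa litmus argo; do
  echo "oc adm policy add-scc-to-user nonroot -n litmus -z ${sa}"
done
```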
neelanjan00 commented 1 year ago

Hi @mtcolman, have you gone through this doc: https://litmuschaos.github.io/litmus/experiments/concepts/security/openshift-scc/

mtcolman commented 1 year ago

@neelanjan00 yes I have, and it's nothing to do with the above; that link explains how to set up an SCC for the privileged experiments (which comes after all of the above).

see-quick commented 1 year ago

As a user, I'd assume that when I deploy the operator using Helm I shouldn't have to perform these SCC steps manually. This is not user-friendly at all.

SrikanthSrinivasamurthy commented 11 months ago

Hi @mtcolman, we have one more issue with respect to OpenShift SCCs: the restricted-v2 annotation is added to the deployment's pods at runtime, which restricts securityContext.runAsUser:

    openshift.io/scc: restricted-v2
    seccomp.security.alpha.kubernetes.io/pod: runtime/default

While deploying litmus-3.0.2 into OpenShift I get the error below (screenshot attached).

The same happens with the frontend & auth-server deployments. This is the override file used:

portalScope: cluster

portal:
  frontend:
    automountServiceAccountToken: false
    securityContext: {}
      #runAsUser: 2000
      #allowPrivilegeEscalation: false
      #runAsNonRoot: true
  server:
    replicas: 1
    updateStrategy: {}
    serviceAccountName: litmus-server-account
    customLabels: {}
    waitForMongodb:
      securityContext: {}
        #runAsUser: 101
        #allowPrivilegeEscalation: false
        #runAsNonRoot: true
        #readOnlyRootFilesystem: true
    # my.company.com/tier: "backend"
    graphqlServer:
      securityContext: {}
        #runAsUser: 2000
        #allowPrivilegeEscalation: false
        #runAsNonRoot: true
        #readOnlyRootFilesystem: true

    authServer:
      securityContext:
        runAsUser: 2000
        allowPrivilegeEscalation: false
        runAsNonRoot: true
        readOnlyRootFilesystem: true
      automountServiceAccountToken: false
mongodb:
  enabled: true
  auth:
    enabled: true
    rootUser: "root"
    rootPassword: "1234"
    # -- existingSecret Existing secret with MongoDB(®) credentials (keys: `mongodb-passwords`, `mongodb-root-password`, `mongodb-metrics-password`, ` mongodb-replica-set-key`)
    existingSecret: ""
  architecture: replicaset
  replicaCount: 3
  persistence:
    enabled: false
  metrics:
    enabled: false
    prometheusRule:
      enabled: false
  podSecurityContext:
    enabled: false
  containerSecurityContext:
    enabled: false
  volumePermissions:
    enabled: false
    securityContext: {}
  tls:
    enabled: false

Since it's giving the above error I set securityContext: {}, but the frontend still gives a permission denied error for /etc/nginx/nginx.conf.

mtcolman commented 11 months ago

@SrikanthSrinivasamurthy this is likely due to the fact that access to /etc/nginx/nginx.conf has not been granted to the arbitrary UID, or to GID 0, which you'll be assigned when using the restricted/restricted-v2 SCC. So either: