We want to change the way we handle users, groups and permissions in our Docker images. This came up because we learned that the SecurityContextConstraint (SCC) we initially used was too lenient: it allowed root users. While investigating a solution, @razvan and I came up with the following plan for improving our user handling going forward.
Plan
Make the UID & GID configurable in our docker images ✅
Currently we hardcode the UID, GID and username in our Docker images.
One example:
groupadd --gid 1000 --system stackable
Using the new functionality to support global arguments in our bake process, we want to extract the UID, username and GID into arguments that can be changed easily.
For now these arguments will still default to the current values, even though that is not optimal and needs to be changed as well (see below for details). But because I don't know whether any operators make assumptions about the uid/gid (and fsGroup, which is not handled here), we decided to split this into two steps.
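As a sketch, the hardcoded values could become build arguments along these lines (the argument names STACKABLE_USER_UID, STACKABLE_USER_GID and STACKABLE_USER_NAME are placeholders, not the final names):

```dockerfile
# Placeholder argument names; defaults match today's hardcoded values.
ARG STACKABLE_USER_UID=1000
ARG STACKABLE_USER_GID=1000
ARG STACKABLE_USER_NAME=stackable

# Same effect as the current hardcoded groupadd/useradd calls.
RUN groupadd --gid "${STACKABLE_USER_GID}" --system "${STACKABLE_USER_NAME}" && \
    useradd --uid "${STACKABLE_USER_UID}" --gid "${STACKABLE_USER_GID}" --system "${STACKABLE_USER_NAME}"

# Numeric USER so Kubernetes can verify runAsNonRoot (see below).
USER ${STACKABLE_USER_UID}
```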
Prepare our Docker images to allow larger user ids ✅
There is a documented issue that causes large UIDs to not work in Docker images. It is not entirely clear what counts as a large UID, but as we definitely want to use one, we need to apply the workaround.
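Assuming the documented issue referred to is the well-known sparse-file problem (moby/moby#5419), where useradd with a large UID writes a huge /var/log/lastlog that Docker then materializes into the image layer, the workaround is the --no-log-init flag:

```dockerfile
# Sketch: --no-log-init stops useradd from touching lastlog/faillog,
# which would otherwise blow up the image size for very large UIDs.
RUN useradd --no-log-init --system --uid 1000740000 stackable
```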
This has been done for both our docker-images and operator-templating repositories, the latter of which applies to all operators as well.
Switch USER statements to numeric ✅
The USER statement in a Dockerfile ends up in an image's metadata:
This user is used as the default user when an image is started using plain Docker:
docker run -it --entrypoint bash docker.stackable.tech/stackable/druid:30.0.0-stackable0.0.0-dev
It is also the default when used as a plain Pod in Kubernetes:
kubectl run test --image=docker.stackable.tech/stackable/druid:30.0.0-stackable0.0.0-dev --rm=true --restart=Never --tty=true --stdin=true -- bash
In OpenShift this is what it looks like as an admin user (they are exempt from SCCs):
kubectl run test --image=docker.stackable.tech/stackable/hbase:2.6.0-stackable0.0.0-dev --rm=true --restart=Never --tty=true --stdin=true --namespace test -- id
Warning: would violate PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "test" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "test" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "test" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "test" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
uid=1000(stackable) gid=1000(stackable) groups=1000(stackable)
Here is the same command run as a non-admin user (note the use of a non-1000 ID means that we bypass the SCC warning):
oc run test --as developer --image=docker.stackable.tech/stackable/hbase:2.6.0-stackable0.0.0-dev --rm=true --restart=Never --tty=true --stdin=true --namespace test -- id
uid=1000740000(1000740000) gid=0(root) groups=0(root),1000740000
pod "test" deleted
If we - or someone else - want to enforce that a user is non-root using the securityContext.runAsNonRoot field, it will not work, as Kubernetes has no way of mapping the string stackable to a UID (it is not aware of the implementation details inside the container; it could call out to LDAP for all it knows). Therefore the combination of a non-numeric USER and runAsNonRoot is forbidden and results in an error along the lines of "container has runAsNonRoot and image has non-numeric user (stackable), cannot verify user is non-root".
Don't hardcode any user or group id in our operators 👷
Our operators currently hardcode the FSGroup, RunAsUser and RunAsGroup, and I believe this should be changed.
This is an extract from the NiFi operator today:
https://github.com/stackabletech/nifi-operator/blob/f46ee61e25c99a1703df945eeb3d326b25fb107f/rust/operator-binary/src/controller.rs#L1257-L1262
If we ever want to be able to move to the restricted-v2 SCC, we cannot hardcode any of these three settings.
RunAsUser
Vanilla Kubernetes: by not specifying this we default to the user id from our Dockerfiles, which would be fine.
OpenShift: by not specifying this, the uid depends on the SCC being used, but we can consider it an arbitrary number.
Removing it should be fine if we follow all previous best practices from this issue.
RunAsGroup
The same as for RunAsUser applies here: once all previous suggestions have been applied, simply removing this should work.
FSGroup
From the Kubernetes API reference (https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#podsecuritycontext-v1-core): a special supplemental group that applies to all containers in a pod. Some volume types allow the Kubelet to change the ownership of that volume to be owned by the pod:
The owning GID will be the FSGroup
The setgid bit is set (new files created in the volume will be owned by FSGroup)
The permission bits are OR'd with rw-rw----
If unset, the Kubelet will not modify the ownership and permissions of any volume. Note that this field cannot be set when spec.os.name is windows.
Currently, we hardcode the fsGroup to 1000 which works fine as all mounted volumes are then owned by that group and our pod user is also automatically added to that group. Some volume types allow us to change ownership (e.g. using defaultMode) but ephemeral volumes do not have this option.
This affects a few things we mount (e.g. TLS and Kerberos secrets from Secret Operator) and if we do not set a fsGroup these mounts will be owned by root:root and the mode will be such that our container user doesn't even have read access.
Therefore we do NEED to set an fsGroup for this to work.
BUT this will not work with the restricted-v2 SCC from OpenShift, as that is set to RunAsRange for the fsGroup. While there are defaults for that range, we can't rely on them, and if we pick a fsGroup outside of the range, deployment of the Pod will fail.
Our options therefore are:
Set a fsGroup and it'll work on vanilla Kubernetes AND on OpenShift with SCCs that are using RunAsAny for fsGroup
Do NOT set a fsGroup and it'll work on OpenShift with SCCs that set RunAsRange, as a fsGroup will be assigned automatically. It will however NOT work with any SCCs that use RunAsAny, and it will also NOT work on vanilla Kubernetes, as no fsGroup means we have no read access to the mounts.
We were also thinking about adding a webhook, but webhooks are not yet supported by OLMv1, so that doesn't help us either.
For now I therefore suggest we keep hardcoding an fsGroup; anyone wanting to run on restricted-v2 will need to use a podOverride. In a later version we can then make it a flag in the CRD, and we should also document this.
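Such a podOverride could look roughly like this (a sketch: the exact nesting depends on the role it is applied to, and 1000740000 is only an example value that would have to lie inside the namespace's assigned range):

```yaml
# Hypothetical podOverride for running under restricted-v2: replace our
# hardcoded fsGroup with a value inside the namespace's
# openshift.io/sa.scc.uid-range.
podOverrides:
  spec:
    securityContext:
      fsGroup: 1000740000
```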
Another option is the following:
Detect if SCC API is available ("Am I running on OpenShift?")
If yes: Annotate with the annotation that says "require" restricted-v2
This should be safe as this is the SCC that every user has access to by default
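Assuming the annotation meant here is openshift.io/required-scc (an assumption on my part, not stated above), this would look roughly like:

```yaml
# Hypothetical: only added to the Pod template when the SCC API was
# detected, i.e. when we are running on OpenShift.
metadata:
  annotations:
    openshift.io/required-scc: restricted-v2
```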
[ ] Remove hardcoded RunAsUser & RunAsGroup from all operators, this should not be needed and should be controlled using the image instead. At the same time change the hardcoded fsGroup to a higher random one as well (similar to what we do for UID/GID in docker images)
[x] Investigate what needs to be done about FSGroup, especially for Listener & Secret operator
[ ] Decide on a way forward with fsGroup
Use a different UID than 1000 in our docker images 👷
Using a hardcoded uid for our stackable user is a good idea in theory; in practice, the id 1000 should be avoided.
This is because the users from Docker containers are mapped to users on the underlying host OS. Some OSes start "real" user ids at 1000 (or 500) and reserve everything before that to "system" users. User 1000 therefore has a good chance of being mapped to a real user that exists on the underlying system which should be avoided as this "host user" might have access to things that the container users should not have access to.
The easiest way of doing so is to pick an arbitrary, (more or less) large number to statically use in our Dockerfiles.
This is exactly what OpenShift does by default. It picks a "random" UID from a range of UIDs (in reality it picks the first one from a range; see the MustRunAsRange attribute in an SCC or the openshift.io/sa.scc.uid-range annotation on a namespace). The UID is larger than 1,000,000,000 by default.
This step will require changes to operators as well as they hardcode assumptions about the user id. Therefore, we should probably tackle the products one-by-one. See the previous step.
The operators themselves (not the products they manage) have already been updated as of SDP 24.11 to run as a different user.
Change ownership of anything belonging to stackable user ✅
The users of our image might want to run the image with a different user than the stackable one we create. This can - for example - happen when the SCC restricted-v2 is being used which will select a "random" user.
For this to work, these users need to be able to access all the files that the stackable user also has access to. This can be achieved by changing the group of all files and folders to 0 as every container user will always belong to the root group (0).
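We need to do something to the effect of the following at the very end of all our Dockerfiles; no new files or folders should be created after this step. This is also the pattern Red Hat recommends (a sketch: /stackable stands in for the actual directories owned by the stackable user):

```dockerfile
# Final step: give the root group (GID 0) the same permissions the
# stackable user has, so an arbitrary UID (which is always in group 0)
# can still access everything. /stackable is an example path.
RUN chgrp -R 0 /stackable && \
    chmod -R g=u /stackable
```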
Investigate better defaults for our securityContext
We want to investigate if we should/could make any changes to the Pods (directly or indirectly) we write:
Set securityContext.runAsNonRoot to true
Look into securityContext.supplementalGroupsPolicy and set it to Strict (https://kubernetes.io/blog/2024/08/22/fine-grained-supplementalgroups-control/)
Move to restricted-v2
As soon as all previous items are finished, we should be able to move to the restricted-v2 SCC.
[ ] Change default SCC to restricted-v2 for OLM
[ ] Change default SCC to restricted-v2 for Helm packages
TODO/Research: ServiceAccount handling
Once everything else is done, we should check our usage and handling of ServiceAccounts, and whether we need a custom one or can use one of the default ones (builder, deployer, default in OpenShift, I believe).
Resources
These might contain more "best practices" or things to consider. Once all of the above is done we should go through these again and check if we handled everything.