Open matttrach opened 12 months ago
The approach on this one will be to enable immutable infrastructure:
focus on RHEL as first hardened OS
The CIS Benchmarks appear to be the standard for achieving a hardened OS. CIS also provides custom AMIs on AWS that are pre-configured for their benchmarks. The STIG benchmark for RHEL is the one we should use for servers. There is also a distribution-independent benchmark that we might use for other server types; it contains multiple levels of suggestions, so look for the "server - level 2" suggestions.
To harden RKE2 on RHEL 8, we should be able to get by with applying the CIS sysctl config as follows, adding a user for etcd, and setting the profile flag in the config.
small script to enable cis conf:
sudo cp -f /usr/share/rke2/rke2-cis-sysctl.conf /etc/sysctl.d/60-rke2-cis.conf && \
sudo systemctl restart systemd-sysctl && \
sudo useradd -r -c "etcd user" -s /sbin/nologin -M etcd -U
example cis profile enabled rke2 config:
write-kubeconfig-mode: 644
cni: calico
cloud-provider-name: "aws"
profile: "cis-1.23"
selinux: true
This requires enabling an extra config on top of what is necessary for clustering, and adding the ability to inject a script that preps the OS for running rke2 after install but before first start.
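A rough sketch of what such a prep script could look like, reusing the commands from the script above. The function names (`install_cis_sysctl`, `ensure_etcd_user`, `prep_os`) and the `CIS_SRC`/`CIS_DST` variables are made up for illustration and are not part of any of the modules. It assumes it runs as root and only adds checks so that re-running it is harmless.

```shell
#!/bin/sh
# Sketch only: names below are illustrative, not taken from the modules.
set -eu

# Source and destination of the CIS sysctl settings (paths from the script above).
CIS_SRC="${CIS_SRC:-/usr/share/rke2/rke2-cis-sysctl.conf}"
CIS_DST="${CIS_DST:-/etc/sysctl.d/60-rke2-cis.conf}"

install_cis_sysctl() {
  # Fail early if the rke2 install did not place the file where we expect it.
  if [ ! -f "$CIS_SRC" ]; then
    echo "missing $CIS_SRC; is rke2 installed?" >&2
    return 1
  fi
  cp -f "$CIS_SRC" "$CIS_DST"
}

ensure_etcd_user() {
  # Only create the etcd user if it does not exist yet, so re-runs are safe.
  if ! id etcd >/dev/null 2>&1; then
    useradd -r -c "etcd user" -s /sbin/nologin -M etcd -U
  fi
}

prep_os() {
  install_cis_sysctl
  systemctl restart systemd-sysctl
  ensure_etcd_user
}
```

Calling `prep_os` once, after the rke2 install and before the first `systemctl start rke2-server`, would cover the OS side; the `profile` flag still has to be set in the rke2 config.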
Enabling the RHEL8 STIG AMI: https://github.com/rancher/terraform-aws-server/pull/20
The changes there will need to be propagated to the install and rke2 modules and their examples. Then we should be able to inject a script to install the selinux policies before starting rke2.
Propagate CIS to install module with example cis configuration: https://github.com/rancher/terraform-null-rke2-install/pull/51
I am currently working on adding a local repo to the server to enable air-gapped rpm installs with selinux enforcing on the CIS AMI.
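For reference, a local repo definition on the server could look something like the sketch below. The repo id `rke2-local`, the function name `write_local_repo`, and any directory passed to it are placeholders, not what the modules actually use. The repo metadata would still have to be generated separately (for example with `createrepo_c`), and `gpgcheck=0` is only there to keep the sketch short; a hardened setup would more likely import the signing key and keep GPG checks on.

```shell
#!/bin/sh
# Sketch only: writes a dnf/yum repo definition pointing at a local directory.
# $1: directory holding the RPMs (hypothetical path chosen by the caller)
# $2: path of the .repo file to write
write_local_repo() {
  rpm_dir="$1"
  repo_file="$2"
  cat > "$repo_file" <<EOF
[rke2-local]
name=Local RKE2 RPMs
baseurl=file://${rpm_dir}
enabled=1
gpgcheck=0
EOF
}
```

With the file placed under `/etc/yum.repos.d/`, installs can then be limited to that repo with dnf's `--disablerepo='*' --enablerepo=rke2-local` options, so nothing tries to reach the network.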
system-default-registry
option to configure custom image repo
the latest changes to aws-rke2 module include:
Next up:
Prioritizing by difficulty/time consumption:
Unfortunately, these are not small items; it will take me some time to get them figured out.
In the meantime, here is a repo showing how to get everything else running: https://github.com/rancher/terraform-aws-rke2-live-example
This has a full IAC of an RKE2 node with an airgapped server that you can only access via the AWS serial console. It deploys a "prototype" server, which has access to download the things it needs before shutting down and getting turned into an image. The production server is then deployed using that image and an updated config to set the proper IP addresses and join token. The repo is set up to be fully IAC, meaning that users manage their infrastructure like code artifacts in a repo, and it has CI to test and automatically deploy infrastructure. Secrets are encrypted and the encryption is automatically rotated weekly. Each user has their own key to decrypt the secrets, and one exists for the CI that is not viewable without a code change.
State is stored encrypted in the repo, as well as all of the access necessary for the CI to deploy. The CI is the public GitHub runner and is completely free (3k min for a private repo, unlimited for public; in my experience it is pretty hard to reach that 3k min using just one repo). Users don't need in-depth (or any) knowledge of Terraform to use the example, but maintainers will need to understand what they are looking at to make educated changes.
CI access is created before every run and destroyed at the end, making it very limited. CI never has access to production servers (they don't have public IP addresses).
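The create-then-destroy pattern for CI access can be illustrated with a small wrapper. This is not how the example repo implements it; `create_access`, `destroy_access`, `with_temp_access`, and the marker file are hypothetical stand-ins for whatever actually grants and revokes the CI's credentials.

```shell
#!/bin/sh
# Sketch only: a marker file stands in for real, short-lived CI credentials.
ACCESS_MARKER="${ACCESS_MARKER:-/tmp/ci-access-granted}"

create_access()  { : > "$ACCESS_MARKER"; }    # placeholder for granting access
destroy_access() { rm -f "$ACCESS_MARKER"; }  # placeholder for revoking access

with_temp_access() {
  create_access
  status=0
  # Run the wrapped command and remember its exit code instead of aborting,
  # so access is revoked whether the command succeeds or fails.
  "$@" || status=$?
  destroy_access
  return "$status"
}
```

Something like `with_temp_access ./deploy.sh` (script name hypothetical) would then hold access only for the duration of the run. A real version would also need to handle the job being killed mid-run, which this sketch does not.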
I am going to move this issue to our backlog as I don't have a clear timeline.
This now aligns with https://github.com/rancher/rke2/issues/5541. I will make sure to update both so everyone is on the same page, but that issue will have the most up-to-date information. I expect to implement items from there into the example repo, and I will add a summary here when I do.
Dual-stack and SLE Micro are being propagated through the system; the next challenge is the embedded registry.
This tracks progress on satisfying a hardened RKE2 use case.
We will need to harden the OS
We will need to follow the hardening guide for RKE2: https://docs.rke2.io/security/hardening_guide