davidnuzik closed this 3 years ago
Validated that Rancher v2.6.0-rc8 deploys correctly on RKE2 using RHEL 8.4. Charts work fine, downstream clusters deploy and upgrade correctly, and automated checks are successful.
Logging does not work for me. I had to patch the rancher-logging Logging resource.
@frjaraur Hi, thank you for letting us know. Will you please open an issue in https://github.com/rancher/rancher with the details on that?
Sure! I will try to explain all the steps and configurations I used (including some workarounds) and the issues I found.
My RKE2 cluster is built on top of Red Hat 8.4 (customized with CIS security settings). SELinux is enabled in Enforcing mode.
$ rke2 --version
rke2 version v1.21.3+rke2r1 (2ed0b0d1b6924af4414393cd1796c174a1ff5352)
go version go1.16.6b7
>> kubectl get nodes
NAME          STATUS   ROLES                       AGE   VERSION
whatevercc1   Ready    control-plane,etcd,master   28d   v1.21.3+rke2r1
whatevercc2   Ready    control-plane,etcd,master   28d   v1.21.3+rke2r1
whatevercc3   Ready    control-plane,etcd,master   28d   v1.21.3+rke2r1
whateverwc1   Ready    ingress,rancher,worker      28d   v1.21.3+rke2r1
whateverwc2   Ready    ingress,rancher,worker      28d   v1.21.3+rke2r1
whateverwc3   Ready    ingress,rancher,worker      28d   v1.21.3+rke2r1
whateverwc4   Ready    worker                      27d   v1.21.3+rke2r1
My RKE2 settings:
# cat /etc/rancher/rke2/config.yaml
write-kubeconfig-mode: "0600"
profile: "cis-1.5"
selinux: true
disable-cloud-controller: true
token: "WHATEVERTOKEN"
tls-san:
- rke2c.whatever
cluster-cidr: "10.42.0.0/16"
service-cidr: "10.43.0.0/16"
cluster-dns: "10.43.0.10"
cluster-domain: "rke2.secure"
node-taint:
- "node-role.kubernetes.io/master=true:NoSchedule"
Therefore, the CIS 1.5 profile for RKE2 is applied, and a number of PSPs (PodSecurityPolicies) are enforced by default.
I first tried deploying the logging charts from the latest Rancher 2.5 release catalog and found issues with the PSP settings (reported to banzaicloud/logging-operator here: https://github.com/banzaicloud/logging-operator/issues/830). I tried different combinations of the logging deployment "values" configuration in the Rancher GUI, but finally moved to the Helm charts provided by banzaicloud. The problem I found is that the fluentd deployment inherits neither the Security Context nor the Pod Security Context configuration when logging-operator-logging is applied. After some research and several attempts, I was able to make it work by patching the Logging definition that was created:
spec:
  fluentd:
    fluentOutLogrotate:
      enabled: false
    security:
      podSecurityPolicyCreate: true
      podSecurityContext:
        runAsNonRoot: true
        runAsUser: 1000
This patch solves two issues:
In the meantime, I upgraded to Rancher 2.6 and tried everything again. I was in the same situation, but I managed to get Rancher's catalog Logging working using the following values file (after several attempts combining Pod Security Context and Security Context). Even so, the patch still has to be applied to the "logging" resource (in this case rancher-logging). This is the full values file applied to the Rancher catalog Logging deployment:
# Values file used for Rancher-Logging deployment.
additionalLoggingSources:
  aks:
    enabled: false
  eks:
    enabled: false
  gke:
    enabled: false
  k3s:
    container_engine: systemd
    enabled: false
    stripUnderscores: false
  kubeAudit:
    auditFilename: ''
    enabled: false
    fluentbit:
      logTag: kube-audit
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/controlplane
          value: 'true'
        - effect: NoExecute
          key: node-role.kubernetes.io/etcd
          value: 'true'
    pathPrefix: ''
  rke:
    enabled: false
    fluentbit:
      log_level: info
      mem_buffer_limit: 5MB
  rke2:
    enabled: true
    stripUnderscores: false
affinity: {}
annotations: {}
createCustomResource: true
disablePvc: true
extraArgs:
  - '-enable-leader-election=true'
fluentbit:
  inputTail:
    Buffer_Chunk_Size: ''
    Buffer_Max_Size: ''
    Mem_Buf_Limit: ''
    Multiline_Flush: ''
    Skip_Long_Lines: ''
  resources: {}
  tolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/controlplane
      value: 'true'
    - effect: NoExecute
      key: node-role.kubernetes.io/etcd
      value: 'true'
  security:
    podSecurityPolicyCreate: true
    podSecurityContext:
      runAsNonRoot: true
      runAsUser: 1000
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
fluentd:
  bufferStorageVolume: {}
  livenessProbe:
    initialDelaySeconds: 30
    periodSeconds: 15
    tcpSocket:
      port: 24240
  nodeSelector: {}
  resources: {}
  tolerations: {}
  fluentOutLogrotate:
    enabled: false
  security:
    podSecurityPolicyCreate: true
    podSecurityContext:
      runAsNonRoot: true
      runAsUser: 1000
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
fullnameOverride: ''
global:
  cattle:
    systemDefaultRegistry: ''
  dockerRootDirectory: ''
  psp:
    enabled: false
  rkeWindowsPathPrefix: c:\
  seLinux:
    enabled: true
http:
  port: 8080
  service:
    annotations: {}
    clusterIP: None
    labels: {}
    type: ClusterIP
image:
  pullPolicy: IfNotPresent
  repository: rancher/mirrored-banzaicloud-logging-operator
  tag: 3.12.0
imagePullSecrets: []
images:
  config_reloader:
    repository: rancher/mirrored-jimmidyson-configmap-reload
    tag: v0.4.0
  fluentbit:
    repository: rancher/mirrored-fluent-fluent-bit
    tag: 1.7.9
  fluentbit_debug:
    repository: rancher/mirrored-fluent-fluent-bit
    tag: 1.7.9-debug
  fluentd:
    repository: rancher/mirrored-banzaicloud-fluentd
    tag: v1.12.4-alpine-1
  nodeagent_fluentbit:
    os: windows
    repository: rancher/fluent-bit
    tag: 1.7.4
monitoring:
  serviceMonitor:
    additionalLabels: {}
    enabled: false
    metricRelabelings: []
    relabelings: []
nameOverride: ''
namespaceOverride: ''
nodeAgents:
  tls:
    enabled: false
nodeSelector:
  kubernetes.io/os: linux
podLabels: {}
podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1000
priorityClassName: {}
rbac:
  enabled: true
  psp:
    annotations:
      seccomp.security.alpha.kubernetes.io/allowedProfileNames: docker/default,runtime/default
      seccomp.security.alpha.kubernetes.io/defaultProfileName: runtime/default
    enabled: true
replicaCount: 1
resources: {}
securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: false
  capabilities:
    drop: ["ALL"]
  runAsNonRoot: true
  runAsUser: 1000
systemdLogPath: /run/log/journal
tolerations:
  - effect: NoSchedule
    key: cattle.io/os
    operator: Equal
    value: linux
Notice that the SELinux configuration is applied (rancher-selinux is also installed) and PodSecurityContext and SecurityContext are configured but, as I said, they are not actually applied (neither to the fluentd dm nor to the sts). The funny thing here is that rke2-logging is not deployed even though it is configured (it gets deployed with the default values, but of course it does not work because of the PSPs).
I patched rancher-logging with the security settings:
kubectl patch logging rancher-logging --patch "$(cat rancher-logging.patch.yaml)" --type=merge
This is the complete patch file:
spec:
  fluentbit:
    security:
      serviceAccount: rancher-logging
  fluentd:
    fluentOutLogrotate:
      enabled: false
    security:
      podSecurityPolicyCreate: true
      podSecurityContext:
        runAsNonRoot: true
        runAsUser: 1000
      serviceAccount: rancher-logging
Notice that I added the "rancher-logging" Service Account. Since rke2-logging wasn't deployed, I prepared my own Logging resource (using customized values from another RKE cluster's settings; the paths change, of course). It is just a simple audit-logging Logging resource that reads /var/lib/rancher/rke2/server/logs/audit.log, only to collect this log. This is the content of my audit-logging-rke2.logging.yaml:
apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  labels:
    app.kubernetes.io/name: audit-logging
  name: audit-logging-rke2
spec:
  controlNamespace: logging
  fluentbit:
    security:
      podSecurityPolicyCreate: true
      serviceAccount: rancher-logging
    extraVolumeMounts:
    - destination: /var/lib/rancher/rke2/server/logs
      readOnly: true
      source: /var/lib/rancher/rke2/server/logs
    image:
      repository: rancher/fluent-fluent-bit
      tag: 1.6.4
    inputTail:
      Parser: json
      Path: /var/lib/rancher/rke2/server/logs/audit.log
      Tag: rke2
    nodeSelector:
      kubernetes.io/os: linux
    tolerations:
    - effect: NoSchedule
      key: cattle.io/os
      operator: Equal
      value: linux
    - effect: NoSchedule
      key: node-role.kubernetes.io/controlplane
      value: "true"
    - effect: NoExecute
      key: node-role.kubernetes.io/etcd
      value: "true"
  fluentd:
    fluentOutLogrotate:
      enabled: false
    security:
      podSecurityPolicyCreate: true
      podSecurityContext:
        runAsNonRoot: true
        runAsUser: 1000
      serviceAccount: rancher-logging
    configReloaderImage:
      repository: rancher/jimmidyson-configmap-reload
      tag: v0.2.2
    disablePvc: true
    image:
      repository: rancher/banzaicloud-fluentd
      tag: v1.11.5-alpine-1
    nodeSelector:
      kubernetes.io/os: linux
    tolerations:
    - effect: NoSchedule
      key: cattle.io/os
      operator: Equal
      value: linux
Because PSPs are enforced and hostPath volumes are not allowed by default, I tried adding the rancher-logging service account, but I am still stuck with the following event:
13m Warning FailedCreate daemonset/rancher-logging-rke2-journald-aggregator Error creating: pods "rancher-logging-rke2-journald-aggregator-" is forbidden: PodSecurityPolicy: unable to admit pod: [spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used]
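One possible workaround, sketched here under assumptions and not tested on this cluster: the event rejects exactly three hostPath volumes, and the DaemonSet pods run as the service account rancher-logging-rke2-journald-aggregator. A dedicated PodSecurityPolicy that only allows those three paths read-only, plus a Role granting `use` on it and a RoleBinding for that service account, should let the pods be admitted. The PSP, Role and RoleBinding names are hypothetical; `runAsUser: RunAsAny` is an assumption (reading the journal may require root) and is looser than CIS may want.

```yaml
# Hypothetical, narrowly scoped PSP for the rke2 journald aggregator.
# Paths are copied from the DaemonSet's hostPath volumes; names are illustrative.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: rke2-journald-aggregator          # hypothetical name
spec:
  privileged: false
  allowPrivilegeEscalation: false
  seLinux:
    rule: RunAsAny                        # container requests seLinuxOptions type rke_logreader_t
  runAsUser:
    rule: RunAsAny                        # assumption: journal access may need root
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
  - configMap
  - hostPath
  allowedHostPaths:                       # only the mounted paths, and only read-only
  - pathPrefix: /run/log/journal
    readOnly: true
  - pathPrefix: /var/lib/rancher/rke2/agent/logs/kubelet.log
    readOnly: true
  - pathPrefix: /etc/machine-id
    readOnly: true
---
# Allow "use" of that PSP inside the logging namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: rke2-journald-aggregator-psp      # hypothetical name
  namespace: cattle-logging-system
rules:
- apiGroups: ["policy"]
  resources: ["podsecuritypolicies"]
  resourceNames: ["rke2-journald-aggregator"]
  verbs: ["use"]
---
# Bind the Role to the service account the aggregator pods run as.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: rke2-journald-aggregator-psp      # hypothetical name
  namespace: cattle-logging-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rke2-journald-aggregator-psp
subjects:
- kind: ServiceAccount
  name: rancher-logging-rke2-journald-aggregator
  namespace: cattle-logging-system
```

Because every matching volumeMount in the DaemonSet is already `readOnly: true`, the `readOnly` restriction on `allowedHostPaths` should not reject the pods. The DaemonSet controller retries pod creation on its own once a usable PSP is authorized.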
I found that my settings are not applied:
kubectl get ds rancher-logging-rke2-journald-aggregator -o yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "1"
    meta.helm.sh/release-name: rancher-logging
    meta.helm.sh/release-namespace: cattle-logging-system
  creationTimestamp: "2021-09-15T12:41:35Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
  name: rancher-logging-rke2-journald-aggregator
  namespace: cattle-logging-system
  resourceVersion: "16840954"
  uid: 2266df4c-6ed2-4f57-866b-7171446243e2
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: rancher-logging-rke2-journald-aggregator
  template:
    metadata:
      annotations:
        checksum/config: 2f9f5c4dd58a8c52ea3331479642e88da00b897d93b00a91e449ac8bb0895c7c
      creationTimestamp: null
      labels:
        name: rancher-logging-rke2-journald-aggregator
      name: rancher-logging-rke2-journald-aggregator
      namespace: cattle-logging-system
    spec:
      containers:
      - image: rancher/mirrored-fluent-fluent-bit:1.7.9
        imagePullPolicy: IfNotPresent
        name: fluentbit
        resources: {}
        securityContext:
          seLinuxOptions:
            type: rke_logreader_t
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /fluent-bit/etc/
          name: config
        - mountPath: /run/log/journal
          name: journal
          readOnly: true
        - mountPath: /var/lib/rancher/rke2/agent/logs/kubelet.log
          name: kubelet
          readOnly: true
        - mountPath: /etc/machine-id
          name: machine-id
          readOnly: true
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: rancher-logging-rke2-journald-aggregator
      serviceAccountName: rancher-logging-rke2-journald-aggregator
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: cattle.io/os
        operator: Equal
        value: linux
      volumes:
      - configMap:
          defaultMode: 420
          name: rancher-logging-rke2
        name: config
      - hostPath:
          path: /run/log/journal
          type: ""
        name: journal
      - hostPath:
          path: /var/lib/rancher/rke2/agent/logs/kubelet.log
          type: ""
        name: kubelet
      - hostPath:
          path: /etc/machine-id
          type: ""
        name: machine-id
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 0
  desiredNumberScheduled: 0
  numberMisscheduled: 0
  numberReady: 0
Neither the Security Context nor the Service Account is configured with my settings. I am not sure whether my values are invalid or the operator just does not use them. And the thing is that rancher-logging-rke2-journald-aggregator does not have any PSP associated, but rancher-logging does, with the right hostPath permissions:
📙 PSP rancher-logging-fluentd
└── 📓 Role cattle-logging-system/rancher-logging-fluentd-psp
    └── 📓 RoleBinding cattle-logging-system/rancher-logging-fluentd-psp
        └── 📗 Subject{Kind: ServiceAccount, Name: rancher-logging, Namespace: cattle-logging-system}
>> kubectl get psp rancher-logging-fluentd
Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
NAME                      PRIV    CAPS   SELINUX    RUNASUSER   FSGROUP     SUPGROUP    READONLYROOTFS   VOLUMES
rancher-logging-fluentd   false          RunAsAny   MustRunAs   MustRunAs   MustRunAs   false            configMap,emptyDir,secret,hostPath,persistentVolumeClaim
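Since the only gap visible in the tree output is the missing binding, a quicker experiment could be to reuse the existing Role for the aggregator's service account. The RoleBinding name below is hypothetical, and this is a sketch rather than a confirmed fix: the rancher-logging-fluentd PSP uses MustRunAs for RUNASUSER, FSGROUP and SUPGROUP, so the admission controller may assign the aggregator a non-root UID that cannot read the journal, and `kubectl get psp` does not show whether `allowedHostPaths` further restricts which host paths are accepted.

```yaml
# Hypothetical extra RoleBinding: reuse the Role that already grants "use"
# on the rancher-logging-fluentd PSP, this time for the journald aggregator.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: rancher-logging-rke2-journald-aggregator-psp   # hypothetical name
  namespace: cattle-logging-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rancher-logging-fluentd-psp    # existing Role from the tree output
subjects:
- kind: ServiceAccount
  name: rancher-logging-rke2-journald-aggregator
  namespace: cattle-logging-system
```

If the pods are then admitted but fluent-bit cannot read /run/log/journal, that would point to the UID assigned by MustRunAs rather than to the hostPath restriction.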
And this is where I am stuck now... sorry for this long and probably chaotic description of steps and workarounds.
I would love to make everything work with Rancher's catalog Logging, but given the patches needed and the issues found, I think it is not really prepared for secure environments.
If I finally make it work, I will write a guide and try to fix/PR the patches and workarounds I used.
Let me know if there is something you want me to try.
@frjaraur can you move that info into a new issue on the rancher/rancher project? It doesn't belong on this QA validation issue.
Hi, I opened this issue 24 days ago :|, with no luck: rancher - #343871. I will add all my research and workarounds there over the next few days.