rancher / rke2

https://docs.rke2.io/
Apache License 2.0

Validate Rancher Server 2.6 RC on RHEL7.x/8.x RKE2 HA cluster #995

Closed davidnuzik closed 3 years ago

davidnuzik commented 3 years ago
rancher-max commented 3 years ago

Validated rancher v2.6.0-rc8 deploys correctly on rke2 using rhel 8.4. Charts work fine, downstream clusters deploy and upgrade correctly, and automated checks are successful.

frjaraur commented 3 years ago

Logging does not work for me. I had to patch the rancher-logging Logging resource.

rancher-max commented 3 years ago

@frjaraur Hi, thank you for letting us know. Will you please open an issue in https://github.com/rancher/rancher with the details on that?

frjaraur commented 3 years ago

Sure! I will try to explain all the steps and configurations I applied (with some workarounds) and the issues I found.

My RKE2 cluster is built on top of Red Hat 8.4 (customized with CIS security settings). SELinux is enabled in Enforcing mode.

$ rke2 --version
rke2 version v1.21.3+rke2r1 (2ed0b0d1b6924af4414393cd1796c174a1ff5352)
go version go1.16.6b7

>> kubectl get nodes
NAME            STATUS   ROLES                       AGE   VERSION
whatevercc1   Ready    control-plane,etcd,master   28d   v1.21.3+rke2r1
whatevercc2   Ready    control-plane,etcd,master   28d   v1.21.3+rke2r1
whatevercc3   Ready    control-plane,etcd,master   28d   v1.21.3+rke2r1
whateverwc1   Ready    ingress,rancher,worker      28d   v1.21.3+rke2r1
whateverwc2   Ready    ingress,rancher,worker      28d   v1.21.3+rke2r1
whateverwc3   Ready    ingress,rancher,worker      28d   v1.21.3+rke2r1
whateverwc4   Ready    worker                      27d   v1.21.3+rke2r1

My RKE2 settings:

# cat /etc/rancher/rke2/config.yaml
write-kubeconfig-mode: "0600"
profile: "cis-1.5"
selinux: true
disable-cloud-controller: true
token: "WHATEVERTOKEN"
tls-san:
- rke2c.whatever
cluster-cidr: "10.42.0.0/16"
service-cidr: "10.43.0.0/16"
cluster-dns: "10.43.0.10"
cluster-domain: "rke2.secure"
node-taint:
- "node-role.kubernetes.io/master=true:NoSchedule"

Therefore the CIS 1.5 profile for RKE2 is applied, and a set of PodSecurityPolicies is created by default.
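
For reference, the policies the CIS profile sets up can be listed directly (the exact names depend on the RKE2 version; on hardened clusters this typically includes a restrictive global PSP such as global-restricted-psp):

kubectl get psp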

I first tried deploying the logging charts using the latest Rancher 2.5 release catalog and ran into issues with the PSP settings (reported to banzaicloud/logging-operator here: https://github.com/banzaicloud/logging-operator/issues/830). I tried different combinations of "values" for the logging deployment in the Rancher GUI, but finally moved to the Helm charts provided by Banzai Cloud. The problem I found is that the fluentd deployment inherits neither the Security Context nor the Pod Security Context configuration when logging-operator-logging is applied. After some research and attempts, I was able to make it work by patching the generated Logging definition:

spec:
  fluentd:
    fluentOutLogrotate:
      enabled: false
    security:
      podSecurityPolicyCreate: true
      podSecurityContext:
        runAsNonRoot: true
        runAsUser: 1000
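
A minimal way to apply that snippet (assuming it is saved as logging-security.patch.yaml and the Logging resource created by the Banzai Cloud chart is named logging-operator-logging, as referenced above; adjust the names to your deployment):

kubectl patch logging logging-operator-logging --type=merge \
  --patch "$(cat logging-security.patch.yaml)"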

This patch solves two issues: it forces the fluentd pods to run as a non-root user (which the restrictive PSPs expect) and it disables the fluentd-out log rotation sidecar.

In the meantime, I upgraded to Rancher 2.6 and tried everything again. I was in the same situation, but I managed to deploy Logging from Rancher's catalog using the following values file (after some attempts combining Pod Security Context and Security Context). Either way, the patch still has to be applied to the "logging" resource (in this case rancher-logging). This is the full values file applied to Rancher's catalog Logging deployment:

# Values file used for Rancher-Logging deployment.

additionalLoggingSources:
  aks:
    enabled: false
  eks:
    enabled: false
  gke:
    enabled: false
  k3s:
    container_engine: systemd
    enabled: false
    stripUnderscores: false
  kubeAudit:
    auditFilename: ''
    enabled: false
    fluentbit:
      logTag: kube-audit
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/controlplane
          value: 'true'
        - effect: NoExecute
          key: node-role.kubernetes.io/etcd
          value: 'true'
    pathPrefix: ''
  rke:
    enabled: false
    fluentbit:
      log_level: info
      mem_buffer_limit: 5MB
  rke2:
    enabled: true
    stripUnderscores: false
affinity: {}
annotations: {}
createCustomResource: true
disablePvc: true
extraArgs:
  - '-enable-leader-election=true'
fluentbit:
  inputTail:
    Buffer_Chunk_Size: ''
    Buffer_Max_Size: ''
    Mem_Buf_Limit: ''
    Multiline_Flush: ''
    Skip_Long_Lines: ''
  resources: {}
  tolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/controlplane
      value: 'true'
    - effect: NoExecute
      key: node-role.kubernetes.io/etcd
      value: 'true'
  security:
    podSecurityPolicyCreate: true
    podSecurityContext:
      runAsNonRoot: true
      runAsUser: 1000
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000  
fluentd:
  bufferStorageVolume: {}
  livenessProbe:
    initialDelaySeconds: 30
    periodSeconds: 15
    tcpSocket:
      port: 24240
  nodeSelector: {}
  resources: {}
  tolerations: {}
  fluentOutLogrotate:
    enabled: false
  security:
    podSecurityPolicyCreate: true
    podSecurityContext:
      runAsNonRoot: true
      runAsUser: 1000
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000    
fullnameOverride: ''
global:
  cattle:
    systemDefaultRegistry: ''
  dockerRootDirectory: ''
  psp:
    enabled: false
  rkeWindowsPathPrefix: c:\
  seLinux:
    enabled: true
http:
  port: 8080
  service:
    annotations: {}
    clusterIP: None
    labels: {}
    type: ClusterIP
image:
  pullPolicy: IfNotPresent
  repository: rancher/mirrored-banzaicloud-logging-operator
  tag: 3.12.0
imagePullSecrets: []
images:
  config_reloader:
    repository: rancher/mirrored-jimmidyson-configmap-reload
    tag: v0.4.0
  fluentbit:
    repository: rancher/mirrored-fluent-fluent-bit
    tag: 1.7.9
  fluentbit_debug:
    repository: rancher/mirrored-fluent-fluent-bit
    tag: 1.7.9-debug
  fluentd:
    repository: rancher/mirrored-banzaicloud-fluentd
    tag: v1.12.4-alpine-1
  nodeagent_fluentbit:
    os: windows
    repository: rancher/fluent-bit
    tag: 1.7.4
monitoring:
  serviceMonitor:
    additionalLabels: {}
    enabled: false
    metricRelabelings: []
    relabelings: []
nameOverride: ''
namespaceOverride: ''
nodeAgents:
  tls:
    enabled: false
nodeSelector:
  kubernetes.io/os: linux
podLabels: {}
podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1000
priorityClassName: {}
rbac:
  enabled: true
  psp:
    annotations:
      seccomp.security.alpha.kubernetes.io/allowedProfileNames: docker/default,runtime/default
      seccomp.security.alpha.kubernetes.io/defaultProfileName: runtime/default
    enabled: true
replicaCount: 1
resources: {}
securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: false
  capabilities:
     drop: ["ALL"]
  runAsNonRoot: true
  runAsUser: 1000
systemdLogPath: /run/log/journal
tolerations:
  - effect: NoSchedule
    key: cattle.io/os
    operator: Equal
    value: linux

Notice that the SELinux configuration is applied (rancher-selinux is also installed) and PodSecurityContext and SecurityContext are configured, but as I said, they are not really working (they end up applied to neither the fluentbit DaemonSet nor the fluentd StatefulSet). The odd thing here is that the rke2-specific logging is not deployed with my values even though it is configured (it gets deployed with the default values instead, and of course that does not work because of the PSP restrictions).
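
A quick way to confirm that the values are not propagated is to inspect the workloads the operator generates (resource names assumed from the default rancher-logging release; adjust if yours differ):

kubectl -n cattle-logging-system get statefulset rancher-logging-fluentd -o jsonpath='{.spec.template.spec.securityContext}'
kubectl -n cattle-logging-system get daemonset rancher-logging-fluentbit -o jsonpath='{.spec.template.spec.securityContext}'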

I patched rancher-logging with security settings:

kubectl patch logging rancher-logging --patch "$(cat rancher-logging.patch.yaml)" --type=merge

This is the complete patch file:

spec:
  fluentbit:
    security:
      serviceAccount: rancher-logging
  fluentd:
    fluentOutLogrotate:
      enabled: false
    security:
      podSecurityPolicyCreate: true
      podSecurityContext:
        runAsNonRoot: true
        runAsUser: 1000
      serviceAccount: rancher-logging

You can see that I added the "rancher-logging" Service Account. Since I realized that the rke2-specific logging wasn't deployed, I prepared my own Logging resource (using customized values from another RKE cluster's settings; the paths change, of course). I just prepared a simple audit-logging Logging resource for /var/lib/rancher/rke2/server/logs/audit.log, just to collect that log. This is the content of my audit-logging-rke2.logging.yaml:

apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  labels:
    app.kubernetes.io/name: audit-logging
  name: audit-logging-rke2
spec:
  controlNamespace: logging
  fluentbit:
    security:
      podSecurityPolicyCreate: true
      serviceAccount: rancher-logging
    extraVolumeMounts:
    - destination: /var/lib/rancher/rke2/server/logs
      readOnly: true
      source: /var/lib/rancher/rke2/server/logs
    image:
      repository: rancher/fluent-fluent-bit
      tag: 1.6.4
    inputTail:
      Parser: json
      Path: /var/lib/rancher/rke2/server/logs/audit.log
      Tag: rke2
    nodeSelector:
      kubernetes.io/os: linux
    tolerations:
    - effect: NoSchedule
      key: cattle.io/os
      operator: Equal
      value: linux
    - effect: NoSchedule
      key: node-role.kubernetes.io/controlplane
      value: "true"
    - effect: NoExecute
      key: node-role.kubernetes.io/etcd
      value: "true"
  fluentd:
    fluentOutLogrotate:
      enabled: false
    security:
      podSecurityPolicyCreate: true
      podSecurityContext:
        runAsNonRoot: true
        runAsUser: 1000
      serviceAccount: rancher-logging
    configReloaderImage:
      repository: rancher/jimmidyson-configmap-reload
      tag: v0.2.2
    disablePvc: true
    image:
      repository: rancher/banzaicloud-fluentd
      tag: v1.11.5-alpine-1
    nodeSelector:
      kubernetes.io/os: linux
    tolerations:
    - effect: NoSchedule
      key: cattle.io/os
      operator: Equal
      value: linux
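
A sketch of how this resource would be used (assuming the file name above and the controlNamespace from the spec; a Flow/Output pair routing the rke2 tag to a destination still has to be defined separately for anything to actually be shipped):

kubectl create namespace logging    # controlNamespace referenced in the spec
kubectl apply -f audit-logging-rke2.logging.yaml
kubectl -n logging get pods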

Because PSPs are applied and hostPath volumes are not allowed by default, I tried adding the rancher-logging service account, but I am still stuck with the following event:

13m         Warning   FailedCreate   daemonset/rancher-logging-rke2-journald-aggregator   Error creating: pods "rancher-logging-rke2-journald-aggregator-" is forbidden: PodSecurityPolicy: unable to admit pod: [spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used]

I found that my settings are not applied:

 kubectl get ds rancher-logging-rke2-journald-aggregator  -o yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "1"
    meta.helm.sh/release-name: rancher-logging
    meta.helm.sh/release-namespace: cattle-logging-system
  creationTimestamp: "2021-09-15T12:41:35Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
  name: rancher-logging-rke2-journald-aggregator
  namespace: cattle-logging-system
  resourceVersion: "16840954"
  uid: 2266df4c-6ed2-4f57-866b-7171446243e2
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: rancher-logging-rke2-journald-aggregator
  template:
    metadata:
      annotations:
        checksum/config: 2f9f5c4dd58a8c52ea3331479642e88da00b897d93b00a91e449ac8bb0895c7c
      creationTimestamp: null
      labels:
        name: rancher-logging-rke2-journald-aggregator
      name: rancher-logging-rke2-journald-aggregator
      namespace: cattle-logging-system
    spec:
      containers:
      - image: rancher/mirrored-fluent-fluent-bit:1.7.9
        imagePullPolicy: IfNotPresent
        name: fluentbit
        resources: {}
        securityContext:
          seLinuxOptions:
            type: rke_logreader_t
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /fluent-bit/etc/
          name: config
        - mountPath: /run/log/journal
          name: journal
          readOnly: true
        - mountPath: /var/lib/rancher/rke2/agent/logs/kubelet.log
          name: kubelet
          readOnly: true
        - mountPath: /etc/machine-id
          name: machine-id
          readOnly: true
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: rancher-logging-rke2-journald-aggregator
      serviceAccountName: rancher-logging-rke2-journald-aggregator
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: cattle.io/os
        operator: Equal
        value: linux
      volumes:
      - configMap:
          defaultMode: 420
          name: rancher-logging-rke2
        name: config
      - hostPath:
          path: /run/log/journal
          type: ""
        name: journal
      - hostPath:
          path: /var/lib/rancher/rke2/agent/logs/kubelet.log
          type: ""
        name: kubelet
      - hostPath:
          path: /etc/machine-id
          type: ""
        name: machine-id
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 0
  desiredNumberScheduled: 0
  numberMisscheduled: 0
  numberReady: 0

Neither the Security Context nor the Service Account is configured with my settings. I am not sure whether my values are invalid or the operator simply does not use them. And the thing is that rancher-logging-rke2-journald-aggregator does not have any PSP associated with it, while rancher-logging does, with the right hostPath permissions.
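
One way to confirm which PSPs a service account is actually allowed to use is an impersonated auth check (service account names taken from the DaemonSet above; adjust as needed):

kubectl auth can-i use podsecuritypolicy/rancher-logging-fluentd \
  --as=system:serviceaccount:cattle-logging-system:rancher-logging-rke2-journald-aggregator
kubectl auth can-i use podsecuritypolicy/rancher-logging-fluentd \
  --as=system:serviceaccount:cattle-logging-system:rancher-logging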

📙 PSP rancher-logging-fluentd
└── 📓 Role cattle-logging-system/rancher-logging-fluentd-psp
    └── 📓 RoleBinding cattle-logging-system/rancher-logging-fluentd-psp
        └── 📗 Subject{Kind: ServiceAccount, Name: rancher-logging, Namespace: cattle-logging-system}

>> kubectl get psp rancher-logging-fluentd
Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
NAME                      PRIV    CAPS   SELINUX    RUNASUSER   FSGROUP     SUPGROUP    READONLYROOTFS   VOLUMES
rancher-logging-fluentd   false          RunAsAny   MustRunAs   MustRunAs   MustRunAs   false            configMap,emptyDir,secret,hostPath,persistentVolumeClaim
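
As a workaround sketch (not something shipped by the chart), an extra RoleBinding could grant the journald-aggregator service account use of the same PSP; whether this is the right long-term fix is up to the chart maintainers, and the pod still has to satisfy the policy's MustRunAs constraints:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: rancher-logging-rke2-journald-aggregator-psp
  namespace: cattle-logging-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rancher-logging-fluentd-psp
subjects:
- kind: ServiceAccount
  name: rancher-logging-rke2-journald-aggregator
  namespace: cattle-logging-system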

And this is where I am stuck now. Sorry for this long and probably chaotic description of steps and workarounds.

I would love to make everything work with Rancher's catalog Logging out of the box, but given the patches required and the issues found, I don't think it is really prepared for hardened environments yet.

If I finally make it work, I will write a guide and try to submit fixes/PRs for the patches and workarounds used.

Let me know if there is something you want me to try.

brandond commented 3 years ago

@frjaraur can you move that info into a new issue on the rancher/rancher project? It doesn't belong on this QA validation issue.

frjaraur commented 3 years ago

Hi, I already opened an issue 24 days ago :| with no luck: rancher - #343871. I will add all my research and workarounds from these days there.