sig-bsi-grundschutz / content

Security automation content in SCAP, Bash, Ansible, and other formats
https://www.open-scap.org/security-policies/scap-security-guide

APP.4.4.A19 #45

Open sluetze opened 10 months ago

ermeratos commented 8 months ago

A Kubernetes operation SHOULD be set up in such a way that if a site fails, the clusters (and thus the applications in the pods) either continue to run without interruption or can be restarted in a short time at another site.

Should a restart be required, all the necessary configuration files, images, user data, network connections, and other resources required for operation (including the necessary hardware) SHOULD already be available at the alternative site.

For the uninterrupted operation of clusters, the control plane of Kubernetes, the infrastructure applications of the clusters, and the pods of the applications SHOULD be distributed across several fire zones based on the location data of the corresponding nodes so that the failure of a fire zone will not lead to the failure of an application.

benruland commented 8 months ago

For the uninterrupted operation of clusters, the control plane of Kubernetes, the infrastructure applications of the clusters, and the pods of the applications SHOULD be distributed across several fire zones based on the location data of the corresponding nodes so that the failure of a fire zone will not lead to the failure of an application.

We could check whether the nodes (potentially separately for master and worker nodes) have the label topology.kubernetes.io/zone set. This would indicate a distribution of nodes across "fire zones".
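
A minimal sketch of such a check, assuming oc and jq are available on the scanning host (this is not the actual rule implementation):

# List nodes that do not carry the zone label mentioned above.
oc get nodes -o json \
  | jq -r '.items[]
           | select(.metadata.labels["topology.kubernetes.io/zone"] == null)
           | .metadata.name'

# Count the distinct zones; fewer than two would indicate missing distribution across fire zones.
oc get nodes -o json \
  | jq '[.items[].metadata.labels["topology.kubernetes.io/zone"] | select(. != null)] | unique | length'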

sluetze commented 8 months ago

Additionally, we might check whether there are multiple masters/workers; missing masters are quite surely an indicator of missing distribution.

While checking the masters might be easy, checking the workers might be difficult, because a user could have several node types. Maybe we could check each machineconfigset to see whether the number of selected nodes is higher than 1?

I cannot identify any checks for this upstream.
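
A rough sketch of what such counts could look like, assuming oc and jq (reading "machineconfigset" as the MachineSets in openshift-machine-api is only one possible interpretation):

# Multiple control-plane nodes are expected; a single master indicates missing distribution.
oc get nodes -l node-role.kubernetes.io/master -o name | wc -l

# Same for workers, although several node types may exist.
oc get nodes -l node-role.kubernetes.io/worker -o name | wc -l

# Per MachineSet: flag sets that provide only a single node.
oc get machinesets -n openshift-machine-api -o json \
  | jq -r '.items[] | select(.spec.replicas < 2) | .metadata.name'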

benruland commented 6 months ago

Ongoing implementation in https://github.com/ComplianceAsCode/content/pull/11659

benruland commented 1 month ago

I am unsure whether to include a rule that checks whether the pods of deployments and statefulsets are spread across nodes or zones using anti-affinity and/or topologySpreadConstraints.

While it is technically possible (I have implemented it), it produces a lot of findings, e.g.:

[
  "ansible-automation-platform/aap001-hub-content",
  "ansible-automation-platform/aap001-hub-worker",
  "argocd/argocd-dex-server",
  "argocd/argocd-redis",
  "argocd/argocd-repo-server",
  "argocd/argocd-server",
  "app-x/app-x-worker",
  "iwo-collector/iwo-k8s-collector-cisco-intersight",
  "nextcloud/nextcloud-operator-controller-manager",
  "openshift-apiserver-operator/openshift-apiserver-operator",
  "openshift-authentication-operator/authentication-operator",
  "openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-operator",
  "openshift-cloud-credential-operator/cloud-credential-operator",
  "openshift-cluster-machine-approver/machine-approver",
  "openshift-cluster-node-tuning-operator/cluster-node-tuning-operator",
  "openshift-cluster-samples-operator/cluster-samples-operator",
  "openshift-cluster-storage-operator/cluster-storage-operator",
  "openshift-cluster-storage-operator/csi-snapshot-controller-operator",
  "openshift-cluster-version/cluster-version-operator",
  "openshift-compliance/compliance-operator",
  "openshift-compliance/ocp4-openshift-compliance-pp",
  "openshift-compliance/rhcos4-openshift-compliance-pp",
  "openshift-compliance/upstream-ocp4-bsi-node-master-rs",
  "openshift-compliance/upstream-ocp4-bsi-node-worker-rs",
  "openshift-compliance/upstream-ocp4-bsi-rs",
  "openshift-compliance/upstream-ocp4-openshift-compliance-pp",
  "openshift-compliance/upstream-rhcos4-bsi-master-rs",
  "openshift-compliance/upstream-rhcos4-bsi-worker-rs",
  "openshift-compliance/upstream-rhcos4-openshift-compliance-pp",
  "openshift-config-operator/openshift-config-operator",
  "openshift-console-operator/console-operator",
  "openshift-controller-manager-operator/openshift-controller-manager-operator",
  "openshift-dns-operator/dns-operator",
  "openshift-etcd-operator/etcd-operator",
  "openshift-gitops/cluster",
  "openshift-gitops/kam",
  "openshift-image-registry/cluster-image-registry-operator",
  "openshift-ingress-operator/ingress-operator",
  "openshift-insights/insights-operator",
  "openshift-kube-apiserver-operator/kube-apiserver-operator",
  "openshift-kube-controller-manager-operator/kube-controller-manager-operator",
  "openshift-kube-scheduler-operator/openshift-kube-scheduler-operator",
  "openshift-kube-storage-version-migrator-operator/kube-storage-version-migrator-operator",
  "openshift-kube-storage-version-migrator/migrator",
  "openshift-machine-api/cluster-autoscaler-operator",
  "openshift-machine-api/cluster-baremetal-operator",
  "openshift-machine-api/control-plane-machine-set-operator",
  "openshift-machine-api/machine-api-operator",
  "openshift-machine-config-operator/machine-config-controller",
  "openshift-machine-config-operator/machine-config-operator",
  "openshift-marketplace/marketplace-operator",
  "openshift-monitoring/cluster-monitoring-operator",
  "openshift-monitoring/kube-state-metrics",
  "openshift-monitoring/openshift-state-metrics",
  "openshift-monitoring/prometheus-operator",
  "openshift-monitoring/telemeter-client",
  "openshift-multus/multus-admission-controller",
  "openshift-network-diagnostics/network-check-source",
  "openshift-operator-lifecycle-manager/catalog-operator",
  "openshift-operator-lifecycle-manager/olm-operator",
  "openshift-operator-lifecycle-manager/package-server-manager",
  "openshift-operators/gitlab-runner-gitlab-runnercontroller-manager",
  "openshift-operators/gitops-operator-controller-manager",
  "openshift-operators/pgo",
  "openshift-service-ca-operator/service-ca-operator",
  "openshift-service-ca/service-ca",
  "redhat-ods-applications/data-science-pipelines-operator-controller-manager",
  "redhat-ods-applications/etcd",
  "redhat-ods-applications/notebook-controller-deployment",
  "redhat-ods-applications/odh-notebook-controller-manager",
  "redhat-ods-operator/rhods-operator",
  "trident/trident-controller",
  "trident/trident-operator"
]
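
For illustration, a query of roughly this shape could produce such a list (a sketch assuming oc and jq, not the implementation from the PR; statefulsets would need an analogous query):

# Deployments whose pod template defines neither topologySpreadConstraints nor podAntiAffinity.
oc get deployments --all-namespaces -o json \
  | jq -r '.items[]
           | select((.spec.template.spec.topologySpreadConstraints // []) | length == 0)
           | select(.spec.template.spec.affinity.podAntiAffinity == null)
           | "\(.metadata.namespace)/\(.metadata.name)"'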

When filtering for deployments that have more than one replica, I get:

[
  "ansible-automation-platform/aap001-hub-content",
  "ansible-automation-platform/aap001-hub-worker",
  "argocd/argocd-repo-server",
  "argocd/argocd-server",
  "app-x/app-x-worker",
  "iwo-collector/iwo-k8s-collector-cisco-intersight",
  "openshift-multus/multus-admission-controller"
]

I believe that for many deployments it is entirely valid not to configure high availability, and restarts are sufficient... Making exclusion configurable is possible but will likely be painful.

Need input @ermeratos @sluetze! Options I see:

a) Do not include a rule at all
b) Only consider deployments that have more than one replica -> those are intended for HA and should hence be spread evenly
c) Consider all deployments and statefulsets and make exclusion configurable

-> For now, I have implemented variant b) with configurable exclusion (c); see the sketch below.
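
Roughly, variant b) with a configurable exclusion (c) could look like the following, extending the sketch above (the exclusion entry is only an example value, not the rule's actual variable):

# Variant b): only deployments intended for HA (more than one replica).
# Variant c): drop findings listed in a configurable exclusion list.
EXCLUDED='["openshift-multus/multus-admission-controller"]'

oc get deployments --all-namespaces -o json \
  | jq -r --argjson excluded "$EXCLUDED" '
      .items[]
      | select(.spec.replicas > 1)
      | select((.spec.template.spec.topologySpreadConstraints // []) | length == 0)
      | select(.spec.template.spec.affinity.podAntiAffinity == null)
      | "\(.metadata.namespace)/\(.metadata.name)"
      | select(. as $finding | $excluded | index($finding) | not)'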

sluetze commented 1 month ago

As our customers tend to prefer having a rule rather than not having one (they can tailor it out at any time), and you have already done the implementation work, I would go with b + c. The exclusion seems necessary for such rules, as we have had several occurrences of hard-coded exclusions that needed to become configurable afterwards.

benruland commented 1 month ago

During rebasing, I accidentally closed the previous PR. For better reviewability, I created a new PR: https://github.com/ComplianceAsCode/content/pull/12155