Open shtripat opened 5 years ago
Thanks for this report.
We already have something like this in the mixins. Instead of relying on percentages as the main alerting factor, we rely on predict_linear for disk/volume usage.
It is possible for a disk to be at 80.000% and for usage to grow to only 80.001% within 24 hours. We really don't care about that. Instead, we want to alert when something is really happening to disk usage and the disk will run full within a given time period. Enter predict_linear.
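A minimal sketch of this style of rule, loosely modeled on the node-exporter mixin (the metric names are node-exporter's; the exact expression, windows, and thresholds here are illustrative, not the mixin's precise rule):

```yaml
groups:
- name: node-storage.rules
  rules:
  - alert: NodeFilesystemSpaceFillingUp
    # Fire only when the linear trend over the last 6h predicts the
    # filesystem will be empty within 24h AND usage is already meaningful,
    # so an 80.000% -> 80.001% drift never pages anyone.
    expr: |
      predict_linear(node_filesystem_avail_bytes{fstype!=""}[6h], 24*60*60) < 0
        and
      node_filesystem_avail_bytes{fstype!=""} / node_filesystem_size_bytes{fstype!=""} < 0.4
    for: 1h
    labels:
      severity: warning
```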
Does that resolve your issue?
@metalmatze kind of makes sense to me. Thanks for pointing to the exact alerting rules. @julienlim need inputs from you as this came up as suggestions from UX team.
The following alerts are being proposed at the generic storage level:
PVCHardLimitNearingFull - warning (80%), critical (90%)
Maps to the `requests.storage` storage resource quota (across all persistent volume claims, the sum of storage requests cannot exceed this value), as specified in https://kubernetes.io/docs/concepts/policy/resource-quotas/#enabling-resource-quota. As this is cluster-wide, having some kind of alert to indicate when PVC storage requests (capacity) are running out is important, so that action (e.g. adding capacity, reclaiming space, etc.) can be taken. Filling up to maximum capacity is usually not a good idea, as it can lead to undesirable situations, e.g. performance degradation, instability, etc.

If the underlying storage is on AWS or another public cloud provider, this usually means expanding the underlying volume (and possibly restarting some instances), reclaiming space, or some other data-offload technique. The same is true for Gluster (OCS) and Ceph. For on-prem storage subsystems, this may mean ordering additional disks to support the expansion, as well as a procurement process (which may or may not apply in the public cloud). Note: a single OCP cluster typically involves multiple storage subsystems, and in that scenario this could mean expansion in one or more of them.

80% utilization is meant as an early warning to start taking action to prevent severe issues; 90% utilization is much more severe/critical, requiring more immediate action by the admin/operator.

The alert name "PVCHardLimitNearingFull", with the words "hard limit", is suggested because "requests" is confusing to users, and the CPU and memory quota terminology differs from the persistent storage quota terminology (though the ephemeral storage quota terminology seems more aligned with the CPU and memory quota terminology).
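Assuming kube-state-metrics is available, a quota-based rule could be sketched along these lines (the metric and label names follow kube-state-metrics' `kube_resourcequota`; the alert name and threshold are the proposal above, and a critical rule at 0.90 would be analogous):

```yaml
- alert: PVCHardLimitNearingFull
  # used / hard ratio of the requests.storage quota; warn at 80%.
  expr: |
    kube_resourcequota{resource="requests.storage", type="used"}
      / ignoring(type)
    kube_resourcequota{resource="requests.storage", type="hard"}
      > 0.80
  for: 15m
  labels:
    severity: warning
```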
StorageClass.PVCHardLimitNearingFull - warning (80%), critical (90%)
Maps to the `<storage-class-name>.storageclass.storage.k8s.io/requests.storage` storage resource quota (across all persistent volume claims associated with the storage-class-name, the sum of storage requests cannot exceed this value), as specified in https://kubernetes.io/docs/concepts/policy/resource-quotas/#enabling-resource-quota.
This is similar to requests.storage but differs in that it is scoped to a storage class. As this is tied to a single storage provisioner, which is basically either underlying storage on a public cloud provider or a storage subsystem (e.g. Gluster/OCS, Ceph, AWS EBS, etc.), once again the admin has to take action:
expand the storage (which may or may not be a disruptive operation), going through a procurement process (if applicable)
figure out ways to offload the existing storage (reclamation, archiving, deleting data, migrating to something bigger, etc.). If data is getting offloaded, once again, the admin has to communicate with the users to let them know or have the users take action.
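The same `kube_resourcequota` pattern would apply per storage class; only the quota resource name changes (the "gold" class name below is purely illustrative, and this assumes kube-state-metrics):

```yaml
- alert: StorageClassPVCHardLimitNearingFull
  # used / hard ratio of the class-scoped requests.storage quota.
  expr: |
    kube_resourcequota{resource="gold.storageclass.storage.k8s.io/requests.storage", type="used"}
      / ignoring(type)
    kube_resourcequota{resource="gold.storageclass.storage.k8s.io/requests.storage", type="hard"}
      > 0.80
  for: 15m
  labels:
    severity: warning
```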
StorageClass.PVCCountNearingFull - warning (80%), critical (90%)
Maps to the `<storage-class-name>.storageclass.storage.k8s.io/persistentvolumeclaims` storage resource quota (across all persistent volume claims associated with the storage-class-name, the total number of persistent volume claims that can exist in the namespace), as specified in https://kubernetes.io/docs/concepts/policy/resource-quotas/#enabling-resource-quota.
This is less worrying but nevertheless still relevant, as it refers to the count of PVCs. If the quota runs out, users will be unable to make new claims.
The 80% is just a warning to the admin/operator to either increase the allotted number, look into reclamation (if not automatic), or ask users to remove unneeded PVCs.
90% just means it's more urgent, with a higher likelihood that the developer/consumer will experience issues requesting storage if it is not addressed.
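A class-scoped count alert could follow the same quota-ratio shape (again assuming kube-state-metrics; the `=~` regex is a sketch to cover any storage class name):

```yaml
- alert: StorageClassPVCCountNearingFull
  # used / hard ratio of the per-class persistentvolumeclaims count quota.
  expr: |
    kube_resourcequota{resource=~".+storageclass.storage.k8s.io/persistentvolumeclaims", type="used"}
      / ignoring(type)
    kube_resourcequota{resource=~".+storageclass.storage.k8s.io/persistentvolumeclaims", type="hard"}
      > 0.80
  for: 15m
  labels:
    severity: warning
```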
Namespace.PVCCountNearingFull - warning (80%), critical (90%)
This maps to the `persistentvolumeclaims` storage resource quota (the total number of persistent volume claims that can exist in the namespace).
Namespace.EphemeralStorageLimitNearingFull - warning (80%), critical (90%)
This maps to `limits.ephemeral-storage`.
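Both namespace-scoped alerts could be sketched with the same quota ratio (assuming kube-state-metrics' `kube_resourcequota`; names and thresholds are the proposal above):

```yaml
- alert: NamespacePVCCountNearingFull
  # used / hard ratio of the namespace persistentvolumeclaims count quota.
  expr: |
    kube_resourcequota{resource="persistentvolumeclaims", type="used"}
      / ignoring(type)
    kube_resourcequota{resource="persistentvolumeclaims", type="hard"}
      > 0.80
  labels:
    severity: warning
- alert: NamespaceEphemeralStorageLimitNearingFull
  # used / hard ratio of the limits.ephemeral-storage quota.
  expr: |
    kube_resourcequota{resource="limits.ephemeral-storage", type="used"}
      / ignoring(type)
    kube_resourcequota{resource="limits.ephemeral-storage", type="hard"}
      > 0.80
  labels:
    severity: warning
```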
NodeDiskRunningFull
This should apply to any node (not just the Device/Namespace/Pod reported by node-exporter) and should indicate when the disk will be full.
This relates to filesystems (see https://github.com/coreos/prometheus-operator/blob/master/contrib/kube-prometheus/manifests/prometheus-rules.yaml), though the alert label says it’s about a disk (which I found confusing).
For filesystems, utilization beyond 90% is usually not good, but the suggestion is to keep the threshold at 85%, as in the existing alert, since this alert is meant to kick in after kubelet garbage collection, which kicks in somewhere around 80-85% (default per https://kubernetes.io/docs/concepts/cluster-administration/kubelet-garbage-collection/#container-collection).
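A threshold-style sketch of that 85% rule (illustrative only, not the exact rule from prometheus-rules.yaml):

```yaml
- alert: NodeDiskRunningFull
  # Warn once less than 15% of the filesystem is free (i.e. >85% used),
  # leaving the kubelet's 80-85% garbage-collection window room to act first.
  expr: |
    node_filesystem_avail_bytes{fstype!=""} / node_filesystem_size_bytes{fstype!=""} < 0.15
  for: 30m
  labels:
    severity: warning
```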