projectcapsule / capsule

Multi-tenancy and policy-based framework for Kubernetes.
https://capsule.clastix.io
Apache License 2.0

feat(metrics): provide metrics for tenant quotas #1094

Closed lukasboettcher closed 1 month ago

lukasboettcher commented 1 month ago

Description

This PR adds two custom metrics for Capsule tenants: `capsule_tenant_resource_limit` and `capsule_tenant_resource_usage`.

Use case

When resource quotas are configured via Capsule at the Tenant scope, capacity planning via Prometheus metrics from e.g. kube-state-metrics is difficult, since the sum of the per-namespace ResourceQuotas is not what is actually enforced. Instead, we can provide metrics that expose the aggregated resource limits and usage for a tenant.

Example metrics

Tenant Resource

```yaml
apiVersion: capsule.clastix.io/v1beta2
kind: Tenant
metadata:
  name: test
spec:
  owners:
    - name: alice
      kind: User
  namespaceOptions:
    quota: 10
  resourceQuotas:
    scope: Tenant
    items:
      - hard:
          pods: 100
      - hard:
          limits.memory: 4Gi
          requests.memory: 4Gi
      - hard:
          requests.memory: 6Gi
```
Metrics

```
# HELP capsule_tenant_resource_limit Current resource limit for a given resource in a tenant
# TYPE capsule_tenant_resource_limit gauge
capsule_tenant_resource_limit{resource="limits.memory",resourcequotaindex="1",tenant="test"} 4.294967296e+09
capsule_tenant_resource_limit{resource="namespaces",resourcequotaindex="",tenant="test"} 10
capsule_tenant_resource_limit{resource="pods",resourcequotaindex="0",tenant="test"} 100
capsule_tenant_resource_limit{resource="requests.memory",resourcequotaindex="1",tenant="test"} 4.294967296e+09
capsule_tenant_resource_limit{resource="requests.memory",resourcequotaindex="2",tenant="test"} 6.442450944e+09
# HELP capsule_tenant_resource_usage Current resource usage for a given resource in a tenant
# TYPE capsule_tenant_resource_usage gauge
capsule_tenant_resource_usage{resource="limits.memory",resourcequotaindex="1",tenant="test"} 2.68435456e+09
capsule_tenant_resource_usage{resource="namespaces",resourcequotaindex="",tenant="test"} 4
capsule_tenant_resource_usage{resource="pods",resourcequotaindex="0",tenant="test"} 20
capsule_tenant_resource_usage{resource="requests.memory",resourcequotaindex="1",tenant="test"} 2.68435456e+09
capsule_tenant_resource_usage{resource="requests.memory",resourcequotaindex="2",tenant="test"} 2.68435456e+09
```
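Assuming these metrics are scraped by Prometheus, a query like the following sketches the capacity-planning use case described above (the expression is illustrative, not part of this PR). Because both metrics carry identical label sets, a plain vector division matches the corresponding limit for each usage sample:

```promql
# Fraction of each tenant-level quota item currently consumed
capsule_tenant_resource_usage / capsule_tenant_resource_limit
```

With the example metrics above, this would report e.g. 0.4 for `namespaces` (4 of 10) and 0.2 for `pods` (20 of 100) in tenant `test`.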
oliverbaehler commented 1 month ago

@lukasboettcher Thanks! Wondering, did you use resourcequotaindex to reduce cardinality, instead of e.g. the namespace?

lukasboettcher commented 1 month ago

Since it is possible to create multiple resource quotas for the same resource, I ran into a problem where metrics were being overwritten. In the example above, the third entry (requests.memory: 6Gi) would overwrite the second (requests.memory: 4Gi) if we did not account for the index of the quota. Kubernetes itself always enforces the lowest quota, so we need to keep metrics for every entry in tnt.spec.resourceQuotas.items.*. I did not use namespace as a label because these metrics are tenant-scoped and should be independent of the namespaces; metrics for the individual ResourceQuota objects are already exposed by e.g. kube-state-metrics.
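Since Kubernetes enforces the lowest of several quotas covering the same resource, the effective tenant-level limit can be recovered at query time by aggregating away the index label. An illustrative query, assuming the metrics proposed in this PR:

```promql
# Effective enforced limit per tenant and resource:
# the minimum across all resourceQuotas items
min by (tenant, resource) (capsule_tenant_resource_limit)
```

For the example tenant above, this would yield 4Gi for requests.memory, since the 4Gi item is stricter than the 6Gi one.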

oliverbaehler commented 1 month ago

Thanks! Just FYI, we are also working on improving the observability of the tenant resource quota, and on a mechanism to avoid race conditions. One measure is to expose the usage in the tenant status:

```yaml
status:
  namespaces:
    - green-prod
    - green-test
  quota:
    hard:
      limits.cpu: "2"
      limits.memory: 2Gi
      pods: "6"
      requests.cpu: "1"
      requests.memory: 1Gi
    used:
      limits.cpu: 400m
      limits.memory: 1Gi
      pods: "2"
      requests.cpu: 200m
      requests.memory: 256Mi
```