grafana / unused

CLI tool, Prometheus exporter, and Go module to list your unused disks in all cloud providers
Apache License 2.0

Add metric to track the size and labels of individual disks #91

Closed · paulajulve closed this 2 months ago

paulajulve commented 2 months ago

Re: https://github.com/grafana/deployment_tools/issues/111858

Once the disks are detached from the cluster, the only way back to knowing where they belonged is via the namespace and the zone. But some namespaces exist in different zones/clusters (e.g., generic names like hosted-exporters). The current unused_disks_size_gb metric adds up the sizes of all disks under the same namespace. For unique namespaces, that's alright. But for namespaces that span multiple regions, it can lead to costs being attributed to the wrong cluster, and to data that's hard to interpret.

To make it easier for future us, I'm adding a new unused_disk_size_bytes metric here, which publishes the size per individual disk instead of an aggregated sum. It still carries the k8s_namespace, type, and zone labels, since we'll need all of them to attribute costs appropriately.
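
For illustration only, a minimal sketch of how such a per-disk gauge might look with prometheus/client_golang. The label set follows this description; the id label and function names are hypothetical, and this is not necessarily how the PR implements it:

package exporter

import "github.com/prometheus/client_golang/prometheus"

// Per-disk gauge, as opposed to the existing per-namespace sum in
// unused_disks_size_gb. The "id" label is hypothetical here; whether a
// per-disk label is acceptable cardinality-wise is discussed below.
var unusedDiskSizeBytes = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "unused_disk_size_bytes",
        Help: "Size in bytes of an individual unused disk.",
    },
    []string{"id", "k8s_namespace", "type", "zone"},
)

func init() {
    prometheus.MustRegister(unusedDiskSizeBytes)
}

// recordUnusedDisk would be called once per unused disk found,
// instead of adding sizes up per namespace.
func recordUnusedDisk(id, namespace, diskType, zone string, sizeBytes float64) {
    unusedDiskSizeBytes.WithLabelValues(id, namespace, diskType, zone).Set(sizeBytes)
}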

I've also opted for bytes as the unit, since the size field is not uniform across providers. Azure and GCP offer the size of the volume in GB, while AWS uses GiB. Even though they all end up billing per GB 🙄
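
Spelling out the conversion that implies (just a sketch based on the units above, not the exporter's actual code):

package exporter

// Normalize provider-reported sizes to bytes: Azure and GCP report GB
// (decimal, 1e9 bytes), AWS reports GiB (binary, 2^30 bytes).
const (
    bytesPerGB  int64 = 1_000_000_000 // Azure, GCP
    bytesPerGiB int64 = 1 << 30       // AWS
)

func gbToBytes(sizeGB int64) int64   { return sizeGB * bytesPerGB }
func gibToBytes(sizeGiB int64) int64 { return sizeGiB * bytesPerGiB }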

I've tested this out locally both for AWS and GCP. I have not tested it for Azure. I'll work on that now.

paulajulve commented 2 months ago

Oh, I've just realised this one will require updating the readme also. I'll work on that.

In the meantime, @inkel , I'm slightly concerned about this warning here. Does this still hold true? I don't fully understand what it means.

inkel commented 2 months ago

> Oh, I've just realised this one will require updating the readme also. I'll work on that.
>
> In the meantime, @inkel , I'm slightly concerned about this warning here. Does this still hold true? I don't fully understand what it means.

@paulajulve I believe it still does. That particular comment refers to adding the disk ID as a label being the main culprit leading to a cardinality explosion of the metric. Say for instance you have 1,000 unused disks on each provider: you'd end up with a label that has 1,000 unique values per provider, so roughly 3,000 extra time series across three providers. As per this article, label cardinality should be kept as low as possible if we don't want to bog down Prometheus.

paulajulve commented 2 months ago

I could update the "unused disk found" log to include the k8s_namespace too (and maybe use bytes instead of GB/GiB) and use unwrap more or less like this (explore):

sum by (k8s_namespace, zone) (
  sum_over_time(
    {namespace="unused-exporter", container!="eventrouter"}
      | logfmt
      | msg="unused disk found"
      | unwrap size_bytes [1m])
  )

I'm unclear on next steps tho 🤔 we want to add this to our TCO recording rules. Can we add this expression to a prometheus recording rule? I've seen loki based alerts, but I'm not sure we can mix and match like this out of the box.

jjo commented 2 months ago

> I'm unclear on next steps tho 🤔 we want to add this to our TCO recording rules. Can we add this expression to a prometheus recording rule? I've seen loki based alerts, but I'm not sure we can mix and match like this out of the box.

As these are metrics we're adding, it'll be a matter of coming up with a recording rule to "fill the gap(s)" of the missing labels we need to join on, cluster in particular being a very relevant one.

Here's a PromQL query that joins the metrics we're adding with some existing ones (note that I mocked 3 existing disks by abusing absent(...), as if those series were generated by the unused exporter after merging this PR):

/explore

# Ultimate goal: account (aggregate) disk GB by: cluster,namespace,team
#
# NOTE: all label_join() calls are there to "normalize" existing labels towards: cluster, namespace, zone
#
sum by (cluster, namespace, team)(
    # Mock 3 disks on respective ns,zone with 10GB, 10GB, 100GB
    label_join(  # k8s_namespace -> namespace
        sum by (k8s_namespace, zone)(
            10e9 * absent(
                unused_disk_size_bytes{k8s_namespace="loki-ops", type="ssd", zone="us-east4", id="what-ever-001"}
            ) or
            10e9 * absent(
                unused_disk_size_bytes{k8s_namespace="loki-ops", type="ssd", zone="us-east4", id="what-ever-002"}
            ) or
            100e9 * absent(
                unused_disk_size_bytes{k8s_namespace="cortex-prod-10", type="ssd", zone="us-central1", id="what-ever-002"}
            )
        ), "namespace", "", "k8s_namespace"
    )

    # Join the (few) labels we have from `unused_disk_size_bytes` metric against those from tanka
    * on (namespace, zone) group_right()

    label_join(         # location -> zone
        label_join(     # exported_cluster -> cluster
            label_join( # exported_namespace -> namespace

                # HACK: (just) pick _one_ cluster from the zone, as `unused_disk_size_bytes` doesn't have "cluster" label
                max by (exported_cluster, exported_namespace, location, team)(
                    tanka_environment_info
                    * on (exported_cluster) group_left(location)
                    tanka_cluster_info
                ), "namespace", "", "exported_namespace"
            ), "cluster", "", "exported_cluster"
        ), "zone", "", "location"
    )
) / 1e9 # show the sum in GB

ends up showing:

[screenshot of the query results]

Then, with the (cluster, namespace, team) grouping we should be ready to aggregate them into a cost recording rule, as decided in the "Design Doc: include unused disks on TCO" d/doc.

jjo commented 2 months ago

> As explained on the main comments section, I'm concerned about the cardinality of the new metric, in particular in the disk label.
>
> Given that the information was logged we could use Loki and its unwrap LogQL keyword to create values from these, instead of loading Prometheus.

IMO these disk IDs live in the ~same playground as pod IDs (which of course we already carry from KSM metrics), and I think we may need them as metrics, ready to be used to build tables (showing ID, size) for teams to prioritize their cleanup. Regarding cardinality, I'm not too worried in this use case: there's no O(n²) (or worse) blow-up, since there's a 1:1 ID↔disk mapping (plus other labels) with no further permutations, and we only have a thousand-ish of these.