compute: support labelling spot instance as boolean

jjo commented 2 months ago

What

Support a flag such as --compute.spot-as-boolean-label=LABEL_NAME (e.g. with LABEL_NAME=is_spot) that adds an extra boolean label marking spot instances.

Why

Whether an instance is a Spot instance is driven by a choice made by the "upper"-level workloads (which set the nodeSelectors/taints/labels that trigger CA to allocate such nodes to them), and those workloads explicitly know the reward/risk trade-offs of spot. This is quite different from on-demand vs. RIs (granted, the latter are the only ones guaranteed to be available).

At Grafana Labs, our recording rules have grown to rely on spot={true|false}, which now forces us to use the relabelling trick below (really, adding a label) to set it:

(
  label_replace(<CSP>_instance_cpu_usd_per_core_hour{price_tier="spot"}, "spot", "true", "", "")
  or
  label_replace(<CSP>_instance_cpu_usd_per_core_hour{price_tier!="spot"}, "spot", "false", "", "")
)

The above needs to be done for every <CSP> you run (especially if you have a centralized TSDB). Note also that the query has to visit every time series to perform the relabelling (adding TSDB pressure) and create two sets of time series, only to then OR them together.

logyball commented 2 months ago

This feels like a fine thing to add. It does expand the set of labels for compute instances by one per metric, but it doesn't add cardinality, and it will probably be convenient down the line. To be honest, I don't even think we need this as a config option; we can just straight-up add a spot boolean where:

ondemand and reserved instances (and any other types that cloud providers decide to create in the future) set it to false
spot instances set it to true
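
As a rough illustration of that mapping, here is a minimal Go sketch. The helper name spotLabel is hypothetical and is not taken from the exporter's code, and the "ondemand" and "reserved" tier strings are assumed for the example; only "spot" appears as a price_tier value in this thread.

```go
package main

import "fmt"

// spotLabel returns the value for a boolean "spot" label given a price tier.
// Hypothetical helper for illustration: only the "spot" tier maps to "true";
// every other tier (including any added by cloud providers later) maps to "false".
func spotLabel(priceTier string) string {
	if priceTier == "spot" {
		return "true"
	}
	return "false"
}

func main() {
	// The non-spot tier names below are assumed values for the example.
	for _, tier := range []string{"spot", "ondemand", "reserved"} {
		fmt.Printf("price_tier=%q spot=%q\n", tier, spotLabel(tier))
	}
}
```

Because the value is derived entirely from the price tier, each existing series gets exactly one extra label value, which is why this would not add cardinality.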

Pokom commented 2 months ago

@jjo and I synced up on this yesterday and came to the agreement that the best thing to do here is to simply add a new label called spot that is a boolean value. This label is more of a relic of how we internally compute our TCO metrics, but there's potential value for others as well.

The rationale for a new label is:

Minimal overhead added to each time series
Reduced complexity by avoiding an operational toggle

jjo commented 1 month ago

Closing after syncing with @the-it on fully embracing price_tier instead.

grafana / cloudcost-exporter

compute: support labelling spot instance as boolean #263
