grafana / cloudcost-exporter

Prometheus Exporter for Cloud Provider agnostic cost metrics
Apache License 2.0

compute: support using `node` instead/above of `instance` label #262

Open jjo opened 3 months ago

jjo commented 3 months ago

What

Either change compute modules' instance to node, or support e.g. --compute.vm-instance-label=LABEL_NAME.

Why

Two main reasons:

1) When using cloudcost-exporter metrics in Kubernetes, joining against kube_.* metrics (kube_node_labels, kube_node_annotations, kube_node_status_capacity, kube_node_status_allocatable, ...; see the full list below) forces you to relabel instance to node with a construct like

  label_join( <CSP>_instance_cpu_usd_per_core_hour{}, "node", "", "exported_instance")

This join has to be written for every <CSP> you run (especially if you have a centralized TSDB), and note that the relabelling touches every time series involved, adding pressure on the TSDB.
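As a scrape-time alternative to the query-time label_join above, every Prometheus scrape job would need a metric_relabel_configs entry along these lines (a sketch using standard Prometheus configuration; the job name and target are hypothetical, and exported_instance assumes the default honor_labels: false conflict renaming):

```yaml
scrape_configs:
  - job_name: cloudcost-exporter              # hypothetical job name
    static_configs:
      - targets: ["cloudcost-exporter:9090"]  # hypothetical address
    metric_relabel_configs:
      # copy the conflict-renamed exported_instance label into node,
      # once per ingested sample
      - source_labels: [exported_instance]
        target_label: node
```

Either way, the rename is paid per time series, which is exactly the TSDB pressure described above.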

2) cloudcost-exporter already (wisely) chose cluster_name instead of cluster (a popular label-name choice, thereby avoiding the scrape-time rename to exported_cluster). Why then keep instance, which scraping agents will by design always rename to exported_instance (hence the example relabelling query above)?

Appendix

Pasting the full list of Kubernetes metrics that use node, from this query: count by (__name__) ({__name__=~"kube_.+", node!=""})

kube_node_annotations
kube_node_created
kube_node_info
kube_node_labels
kube_node_spec_taint
kube_node_spec_unschedulable
kube_node_status_allocatable
kube_node_status_allocatable_cpu_cores
kube_node_status_allocatable_memory_bytes
kube_node_status_capacity
kube_node_status_capacity_cpu_cores
kube_node_status_capacity_memory_bytes
kube_node_status_condition
kube_pod_container_resource_limits
kube_pod_container_resource_requests
kube_pod_info
kube_node_status_addresses
kube_pod_init_container_resource_limits
kube_pod_init_container_resource_requests
logyball commented 3 months ago

I'm hesitant about replacing the instance label with node, as this couples our definition of "machine" with "machine provisioned by kubernetes". Since we want to make this more agnostic of how the machine was provisioned, I think that node wouldn't be the right choice here.

That being said, this does make sense as an optional config argument, as it can then be used for convenience for a big (and possibly majority) use-case: managed Kubernetes.

I think this brings up an interesting point, since the instance label will often be replaced by Prometheus scrapers and rewritten as exported_instance (as the instance of the job running cloudcost-exporter is already using the instance label). Thoughts on:

  • a config value for the label meaning "name of machine"?
  • renaming the instance label to something agnostic of how it was provisioned, like machine_name?

jjo commented 3 months ago

> I'm hesitant about replacing the instance label with node, as this couples our definition of "machine" with "machine provisioned by kubernetes". Since we want to make this more agnostic of how the machine was provisioned, I think that node wouldn't be the right choice here.
>
> That being said, this does make sense as an optional config argument, as it can then be used for convenience for a big (and possibly majority) use-case: managed Kubernetes.
>
> I think this brings up an interesting point, since the instance label will often be replaced by Prometheus Scrapers and rewritten as exported_instance (as the instance of the job running cloudcost-exporter is already using the instance label). Thoughts on:
>
>   • config value for the label meaning "name of machine"?
>   • renaming the instance label to something agnostic of how it was provisioned, like machine_name?

+1 on moving away from instance; I like machine_name as the default value for such a flag, e.g. --compute.vm-instance-label=LABEL_NAME (or similar, naming-is-hard is always haunting us).
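A minimal sketch of how such a flag could thread through to the metric definitions (the flag name is the hypothetical one from this thread; the helper and the other labels are illustrative, not the exporter's actual code):

```go
package main

import (
	"flag"
	"fmt"
)

// labelNames builds the variable label set for compute metrics around the
// configured machine label; the remaining labels here are illustrative only.
func labelNames(machineLabel string) []string {
	return []string{machineLabel, "region", "family", "machine_type"}
}

// hypothetical flag discussed in this issue, with machine_name as its default
var vmInstanceLabel = flag.String("compute.vm-instance-label", "machine_name",
	"label name identifying the VM/machine in compute metrics")

func main() {
	flag.Parse()
	// every compute collector would derive its label set from this single
	// value, keeping the machine label consistent across providers
	fmt.Println(labelNames(*vmInstanceLabel))
	// prints: [machine_name region family machine_type]
}
```

Because the label name is fixed once at startup, a single exporter process can never mix two different machine labels in its own output, which addresses part of the consistency concern below.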

Pokom commented 3 months ago

I am opposed to having labels defined via command line arguments for the following reasons:

  1. It implies we got the names of labels incorrect and we're masking that to ease PromQL creation
  2. Potential of running many collectors with different labels at the same time which means we can't guarantee the use of a metric with a specific query
  3. Exposing us to having a configuration for every label and having a massive config struct to run the application

I'd really like for us to get alignment on what the root problem is that we're trying to address (in this case, instance doesn't appear to be the appropriate label; it should be node), and then agree on a solution. Whichever path we take is going to require us to rewrite our internal PromQL queries and do a migration of the metrics.

jjo commented 3 months ago

> I am opposed to having labels defined via command line arguments for the following reasons:
>
> 1. It implies we got the names of labels incorrect and we're masking that to ease PromQL creation

It's not only about easing PromQL; it actually decreases the load on the TSDB, by avoiding an arbitrary renaming that will otherwise almost always be needed (should we keep instance as its name).
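To make the saving concrete: if the exporter emitted node (or a configurable machine label) natively, the Kubernetes join would need no per-series rewrite at all, e.g. (a sketch; <CSP> metric naming as above, kube_node_status_capacity from kube-state-metrics):

```promql
# USD per core-hour times capacity cores = USD per hour, per node
<CSP>_instance_cpu_usd_per_core_hour
  * on (node) group_left ()
    kube_node_status_capacity{resource="cpu", unit="core"}
```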

> 2. Potential of running many collectors with different labels at the same time which means we can't guarantee the use of a metric with a specific query

(?) Sure, as an operator you can always shoot yourself in the foot; nevertheless, top-down label setting currently allows only one entry point for it.

> 3. Exposing us to having a configuration for every label and having a massive config struct to run the application

I never proposed that -- it's this particular label I'm concerned about, since we know [*] we have disjoint sets of metrics (ec2, kube) that inevitably use either instance (sorry for that one, as it'll always be exported_instance unless changed at scrape time) xor node.

> I'd really like for us to get alignment on what the root problem is that we're trying to address (in this case, instance doesn't appear to be the appropriate label; it should be node), and then agree on a solution. Whichever path we take is going to require us to rewrite our internal PromQL queries and do a migration of the metrics.

Already replied in the previous paragraph [*].

jjo commented 3 months ago


Indeed, one aspect where I'd really like cloudcost-exporter to shine compared with other existing solutions is being as "friendly" as possible to the TSDB, requiring fewer recording rules and scraping artifacts (relabelling, building new derived metrics), for obvious reasons.
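For illustration, this is the kind of per-provider recording rule (hypothetical rule group and recorded metric names) that operators would otherwise have to maintain just to make the cost metrics joinable:

```yaml
groups:
  - name: cloudcost-node-join            # hypothetical rule group
    rules:
      # materialize a node-labelled copy of every cost series
      - record: cloudcost:instance_cpu_usd_per_core_hour:node
        expr: label_join(<CSP>_instance_cpu_usd_per_core_hour, "node", "", "exported_instance")
```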