grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

external_labels.cluster should be more prominently configurable #550

Open flokli opened 2 years ago

flokli commented 2 years ago

I'm using the cluster external label to distinguish metrics/logs pushed from several clusters.

The cluster label seems to get derived from the namespace and resource name of the GrafanaAgent resource.

If I change the resource name of the GrafanaAgent to $clustername, the resulting pods are somewhat confusingly named ($clustername-0, $clustername-logs-*), and there's no reference to them being grafana-agent anymore. Maybe we could always prefix them with grafana-agent?

There are fields to change the name of the label itself from cluster to something else (metricsExternalLabelName, logsExternalLabelName).

I might be able to set these labels by setting grafanaagent.spec.metrics.externalLabels (if it takes priority; I didn't check), but there's no option to set labels for the grafanaagent.spec.logs hierarchy. Maybe this was missed when designing the CRD?

Maybe we should expose a grafanaagent.spec.clusterName field that, if set, is used instead of ${grafanaagent.spec.metadata.namespace}/${grafanaagent.spec.metadata.name}?

rfratto commented 2 years ago

Thanks for opening the issue! I'm stoked that you're using the operator and giving really useful feedback. I was able to follow this issue, but I'd like it if, in the future, you could give small example YAML snippets of the changes to the CRDs you're proposing. I struggle to remember all the fields in the CRDs, and seeing a small example would help me visualize the suggestions a bit better. Just a light suggestion though; I'm not going to start ignoring you if you don't do this :)

Maybe we could always prefix them with grafana-agent?

Personally I think this is something the user should be responsible for: I'm not sure everyone will want that prefix on the label, and ensuring that the GrafanaAgent CR is named with the prefix seems like a reasonable workaround.

I might be able to set these labels by setting grafanaagent.spec.metrics.externalLabels (if it takes priority, didn't check), but there's no option to set labels for the grafanaagent.spec.logs hierarchy. Maybe this was missed when designing the CRD?

Setting external labels should take precedence :)

Unfortunately Promtail and Prometheus put the external label concept in different places: Prometheus adds it globally across all remote_write, while Promtail adds it per client. If you look at the spec for the logging clients, you should see externalLabels there.
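
Roughly, the two native configs look like this (a sketch with placeholder label values and the endpoint URLs from the examples below):

# Prometheus: external_labels is global and applies to all remote_write endpoints.
global:
  external_labels:
    cluster: my-cluster

remote_write:
  - url: http://prometheus:9090/api/v1/write

# Promtail: external_labels is configured per client.
clients:
  - url: http://loki:8080/loki/api/v1/push
    external_labels:
      cluster: my-cluster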

flokli commented 2 years ago

I was able to follow this issue, but I'd like it if, in the future, you could give small example YAML snippets of the changes to the CRDs you're proposing. I struggle to remember all the fields in the CRDs, and seeing a small example would help me visualize the suggestions a bit better. Just a light suggestion though; I'm not going to start ignoring you if you don't do this :)

Sorry. I was using the notation that can be passed to kubectl explain, assuming that would give enough context, but I realize it's indeed a bit cryptic. I'll add some examples this time. :smile:

Thanks for explaining the history behind the external labels being in different places. IMHO, I'd be fine with the operator abstracting this detail away from me, so that I'd set both at the same hierarchy depth. I don't know whether there are customers who want to set this at a per-remote-write level, but if there are, they probably want to do it for both logs and metrics (so it should be PR'ed into Prometheus) - and if there aren't, it probably shouldn't be exposed.

As for the cluster label, I'm fine if the name of the GrafanaAgent resource gets picked as a default, if I don't explicitly configure a value.

But we already have the grafanaagent.spec.metrics.metricsExternalLabelName and grafanaagent.spec.logs.logsExternalLabelName fields to change "cluster" to something else. It's really unintuitive that I need to use the externalLabels loophole (grafanaagent.spec.metrics.externalLabels and logsinstances.spec.clients.externalLabels) to configure the value. Even more confusingly, it's in the "parent CR" for metrics, but in the "child CR" for logs.

So my proposal would be to introduce grafanaagent.spec.metrics.metricsExternalLabelValue and grafanaagent.spec.logs.logsExternalLabelValue. I'd even argue we should rename them to {metrics,logs}.clusterLabel{Name,Value}: this isn't just any external label, but specifically the one expressing the name of the cluster. We can already set many other external labels via the externalLabels fields, and the current name is confusingly close to that one.

We also don't need the {logs,metrics} prefix, as the field already lives under its respective namespace.

As expressed above, for the sake of consistency, drop logsinstances.spec.clients.externalLabels and move it to grafanaagent.spec.logs.externalLabels. There's probably no point in being able to specify per log remote write endpoint labels, if you can't do the same for metrics.

I constructed an example. I used the first three resources in cmd/agent-operator/agent-example-config.yaml as a starting point, then applied the following changes on top:

Before:

apiVersion: monitoring.grafana.com/v1alpha1
kind: GrafanaAgent
metadata:
  name: grafana-agent-example
  namespace: default
  labels:
    app: grafana-agent-example
spec:
  image: grafana/agent:latest
  logLevel: info
  serviceAccountName: grafana-agent
  storage:
    volumeClaimTemplate:
      spec:
        resources:
          requests:
            storage: 1Gi
  logs:
    logsExternalLabelName: cluster2
    instanceSelector:
      matchLabels:
        agent: grafana-agent-example
  metrics:
    metricsExternalLabelName: cluster2
    externalLabels:
      cluster2: my-cluster
      foo: bar
    instanceSelector:
      matchLabels:
        agent: grafana-agent-example

---

apiVersion: monitoring.grafana.com/v1alpha1
kind: MetricsInstance
metadata:
  name: primary
  namespace: default
  labels:
    agent: grafana-agent-example
spec:
  remoteWrite:
  - url: http://prometheus:9090/api/v1/write
    basicAuth:
      username:
        name: prometheus-fake-credentials
        key: fakeUsername
      password:
        name: prometheus-fake-credentials
        key: fakePassword
  # Supply an empty namespace selector to look in all namespaces.
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector:
    matchLabels:
      instance: primary

---

apiVersion: monitoring.grafana.com/v1alpha1
kind: LogsInstance
metadata:
  name: primary
  namespace: default
  labels:
    agent: grafana-agent-example
spec:
  clients:
  - url: http://loki:8080/loki/api/v1/push
    externalLabels:
      cluster2: my-cluster
      foo: bar

  # Supply an empty namespace selector to look in all namespaces.
  podLogsNamespaceSelector: {}
  podLogsSelector:
    matchLabels:
      instance: primary

After:

apiVersion: monitoring.grafana.com/v1alpha1
kind: GrafanaAgent
metadata:
  name: grafana-agent-example
  namespace: default
  labels:
    app: grafana-agent-example
spec:
  image: grafana/agent:latest
  logLevel: info
  serviceAccountName: grafana-agent
  storage:
    volumeClaimTemplate:
      spec:
        resources:
          requests:
            storage: 1Gi
  logs:
    clusterLabelName: cluster2
    clusterLabelValue: my-cluster
    externalLabels:
      foo: bar
    instanceSelector:
      matchLabels:
        agent: grafana-agent-example
  metrics:
    clusterLabelName: cluster2
    clusterLabelValue: my-cluster
    externalLabels:
      foo: bar
    instanceSelector:
      matchLabels:
        agent: grafana-agent-example

---

apiVersion: monitoring.grafana.com/v1alpha1
kind: MetricsInstance
metadata:
  name: primary
  namespace: default
  labels:
    agent: grafana-agent-example
spec:
  remoteWrite:
  - url: http://prometheus:9090/api/v1/write
    basicAuth:
      username:
        name: prometheus-fake-credentials
        key: fakeUsername
      password:
        name: prometheus-fake-credentials
        key: fakePassword
  # Supply an empty namespace selector to look in all namespaces.
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector:
    matchLabels:
      instance: primary

---

apiVersion: monitoring.grafana.com/v1alpha1
kind: LogsInstance
metadata:
  name: primary
  namespace: default
  labels:
    agent: grafana-agent-example
spec:
  clients:
  - url: http://loki:8080/loki/api/v1/push

  # Supply an empty namespace selector to look in all namespaces.
  podLogsNamespaceSelector: {}
  podLogsSelector:
    matchLabels:
      instance: primary

Diff:

@@ -16,14 +16,17 @@
           requests:
             storage: 1Gi
   logs:
-    logsExternalLabelName: cluster2
+    clusterLabelName: cluster2
+    clusterLabelValue: my-cluster
+    externalLabels:
+      foo: bar
     instanceSelector:
       matchLabels:
         agent: grafana-agent-example
   metrics:
-    metricsExternalLabelName: cluster2
+    clusterLabelName: cluster2
+    clusterLabelValue: my-cluster
     externalLabels:
-      cluster2: my-cluster
       foo: bar
     instanceSelector:
       matchLabels:
@@ -66,8 +69,6 @@
 spec:
   clients:
   - url: http://loki:8080/loki/api/v1/push
-    externalLabels:
-      cluster2: my-cluster

   # Supply an empty namespace selector to look in all namespaces.
   podLogsNamespaceSelector: {}

rfratto commented 2 years ago

(FYI, I really appreciate all the detail you gave above. I have it on my task list to try to respond to this ASAP, but I might not be able to give this the attention it deserves until next week)

NissesSenap commented 2 years ago

@rfratto I'm guessing you are on PTO right now, but have you had time to look into this? Is the plan that we should create an RFC for this issue after https://github.com/grafana/agent/pull/1055 is merged?

Really great explanation @flokli, it saved me a lot of time.

rfratto commented 2 years ago

Hey, sorry for the delay in responding, I was on PTO as you suspected :)

I've read through @flokli's proposal and I'm not opposed. I'm starting to shift my thinking towards agreeing that making usage of the operator more consistent is more important than giving the operator a 1:1 mapping of things you can configure with the agent directly.

I don't think this needs an RFC since it's not a huge change, and this issue can serve as the proposal.

I'll try to re-summarize what's being proposed and CC some relevant people.

rfratto commented 2 years ago

All credit to @flokli for this, I'm just trying to re-summarize in smaller chunks.

Proposal

  1. Allow users to explicitly override the cluster name in the CRDs. Today it is inferred from the name of the GrafanaAgent CR, and you may only change the name of the cluster label, not its value.
  2. Remove per-LogsInstance configuration of external_labels in favor of a global setting. This is for consistency with metrics, where external_labels is defined globally.

Overriding cluster name

Introduce two new fields into the GrafanaAgent CRD:

  - spec.metrics.metricsExternalLabelValue
  - spec.logs.logsExternalLabelValue

The final field names may drop the duplicated metrics/logs prefix, since the prefix can be inferred from the parent object name.
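
For example, a sketch of the proposed fields in context (label name and value are placeholders):

spec:
  metrics:
    metricsExternalLabelName: cluster        # existing field
    metricsExternalLabelValue: my-cluster    # proposed field
  logs:
    logsExternalLabelName: cluster           # existing field
    logsExternalLabelValue: my-cluster       # proposed field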

Remove per-LogsInstance configuration of external_labels

Remove spec.clients.externalLabels from LogsInstances, and move to GrafanaAgent's spec.logs.externalLabels. This is a breaking change. As @flokli stipulates:

There's probably no point in being able to specify per log remote write endpoint labels, if you can't do the same for metrics.
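
Concretely, the change would look like this (a sketch; the label value is a placeholder):

# Before: per-client on the LogsInstance
spec:
  clients:
  - url: http://loki:8080/loki/api/v1/push
    externalLabels:
      cluster: my-cluster

# After: global on the GrafanaAgent
spec:
  logs:
    externalLabels:
      cluster: my-cluster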

Example

Refer to the original comment for a diff which demonstrates the changes to the CRDs.

rfratto commented 2 years ago

CC'ing other relevant people in case they have some opinions on the proposal: @mattdurham @hjet @rlankfo

Our consensus model is rough consensus, so nobody needs to wait for broader consensus if they are interested in implementing these changes; I'm personally fine with what's described here.

Unfortunately Promtail and Prometheus put the external label concept in different places: Prometheus adds it globally across all remote_write, while Promtail adds it per client. If you look at the spec for the logging clients, you should see externalLabels there.

There's probably no point in being able to specify per log remote write endpoint labels, if you can't do the same for metrics.

@slim-bean Can you explain why Promtail supports setting external_labels per client rather than globally like Prometheus? Is there a specific use case where it makes sense to have different labels per client?

We're considering not exposing that per-client setting in Grafana Agent Operator and I want to make sure I understand the original intent.
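
For clarity, "different labels per client" would mean something like this hypothetical Promtail setup (URLs and labels are placeholders):

clients:
  - url: http://loki:8080/loki/api/v1/push
    external_labels:
      environment: internal
  - url: http://loki-external:8080/loki/api/v1/push
    external_labels:
      environment: external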

mattdurham commented 2 years ago

100% in favor of both 1 and 2. Having consistency between the various subsystems is a win even if not entirely in line with those subsystems' original intent.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed in 7 days if there is no new activity. Thank you for your contributions!

flokli commented 2 years ago

Not stale.

rfratto commented 2 years ago

We're moving pretty slowly on this since we don't have a lot of people maintaining the operator. I'm going to mark this as keepalive for now.

rfratto commented 2 years ago

This will likely be superseded by grafana/agent#1565; your feedback would be more than welcome on that RFC :)

gregeva commented 9 months ago

Thank you @flokli - your detailed description of the situation, as well as your code samples, allowed me to get the Grafana Agent Operator working when I was getting ready to throw in the towel.