grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

mimir.rules.kubernetes does not support federated rule groups source_tenants #216

Open GroovyCarrot opened 5 months ago

GroovyCarrot commented 5 months ago

Request

The Mimir ruler supports federated rule groups via source_tenants; however, this is not supported in the mimir.rules.kubernetes component arguments.
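For reference, a federated rule group on the Mimir side looks roughly like this (a sketch from the Mimir ruler docs; group and tenant names are placeholders, and if I remember correctly the ruler also needs tenant federation enabled):

name: cluster-usage
# Evaluates this group's queries against the union of the listed tenants' data.
source_tenants:
  - tenant1
  - tenant2
rules:
  - record: namespace:container_cpu_usage:sum
    expr: sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))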

This could easily be added to the River configuration, e.g.:

mimir.rules.kubernetes "tenant1" {
  address = "http://mimir-ruler:8080"

  tenant_id = "tenant1"
  source_tenants = [
    "tenant1",
    "tenant2",
  ]
}

Alternatively, and more flexibly, it could be obtained from an annotation on the PrometheusRule resource:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tenant1-group1
  annotations:
    monitoring.grafana.com/source_tenants: tenant1,tenant2
spec:
  groups:
  - ...

Use case

For a tenant to query cluster metrics, e.g. pod CPU/memory usage, they need to be able to query the platform metrics tenant as well as their own.

GroovyCarrot commented 5 months ago

I've raised https://github.com/grafana/agent/pull/6669, which adds support for the monitoring.grafana.com/source_tenants annotation, since federating every query for a tenant via the River config didn't make sense to me.

bcrisp4 commented 4 months ago

I am also very interested in this feature.

QuentinBisson commented 1 month ago

I would think it makes more sense for Grafana (or another party) to create a new set of CRs for this use case, because PrometheusRule objects are already being reused for Loki (without Prometheus Operator validation). It would probably be "easy" to add AlertingRule and RecordingRule CRDs to Alloy, similar to what is done in https://github.com/grafana/loki/pull/5985/files#diff-76ed4f38ca6a465f642b0789e620c9df313b0238addf2cc7353a9980011f7674R129, with sourceTenants as a field in the spec instead of an annotation.

In the same way, the rule could have a backend field with a type like loki, prometheus, or tempo (once Tempo supports rules), or alternatively logql, promql, or traceql. That would make it easy to build a validating webhook that parses the expression based on the backend type; a sketch of the shape follows.
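To make that concrete, a purely hypothetical instance could look something like this (the API group, kind, and field names are invented for illustration; no such API exists today):

apiVersion: monitoring.grafana.com/v1alpha1  # hypothetical API group/version
kind: AlertingRule                           # hypothetical kind
metadata:
  name: tenant1-group1
spec:
  backend: prometheus   # or loki; tempo once it supports rules
  sourceTenants:
    - tenant1
    - tenant2
  groups:
    - name: group1
      rules:
        - alert: HighPodCPU
          expr: sum by (pod) (rate(container_cpu_usage_seconds_total[5m])) > 1
          for: 10m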

@rfratto do you think this idea makes sense?

GroovyCarrot commented 1 month ago

Hmm, I've also just found that the PrometheusRules sync doesn't appear to work at all in Alloy.

If I go to the Alloy UI at /component/mimir.rules.kubernetes.rules to try to debug it, I get a blank page on 1.2.0 and 1.3.0 with the following panic:


2024/08/08 07:54:20 http: panic serving 127.0.0.1:37404: runtime error: invalid memory address or nil pointer dereference
goroutine 158762077 [running]:
net/http.(*conn).serve.func1()
    /usr/local/go/src/net/http/server.go:1898 +0xbe
panic({0x6d463c0?, 0xd9cbb60?})
    /usr/local/go/src/runtime/panic.go:770 +0x132
github.com/grafana/alloy/internal/component/mimir/rules/kubernetes.(*eventProcessor).getMimirState(0x0)
    /src/alloy/internal/component/mimir/rules/kubernetes/events.go:240 +0x3f
github.com/grafana/alloy/internal/component/mimir/rules/kubernetes.(*Component).DebugInfo(0xc002d76008)
    /src/alloy/internal/component/mimir/rules/kubernetes/debug.go:26 +0x55
github.com/grafana/alloy/internal/runtime/internal/controller.(*BuiltinComponentNode).DebugInfo(0x6c78400?)
    /src/alloy/internal/runtime/internal/controller/node_builtin_component.go:420 +0xb2
github.com/grafana/alloy/internal/runtime.(*Runtime).getComponentDetail(0xc00047eff0, {0x92d6918, 0xc002c87688}, 0xc015543ce0, {0xf0?, 0xb8?, 0xc2?, 0x16?})
    /src/alloy/internal/runtime/alloy_components.go:128 +0xab4
github.com/grafana/alloy/internal/runtime.(*Runtime).GetComponent(0xc00047eff0, {{0x0, 0x0}, {0xc00c424a9b, 0x1c}}, {0xf8?, 0x29?, 0x38?, 0xc8?})
    /src/alloy/internal/runtime/alloy_components.go:37 +0x2b4
github.com/grafana/alloy/internal/web/api.(*AlloyAPI).RegisterRoutes.(*AlloyAPI).getComponentHandler.func3({0x926f570, 0xc015543cc0}, 0xc0040f66c0)
    /src/alloy/internal/web/api/api.go:79 +0xa6
net/http.HandlerFunc.ServeHTTP(0x926f510?, {0x926f570?, 0xc015543cc0?}, 0xc0109f7188?)
    /usr/local/go/src/net/http/server.go:2166 +0x29
github.com/prometheus/prometheus/util/httputil.CompressionHandler.ServeHTTP({{0x920bdc0?, 0xc003842490?}}, {0x926f510?, 0xc011175f20?}, 0xc0040f66c0)
    /go/pkg/mod/github.com/grafana/prometheus@v1.8.2-0.20240514135907-13889ba362e6/util/httputil/compression.go:91 +0x63
github.com/gorilla/mux.(*Router).ServeHTTP(0xc003854300, {0x926f510, 0xc011175f20}, 0xc0062dcd80)
    /go/pkg/mod/github.com/gorilla/mux@v1.8.1/mux.go:212 +0x1e2
go.opentelemetry.io/contrib/instrumentation/github.com/gorilla/mux/otelmux.traceware.ServeHTTP({{0x7ec461a, 0x5}, {0x92353c8, 0xc00386e140}, {0x92695f0, 0xc002867d40}, {0x920c700, 0xc003854300}, 0x85d7400, 0x0, ...}, ...)
    /go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/github.com/gorilla/mux/otelmux@v0.45.0/mux.go:179 +0xbb2
github.com/gorilla/mux.(*Router).ServeHTTP(0xc003854240, {0x9263770, 0xc0039421c0}, 0xc0062dc000)
    /go/pkg/mod/github.com/gorilla/mux@v1.8.1/mux.go:212 +0x1e2
golang.org/x/net/http2/h2c.h2cHandler.ServeHTTP({{0x920c700?, 0xc003854240?}, 0xc00384c240?}, {0x9263770, 0xc0039421c0}, 0xc0062dc000)
    /go/pkg/mod/golang.org/x/net@v0.26.0/http2/h2c/h2c.go:125 +0x697
net/http.serverHandler.ServeHTTP({0xc016c2b6b0?}, {0x9263770?, 0xc0039421c0?}, 0x6?)
    /usr/local/go/src/net/http/server.go:3137 +0x8e
net/http.(*conn).serve(0xc01e112d80, {0x9288ba8, 0xc00384fc80})
    /usr/local/go/src/net/http/server.go:2039 +0x5e8
created by net/http.(*Server).Serve in goroutine 530
    /usr/local/go/src/net/http/server.go:3285 +0x4b4

Also, upgrading from 1.2.0 to 1.3.0 seems to break clustering randomly, making this version unusable for us:

{"ts":"2024-08-08T09:49:12.182927738Z","level":"error","msg":"fatal error: failed to get peers to join at startup - this is likely a configuration error","service":"cluster","err":"static peer discovery: failed to find any valid join addresses: failed to extract host and port: address cluster-0.mimir.svc.cluster.local: missing port in address\nfailed to resolve SRV records: cluster-0.mimir.svc.cluster.local on 192.168.0.10:53: no such host"}

We use a headless Service plus a Deployment.

▶ kubectl get services/cluster-0 -o yaml
apiVersion: v1
kind: Service
metadata:
  name: cluster-0
  namespace: mimir
spec:
  clusterIP: None
  clusterIPs:
  - None
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: http
    port: 8080
    protocol: TCP
    targetPort: http
  selector:
    app.kubernetes.io/instance: cluster-0
    app.kubernetes.io/name: grafana-agent
  sessionAffinity: None
  type: ClusterIP
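
If the peers come from --cluster.join-addresses, one untested guess based on the "missing port in address" error is to append the service port explicitly:

# Hypothetical Deployment args excerpt; appending the port is a guess,
# not a confirmed fix.
containers:
  - name: alloy
    args:
      - run
      - --cluster.enabled=true
      - --cluster.join-addresses=cluster-0.mimir.svc.cluster.local:8080
      - /etc/alloy/config.alloy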

GroovyCarrot commented 3 weeks ago

I found the solution to the first issue: RBAC for the PrometheusRule CRD had not been added to the cluster for the agents. There didn't seem to be a log line for the access being denied, but adding the RBAC does fix the UI and shows the component as healthy.
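For anyone hitting the same thing, the missing RBAC was roughly of this shape (role and service account names below are placeholders for our setup):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: alloy-prometheusrules   # placeholder name
rules:
  - apiGroups: ["monitoring.coreos.com"]
    resources: ["prometheusrules"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: alloy-prometheusrules   # placeholder name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: alloy-prometheusrules
subjects:
  - kind: ServiceAccount
    name: alloy                 # placeholder; the agents' service account
    namespace: mimir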

I'll keep looking into the second issue, but as it stands we can't upgrade to 1.3.0, as it causes a lot of failures for agents starting up: not in all cases, but still in several.


Back to the original issue: this feature has become more immediately important to us, and I'd rather not maintain a fork carrying an implementation that isn't the desired design.

If we can agree on a design, I can implement it. If we want a custom CRD with the source_tenants field on it, that can be done, though it will be a breaking change: it would probably require creating a new component and deprecating the existing one, and it will be some pain to get existing users to migrate.

Alternatively, if the annotation is an acceptable stepping stone for this component, I can rework that implementation for Alloy, provided it would be merged.

@QuentinBisson @rfratto what do you think?

QuentinBisson commented 2 weeks ago

I personally think a custom rule CRD for Alloy makes more sense in the long run. Alloy could even generate the AlertingRule CR from the PrometheusRule CR to keep feature parity, for several reasons: PromQL/LogQL validation and semantics, for instance, and the Prometheus Operator keeps adding features that would otherwise also need to be maintained in Alloy. That said, an intermediate step with the annotation looks like a good middle ground for now, and it could even be translated to other CRDs later if the Alloy team thinks that makes sense.

Also, I do not work at Grafana, so I would strongly advise you to wait for an Alloy maintainer's opinion on this :)