grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

Alloy Health integration does not register the pre-bundled alert rules #1446

Open edtshuma opened 2 months ago

edtshuma commented 2 months ago

What's wrong?

I have enabled the Grafana Alloy Health integration. As per the documentation, if this is enabled the deployment should also contain some default alert rules. I cannot see any of these alerts in my environment.

Steps to reproduce

Install the Grafana Alloy Helm chart with the following HelmRelease:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  name: alloy-cluster
  namespace: monitoring
spec:
  chart:
    spec:
      chart: 3rdparty/grafana/alloy
      sourceRef:
        kind: HelmRepository
        name: orion
        namespace: flux-system
      version: '0.6.0'
  dependsOn:
```

Check for the alert rules using kubectl or in the Grafana (Mimir) UI under Alert rules, for example:
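(A rough sketch of such checks; it assumes the bundled rules would show up either as prometheus-operator PrometheusRule objects in the cluster or as rule groups in the Mimir ruler, and the gateway address and tenant ID are placeholders.)

```bash
# Alert rules delivered as Kubernetes objects (prometheus-operator CRD):
kubectl get prometheusrules -A

# Rule groups loaded into the Mimir ruler
# (--address and --id are placeholders for this environment):
mimirtool rules list \
  --address=http://mimir-gateway \
  --id=<tenant-id>
```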

System information

Kubernetes server version: v1.29.6-eks-db838b0

Software version

Grafana Alloy v1.3.0

Configuration

```
logging {
  level    = "debug" // during initial rollout, then switch to info or higher
  format   = "logfmt"
  write_to = [loki.process.cluster.receiver]
}

prometheus.exporter.self "default" { }

prometheus.scrape "alloy" {
  targets    = prometheus.exporter.self.default.targets
  forward_to = [prometheus.relabel.default.receiver]
}

loki.process "cluster" {
  forward_to = [loki.write.default.receiver]

  stage.static_labels {
    values = {
      cluster_name = "${cluster_name}",
    }
  }
}

loki.source.api "default" {
  http {
    listen_port = 3100
  }

  forward_to = [loki.process.cluster.receiver]

  use_incoming_timestamp = true
}

loki.source.kubernetes "pod_logs" {
  targets    = discovery.relabel.pod_logs.output
  forward_to = [loki.process.cluster.receiver]

  clustering {
    enabled = true
  }
}

loki.write "default" {
  // Send logs to a locally running Loki.
  endpoint {
    url     = "http://loki-gateway/loki/api/v1/push"
    headers = {
      "X-Scope-OrgID" = "fake",
    }

    max_backoff_retries = 288 // 288 * max_backoff_period (5m) = 24h
  }
}

// Metrics / Mimir / Prometheus
prometheus.receive_http "default" {
  // listen on port 9090 and forward metrics for cleanup
  http {
    listen_port = 9090
  }

  forward_to = [prometheus.relabel.default.receiver]
}

prometheus.relabel "cluster" {
  forward_to = [prometheus.remote_write.default.receiver]

  rule {
    action       = "replace"
    replacement  = "${cluster_name}"
    target_label = "cluster_name"
  }
}

prometheus.remote_write "default" {
  // Send metrics to a locally running Mimir.
  endpoint {
    url = "http://mimir-gateway/api/v1/push"
  }
}

prometheus.operator.podmonitors "default" {
  forward_to = [prometheus.relabel.default.receiver]

  clustering {
    enabled = true
  }
}

prometheus.operator.servicemonitors "default" {
  forward_to = [prometheus.relabel.default.receiver]

  clustering {
    enabled = true
  }
}

otelcol.receiver.jaeger "default" {
  protocols {
    grpc { }

    thrift_http { }

    thrift_binary { }

    thrift_compact { }
  }

  output {
    traces = [otelcol.processor.memory_limiter.default.input]
  }
}

otelcol.receiver.otlp "default" {
  http { }

  grpc { }

  output {
    metrics = [otelcol.processor.memory_limiter.default.input]
    logs    = [otelcol.processor.memory_limiter.default.input]
    traces  = [otelcol.processor.memory_limiter.default.input]
  }
}

otelcol.receiver.otlp "public" {
  http {
    endpoint = "0.0.0.0:4139"

    include_metadata = true

    cors {
      allowed_headers = ["Content-type"]
      allowed_origins = ["*"]
    }
  }

  output {
    metrics = [otelcol.processor.attributes.public.input]
    logs    = [otelcol.processor.attributes.public.input]
    traces  = [otelcol.processor.attributes.public.input]
  }
}

otelcol.receiver.zipkin "default" {
  output {
    traces = [otelcol.processor.memory_limiter.default.input]
  }
}

// Processors
otelcol.processor.attributes "cluster" {
  action {
    key    = "otelendpoint"
    action = "insert"
    value  = "${cluster_name}"
  }
  action {
    key          = "http.client_ip"
    action       = "upsert"
    from_context = "X-Forwarded-For"
  }

  output {
    metrics = [otelcol.exporter.prometheus.default.input]
    logs    = [otelcol.exporter.loki.default.input]
    traces  = [otelcol.exporter.otlp.default.input]
  }
}

otelcol.processor.attributes "public" {
  action {
    key    = "cluster_name"
    action = "insert"
    value  = "${cluster_name}"
  }

  output {
    metrics = [otelcol.processor.memory_limiter.default.input]
    logs    = [otelcol.processor.memory_limiter.default.input]
    traces  = [otelcol.processor.memory_limiter.default.input]
  }
}

otelcol.processor.batch "default" {
  output {
    metrics = [otelcol.processor.attributes.cluster.input]
    logs    = [otelcol.processor.attributes.cluster.input]
    traces  = [otelcol.processor.attributes.cluster.input]
  }
}

otelcol.processor.memory_limiter "default" {
  check_interval = "5s"
  limit = "2GiB"
  output {
    metrics = [otelcol.processor.batch.default.input]
    logs    = [otelcol.processor.batch.default.input]
    traces  = [otelcol.processor.batch.default.input]
  }
}

// Exporters
otelcol.exporter.loki "default" {
  forward_to = [loki.write.default.receiver]
}

otelcol.exporter.prometheus "default" {
  forward_to = [prometheus.relabel.default.receiver]
}

otelcol.exporter.otlp "default" {
  client {
    endpoint = "tempo-distributor:4317"

    tls {
      insecure = true
    }
  }
}
```

Logs

No response

gaantunes commented 2 months ago

Did you install the integration on your Grafana Cloud Connections menu, as described here?

edtshuma commented 2 months ago

> Did you install the integration on your Grafana Cloud Connections menu, as described here?

I am not running on Grafana Cloud. My installation is on AWS EKS.

gaantunes commented 2 months ago

Unfortunately this integration is only applicable to Grafana Cloud, but you can use the integration mixin (the dashboards/alerts package) with your self-hosted Grafana using Grizzly:

`grr apply mixin.libsonnet`
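For example, a rough sketch of what that could look like against a self-hosted stack (the mixin path inside the Alloy repo and the Grizzly/Mimir environment variables are assumptions; adjust for your environment):

```bash
# Grizzly reads the target Grafana instance from environment variables.
export GRAFANA_URL=http://grafana.monitoring.svc:3000   # placeholder URL
export GRAFANA_TOKEN=<service-account-token>            # placeholder token

# Recent Grizzly versions can also push rule groups to a Mimir ruler;
# these variables are assumptions about such a setup.
export MIMIR_ADDRESS=http://mimir-gateway
export MIMIR_TENANT_ID=<tenant-id>

# Fetch the mixin sources (the path is an assumption; check the repo layout),
# vendor its jsonnet dependencies if needed, then apply the package.
git clone https://github.com/grafana/alloy.git
cd alloy/operations/alloy-mixin
jb install   # jsonnet-bundler, if the mixin has vendored dependencies
grr apply mixin.libsonnet
```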

edtshuma commented 2 months ago

How and where do I run this command from? My Alloy is installed as a set of Helm releases (via Flux/Kustomize) with the following resources:

Cluster Helm release: config-cluster.yaml, release-cluster.yaml

Node Helm release: config-node.yaml, release-node.yaml

and finally a kustomization.yaml file:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
configMapGenerator:
  - behavior: create
    files:
      - config-cluster.alloy
      - config-node.alloy
    name: alloy-config
    namespace: monitoring
    options:
      disableNameSuffixHash: true
kind: Kustomization
resources:
  - release-cluster.yaml
  - release-node.yaml
```

And Alloy is deployed as a StatefulSet:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    meta.helm.sh/release-name: alloy-cluster
    meta.helm.sh/release-namespace: monitoring
  labels:
    app.kubernetes.io/instance: alloy-cluster
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: alloy-cluster
    app.kubernetes.io/part-of: alloy
    app.kubernetes.io/version: v1.3.0
    helm.sh/chart: alloy-0.6.0
    helm.toolkit.fluxcd.io/name: alloy-cluster
    helm.toolkit.fluxcd.io/namespace: monitoring
  name: alloy-cluster
  namespace: monitoring
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete
    whenScaled: Delete
  podManagementPolicy: Parallel
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/instance: alloy-cluster
      app.kubernetes.io/name: alloy-cluster
  serviceName: alloy-cluster
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/default-container: alloy
      labels:
        app.kubernetes.io/instance: alloy-cluster
        app.kubernetes.io/name: alloy-cluster
    spec:
      containers:
      - args:
        - run
        - /etc/alloy/config-cluster.alloy
        - --storage.path=/var/lib/alloy
        - --server.http.listen-addr=0.0.0.0:12345
        - --server.http.ui-path-prefix=/
        - --disable-reporting
        - --cluster.enabled=true
        - --cluster.join-addresses=alloy-cluster-cluster
        - --stability.level=generally-available
        env:
        - name: ALLOY_DEPLOY_MODE
          value: helm
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: 0123456789.dkr.ecr.eu-west-1.amazonaws.com/pullthrough/docker.io/grafana/alloy:v1.3.0
        imagePullPolicy: IfNotPresent
        name: alloy
        ports:
        - containerPort: 12345
          name: http-metrics
          protocol: TCP
        - containerPort: 3100
          name: http-loki
          protocol: TCP
        - containerPort: 4317
          name: grpc-otlp
          protocol: TCP
        - containerPort: 4318
          name: http-otlp
          protocol: TCP
        - containerPort: 4319
          name: grpc-otlp-pub
          protocol: TCP
        - containerPort: 9090
          name: http-prom
          protocol: TCP
        - containerPort: 9411
          name: zipkin
          protocol: TCP
        - containerPort: 6831
          name: thrift-compact
          protocol: UDP
        - containerPort: 6832
          name: thrift-binary
          protocol: UDP
        - containerPort: 14250
          name: jaeger-grpc
          protocol: TCP
        - containerPort: 14268
          name: thrift-http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /-/ready
            port: 12345
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          requests:
            cpu: 500m
            memory: 3Gi
        volumeMounts:
        - mountPath: /etc/alloy
          name: config
      - args:
        - --volume-dir=/etc/alloy
        - --webhook-url=http://localhost:12345/-/reload
        image: 0123456789.dkr.ecr.eu-west-1.amazonaws.com/pullthrough/ghcr.io/jimmidyson/configmap-reload:v0.12.0
        imagePullPolicy: IfNotPresent
        name: config-reloader
        resources:
          limits:
            cpu: 50m
            memory: 16Mi
          requests:
            cpu: 1m
            memory: 8Mi
        volumeMounts:
        - mountPath: /etc/alloy
          name: config
      dnsPolicy: ClusterFirst
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      serviceAccount: alloy-cluster
      serviceAccountName: alloy-cluster
      terminationGracePeriodSeconds: 30
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/instance: alloy-cluster
            app.kubernetes.io/name: alloy-cluster
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
      - labelSelector:
          matchLabels:
            app.kubernetes.io/instance: alloy-cluster
            app.kubernetes.io/name: alloy-cluster
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
      volumes:
      - configMap:
          defaultMode: 420
          name: alloy-config
        name: config
      - emptyDir: {}
        name: alloy-data
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: alloy-data
    spec:
      accessModes:
      - ReadWriteOncePod
      resources:
        requests:
          storage: 10Gi
      storageClassName: ebs-sc
      volumeMode: Filesystem
```

github-actions[bot] commented 1 month ago

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it. If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue. The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity. Thank you for your contributions!