grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

Readiness probe failed: connect: connection refused #197

Open PatMis16 opened 7 months ago

PatMis16 commented 7 months ago

What's wrong?

After deploying the agent to Kubernetes with the Helm chart, the readinessProbe consistently fails with a "connection refused" error. For this deployment we only use our own container registry, which acts as a proxy for the original container repository on Docker Hub.

Steps to reproduce

Deploy using these Helm chart values (original container registry):

```yaml
# -- Overrides the chart's name. Used to change the infix in the resource names.
nameOverride: null

# -- Overrides the chart's computed fullname. Used to change the full prefix of
# resource names.
fullnameOverride: null

## Global properties for image pulling override the values defined under `image.registry` and `configReloader.image.registry`.
## If you want to override only one image registry, use the specific fields but if you want to override them all, use `global.image.registry`
global:
  image:
    # -- Global image registry to use if it needs to be overridden for some specific use cases (e.g. local registries, custom images, ...)
    registry: "docker.io"

    # -- Optional set of global image pull secrets.
    pullSecrets: []

  # -- Security context to apply to the Grafana Agent pod.
  podSecurityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000

crds:
  # -- Whether to install CRDs for monitoring.
  create: false

# Various agent settings.
agent:
  # -- Mode to run Grafana Agent in. Can be "flow" or "static".
  mode: 'flow'
  configMap:
    # -- Create a new ConfigMap for the config file.
    create: false
    # -- Content to assign to the new ConfigMap.  This is passed into `tpl` allowing for templating from values.
    content: ''

    # -- Name of existing ConfigMap to use. Used when create is false.
    name: grafana-agent-config
    # -- Key in ConfigMap to get config from.
    key: config.river

  clustering:
    # -- Deploy agents in a cluster to allow for load distribution. Only
    # applies when agent.mode=flow.
    enabled: true

  # -- Path to where Grafana Agent stores data (for example, the Write-Ahead Log).
  # By default, data is lost between reboots.
  storagePath: /tmp/agent

  # -- Address to listen for traffic on. 0.0.0.0 exposes the UI to other
  # containers.
  listenAddr: 0.0.0.0

  # -- Port to listen for traffic on.
  listenPort: 80

  # -- Scheme is needed for readiness probes. If enabling tls in your configs, set to "HTTPS"
  listenScheme: HTTP

  # --  Base path where the UI is exposed.
  uiPathPrefix: /

  # -- Enables sending Grafana Labs anonymous usage stats to help improve Grafana
  # Agent.
  enableReporting: false

  # -- Extra environment variables to pass to the agent container.
  extraEnv: []

  # -- Maps all the keys on a ConfigMap or Secret as environment variables. https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#envfromsource-v1-core
  envFrom: []

  # -- Extra args to pass to `agent run`: https://grafana.com/docs/agent/latest/flow/reference/cli/run/
  extraArgs: []

  # -- Extra ports to expose on the Agent
  extraPorts: []
  # - name: "faro"
  #   port: 12347
  #   targetPort: 12347
  #   protocol: "TCP"

  mounts:
    # -- Mount /var/log from the host into the container for log collection.
    varlog: false
    # -- Mount /var/lib/docker/containers from the host into the container for log
    # collection.
    dockercontainers: false

    # -- Extra volume mounts to add into the Grafana Agent container. Does not
    # affect the watch container.
    extra:
      - name: gfagent-tmp
        mountPath: /tmp/agent

  # -- Security context to apply to the Grafana Agent container.
  securityContext:
    readOnlyRootFilesystem: true
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL

  # -- Resource requests and limits to apply to the Grafana Agent container.
  resources: {}

image:
  # -- Grafana Agent image registry (defaults to docker.io)
  registry: "docker.io"
  # -- Grafana Agent image repository.
  repository: grafana/agent
  # -- (string) Grafana Agent image tag. When empty, the Chart's appVersion is
  # used.
  tag: v0.40.3
  # -- Grafana Agent image's SHA256 digest (either in format "sha256:XYZ" or "XYZ"). When set, will override `image.tag`.
  digest: null
  # -- Grafana Agent image pull policy.
  pullPolicy: IfNotPresent
  # -- Optional set of image pull secrets.
  pullSecrets: []

rbac:
  # -- Whether to create RBAC resources for the agent.
  create: false

serviceAccount:
  # -- Whether to create a service account for the Grafana Agent deployment.
  create: true
  # -- Additional labels to add to the created service account.
  additionalLabels: {}
  # -- Annotations to add to the created service account.
  annotations: {}
  # -- The name of the existing service account to use when
  # serviceAccount.create is false.
  name: null

# Options for the extra controller used for config reloading.
configReloader:
  # -- Enables automatically reloading when the agent config changes.
  enabled: true
  image:
    # -- Config reloader image registry (defaults to docker.io)
    registry: "ghcr.io"
    # -- Repository to get config reloader image from.
    repository: jimmidyson/configmap-reload
    # -- Tag of image to use for config reloading.
    tag: v0.12.0
    # -- SHA256 digest of image to use for config reloading (either in format "sha256:XYZ" or "XYZ"). When set, will override `configReloader.image.tag`
    digest: ""
  # -- Override the args passed to the container.
  customArgs: []
  # -- Resource requests and limits to apply to the config reloader container.
  resources:
    requests:
      cpu: "1m"
      memory: "5Mi"
  # -- Security context to apply to the Grafana configReloader container.
  securityContext:
    readOnlyRootFilesystem: true
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL

controller:
  # -- Type of controller to use for deploying Grafana Agent in the cluster.
  # Must be one of 'daemonset', 'deployment', or 'statefulset'.
  type: 'deployment'

  # -- Number of pods to deploy. Ignored when controller.type is 'daemonset'.
  replicas: 2

  # -- Annotations to add to controller.
  extraAnnotations: {}

  # -- Whether to deploy pods in parallel. Only used when controller.type is
  # 'statefulset'.
  parallelRollout: true

  # -- Configures Pods to use the host network. When set to true, the ports that will be used must be specified.
  hostNetwork: false

  # -- Configures Pods to use the host PID namespace.
  hostPID: false

  # -- Configures the DNS policy for the pod. https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-policy
  dnsPolicy: ClusterFirst

  # -- Update strategy for updating deployed Pods.
  updateStrategy: {}

  # -- nodeSelector to apply to Grafana Agent pods.
  nodeSelector: {}

  # -- Tolerations to apply to Grafana Agent pods.
  tolerations: []

  # -- Topology Spread Constraints to apply to Grafana Agent pods.
  topologySpreadConstraints: []

  # -- priorityClassName to apply to Grafana Agent pods.
  priorityClassName: ''

  # -- Extra pod annotations to add.
  podAnnotations: {}

  # -- Extra pod labels to add.
  podLabels: {}

  # -- Whether to enable automatic deletion of stale PVCs due to a scale down operation, when controller.type is 'statefulset'.
  enableStatefulSetAutoDeletePVC: false

  autoscaling:
    # -- Creates a HorizontalPodAutoscaler for controller type deployment.
    enabled: false
    # -- The lower limit for the number of replicas to which the autoscaler can scale down.
    minReplicas: 2
    # -- The upper limit for the number of replicas to which the autoscaler can scale up.
    maxReplicas: 5
    # -- Average CPU utilization across all relevant pods, a percentage of the requested value of the resource for the pods. Setting `targetCPUUtilizationPercentage` to 0 will disable CPU scaling.
    targetCPUUtilizationPercentage: 0
    # -- Average Memory utilization across all relevant pods, a percentage of the requested value of the resource for the pods. Setting `targetMemoryUtilizationPercentage` to 0 will disable Memory scaling.
    targetMemoryUtilizationPercentage: 80

    scaleDown:
      # -- List of policies to determine the scale-down behavior.
      policies: []
        # - type: Pods
        #   value: 4
        #   periodSeconds: 60
      # -- Determines which of the provided scaling-down policies to apply if multiple are specified.
      selectPolicy: Max
      # -- The duration that the autoscaling mechanism should look back on to make decisions about scaling down.
      stabilizationWindowSeconds: 300

    scaleUp:
      # -- List of policies to determine the scale-up behavior.
      policies: []
        # - type: Pods
        #   value: 4
        #   periodSeconds: 60
      # -- Determines which of the provided scaling-up policies to apply if multiple are specified.
      selectPolicy: Max
      # -- The duration that the autoscaling mechanism should look back on to make decisions about scaling up.
      stabilizationWindowSeconds: 0

  # -- Affinity configuration for pods.
  affinity: {}

  volumes:
    # -- Extra volumes to add to the Grafana Agent pod.
    extra:
      - name: gfagent-tmp
        emptyDir: {}

  # -- volumeClaimTemplates to add when controller.type is 'statefulset'.
  volumeClaimTemplates: []

  ## -- Additional init containers to run.
  ## ref: https://kubernetes.io/docs/concepts/workloads/pods/init-containers/
  ##
  initContainers: []

  # -- Additional containers to run alongside the agent container and initContainers.
  extraContainers: []

service:
  # -- Creates a Service for the controller's pods.
  enabled: true
  # -- Service type
  type: ClusterIP
  # -- NodePort port. Only takes effect when `service.type: NodePort`
  nodePort: 31128
  # -- Cluster IP, can be set to None, empty "" or an IP address
  clusterIP: ''
  # -- Value for internal traffic policy. 'Cluster' or 'Local'
  internalTrafficPolicy: Cluster
  annotations: {}
    # cloud.google.com/load-balancer-type: Internal

serviceMonitor:
  enabled: false
  # -- Additional labels for the service monitor.
  additionalLabels: {}
  # -- Scrape interval. If not set, the Prometheus default scrape interval is used.
  interval: ""
  # -- MetricRelabelConfigs to apply to samples after scraping, but before ingestion.
  # ref: https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#relabelconfig
  metricRelabelings: []
  # - action: keep
  #   regex: 'kube_(daemonset|deployment|pod|namespace|node|statefulset).+'
  #   sourceLabels: [__name__]

  # -- Customize tls parameters for the service monitor
  tlsConfig: {}

  # -- RelabelConfigs to apply to samples before scraping
  # ref: https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#relabelconfig
  relabelings: []
  # - sourceLabels: [__meta_kubernetes_pod_node_name]
  #   separator: ;
  #   regex: ^(.*)$
  #   targetLabel: nodename
  #   replacement: $1
  #   action: replace
ingress:
  # -- Enables ingress for the agent (faro port)
  enabled: false
  # For Kubernetes >= 1.18 you should specify the ingress-controller via the field ingressClassName
  # See https://kubernetes.io/blog/2020/04/02/improvements-to-the-ingress-api-in-kubernetes-1.18/#specifying-the-class-of-an-ingress
  # ingressClassName: nginx
  # Values can be templated
  annotations:
    {}
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  labels: {}
  path: /
  faroPort: 12347

  # pathType is only for k8s >= 1.18
  pathType: Prefix

  hosts:
    - chart-example.local
  ## Extra paths to prepend to every host configuration. This is useful when working with annotation based services.
  extraPaths: []
  # - path: /*
  #   backend:
  #     serviceName: ssl-redirect
  #     servicePort: use-annotation
  ## Or for k8s > 1.19
  # - path: /*
  #   pathType: Prefix
  #   backend:
  #     service:
  #       name: ssl-redirect
  #       port:
  #         name: use-annotation

  tls: []
  #  - secretName: chart-example-tls
  #    hosts:
  #      - chart-example.local
```
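Note that with `agent.configMap.create: false`, the chart expects a pre-existing ConfigMap named `grafana-agent-config` containing the key `config.river`. A minimal sketch of such a manifest, with a placeholder River config (the reporter's actual contents were not shared):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-agent-config   # must match agent.configMap.name above
data:
  config.river: |              # must match agent.configMap.key above
    // Placeholder flow-mode config; the real config was not shared.
    logging {
      level  = "info"
      format = "logfmt"
    }
```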

System information

No response

Software version

Grafana Agent 0.40.3

Configuration

No response

Logs

Readiness probe failed: Get "http://10.123.238.126:80/-/ready": dial tcp 10.123.238.126:80: connect: connection refused

hainenber commented 7 months ago

Can you check whether the grafana/agent image can be pulled into your cluster and the pod becomes operational? Some pod logs would be helpful :D

rfratto commented 7 months ago

Hi there :wave:

On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only receive bug and security fixes until its end-of-life around November 1, 2025.

To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)

PatMis16 commented 7 months ago

Hi Guys,

Thanks for all the responses. Got it working now. For some reason I had to change the port to 12345.
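For reference, 12345 is the usual default HTTP listen port of the Grafana Agent flow binary, so the fix presumably amounted to a values change along these lines (a sketch; the exact diff was not shared):

```yaml
agent:
  # Changed from 80; 12345 is the flow binary's usual default HTTP port.
  listenPort: 12345
```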

What is the difference between the Grafana Agent and Grafana Alloy?

BR, Patrick

rfratto commented 7 months ago

> Thanks for all the responses. Got it working now. For some reason I had to change the port to 12345.

That's strange; it sounds like it might be a bug somewhere in the chart, since the readinessProbe should be bound to the listenPort setting in the Helm chart.

> What is the difference between the Grafana Agent and Grafana Alloy?

Grafana Alloy is the evolution of flow mode, renamed in part to make a splash for its 1.0 release (Alloy launched as 1.0). There are minor differences to be aware of when migrating, but Grafana Alloy is flow mode, with the same concepts, configuration, and components.

tpaschalis commented 7 months ago

That's weird, as the readiness probe does indeed reuse the listenPort.

https://github.com/grafana/alloy/blob/3201389252d2c011bee15ace0c9f4cdbcb978f9f/operations/helm/charts/alloy/templates/containers/_agent.yaml#L50-L56
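For readers without the link open: rendered with the values above, that template should produce a probe along these lines (a sketch of the expected output, matching the URL in the reported log, not the verbatim template):

```yaml
readinessProbe:
  httpGet:
    path: /-/ready
    port: 80      # agent.listenPort from the values file
    scheme: HTTP  # agent.listenScheme from the values file
```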

Could it be related to this change from last month that merged the agent and alloy keys, with it picking up some other default?

https://github.com/grafana/alloy/blob/3201389252d2c011bee15ace0c9f4cdbcb978f9f/operations/helm/charts/alloy/templates/containers/_agent.yaml#L2

I checked with both keys present: the .Values.agent listenPort takes precedence, so I wouldn't expect any issues here.
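In other words, with a hypothetical values file setting both keys, the probe should render with the `agent` value:

```yaml
# Hypothetical values to illustrate the precedence check:
agent:
  listenPort: 8080   # this one wins
alloy:
  listenPort: 9090   # ignored while the agent key is present
```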

rfratto commented 7 months ago

> Could it be related to this change from last month that merged the agent and alloy keys, with it picking up some other default?

I don't think so, since the issue reported here was for the Agent Helm chart, not Alloy's.

github-actions[bot] commented 6 months ago

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it. If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue. The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity. Thank you for your contributions!