grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

http2: response body closed #281

Open rfratto opened 5 months ago

rfratto commented 5 months ago

Discussed in https://github.com/grafana/agent/discussions/5967

Originally posted by **cedric-mgx** December 13, 2023

Hi, I'm using the Grafana Agent in flow mode on Kubernetes and see this error popping up all the time for different pods (a few times per minute), but when I check, all the logs for those pods are there.

```
level=warn msg="tailer stopped; will retry" target=monitoring/pod-id:grafana component=loki.source.kubernetes.pods err="http2: response body closed"
```

The agents are installed with the following Helm chart config:

```
fullnameOverride: grafana-agent

# Various agent settings.
agent:
  # -- Mode to run Grafana Agent in. Can be "flow" or "static".
  mode: 'flow'
  configMap:
    # -- Create a new ConfigMap for the config file.
    create: true
    # -- Content to assign to the new ConfigMap. This is passed into `tpl` allowing for templating from values.
    content: |
      logging {
        level  = "warn"
        format = "logfmt"
      }

      // Endpoints
      prometheus.remote_write "mimir_endpoint" {
        external_labels = {
          cluster = env("CLUSTER_NAME"),
        }
        endpoint {
          url = "http://mimir-nginx.monitoring.svc/api/v1/push"
          queue_config {
            capacity             = 5000
            max_shards           = 100
            max_samples_per_send = 1000
            batch_send_deadline  = "1m0s"
          }
          metadata_config { }
          write_relabel_config {
            source_labels = ["__name__"]
            regex         = "(etcd_request_duration_seconds_bucket|etcd_request_duration_seconds_bucket|apiserver_response_sizes_bucket|prober_probe_duration_seconds_bucket|apiserver_storage_list_duration_seconds_bucket|apiserver_watch_events_sizes_bucket|container_memory_failures_total|container_blkio_device_usage_total)"
            action        = "drop"
          }
        }
      }

      loki.write "loki_endpoint" {
        endpoint {
          url = "http://loki-gateway.monitoring.svc/loki/api/v1/push"
        }
        external_labels = {
          cluster = "observability",
        }
      }

      // Sources
      discovery.kubernetes "pods" {
        role = "pod"
      }
      discovery.kubernetes "nodes" {
        role = "node"
      }
      discovery.kubernetes "services" {
        role = "service"
      }
      discovery.kubernetes "endpoints" {
        role = "endpoints"
      }
      discovery.kubernetes "endpointslices" {
        role = "endpointslice"
      }
      discovery.kubernetes "ingresses" {
        role = "ingress"
      }

      // Metrics
      prometheus.operator.servicemonitors {
        forward_to = [
          prometheus.remote_write.mimir_endpoint.receiver,
        ]
        clustering {
          enabled = true
        }
      }

      // kubelet
      discovery.relabel "metrics_kubelet" {
        targets = discovery.kubernetes.nodes.targets
        rule {
          action       = "replace"
          target_label = "__address__"
          replacement  = "kubernetes.default.svc.cluster.local:443"
        }
        rule {
          source_labels = ["__meta_kubernetes_node_name"]
          regex         = "(.+)"
          action        = "replace"
          replacement   = "/api/v1/nodes/${1}/proxy/metrics"
          target_label  = "__metrics_path__"
        }
        rule {
          source_labels = ["__name__"]
          regex         = ":node_memory_MemAvailable_bytes:sum|alertmanager_alerts|alertmanager_alerts_invalid_total"
          action        = "keep"
        }
      }
      prometheus.scrape "kubelet" {
        scheme = "https"
        tls_config {
          server_name          = "kubernetes"
          ca_file              = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
          insecure_skip_verify = false
        }
        bearer_token_file = "/var/run/secrets/kubernetes.io/serviceaccount/token"
        targets           = discovery.relabel.metrics_kubelet.output
        scrape_interval   = "60s"
        forward_to        = [prometheus.remote_write.mimir_endpoint.receiver]
      }

      // cadvisor
      prometheus.scrape "cadvisor" {
        scheme = "https"
        tls_config {
          server_name          = "kubernetes"
          ca_file              = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
          insecure_skip_verify = false
        }
        bearer_token_file = "/var/run/secrets/kubernetes.io/serviceaccount/token"
        targets           = discovery.relabel.metrics_cadvisor.output
        scrape_interval   = "60s"
        forward_to        = [prometheus.remote_write.mimir_endpoint.receiver]
      }
      discovery.relabel "metrics_cadvisor" {
        targets = discovery.kubernetes.nodes.targets
        rule {
          action       = "replace"
          target_label = "__address__"
          replacement  = "kubernetes.default.svc.cluster.local:443"
        }
        rule {
          source_labels = ["__meta_kubernetes_node_name"]
          regex         = "(.+)"
          action        = "replace"
          replacement   = "/api/v1/nodes/${1}/proxy/metrics/cadvisor"
          target_label  = "__metrics_path__"
        }
      }

      mimir.rules.kubernetes "prometheus_rules" {
        address = "http://mimir-nginx.monitoring.svc"
      }

      // Logging
      // https://grafana.com/docs/agent/latest/flow/reference/components/discovery.kubernetes/
      discovery.relabel "logs" {
        targets = discovery.kubernetes.pods.targets
        rule {
          source_labels = ["__meta_kubernetes_namespace"]
          target_label  = "namespace"
        }
        rule {
          source_labels = ["__meta_kubernetes_pod_container_name"]
          target_label  = "container"
        }
        rule {
          source_labels = ["__meta_kubernetes_pod_name"]
          target_label  = "pod"
        }
        rule {
          source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_name"]
          target_label  = "app"
        }
        rule {
          source_labels = ["__meta_kubernetes_pod_node_name"]
          target_label  = "node_name"
        }
      }
      loki.source.kubernetes "pods" {
        targets    = discovery.relabel.logs.output
        forward_to = [loki.write.loki_endpoint.receiver]
        clustering {
          enabled = true
        }
      }
      loki.source.kubernetes_events "events" {
        log_format = "logfmt"
        forward_to = [loki.write.loki_endpoint.receiver]
      }
    # -- Name of existing ConfigMap to use. Used when create is false.
    name: null
    # -- Key in ConfigMap to get config from.
    key: null

  clustering:
    # -- Deploy agents in a cluster to allow for load distribution. Only
    # applies when agent.mode=flow.
    enabled: true

  # -- Path to where Grafana Agent stores data (for example, the Write-Ahead Log).
  # By default, data is lost between reboots.
  storagePath: /tmp/agent

  # -- Address to listen for traffic on. 0.0.0.0 exposes the UI to other
  # containers.
  listenAddr: 0.0.0.0
  # -- Port to listen for traffic on.
  listenPort: 80
  # -- Base path where the UI is exposed.
  uiPathPrefix: /

  # -- Enables sending Grafana Labs anonymous usage stats to help improve Grafana
  # Agent.
  enableReporting: true

  # -- Extra environment variables to pass to the agent container.
  extraEnv:
    - name: CLUSTER_NAME
      value: observability

  # -- Maps all the keys on a ConfigMap or Secret as environment variables. https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#envfromsource-v1-core
  envFrom: []

  # -- Extra args to pass to `agent run`: https://grafana.com/docs/agent/latest/flow/reference/cli/run/
  extraArgs: []

  # -- Extra ports to expose on the Agent
  extraPorts: []
  # - name: "faro"
  #   port: 12347
  #   targetPort: 12347
  #   protocol: "TCP"

  # -- Security context to apply to the Grafana Agent container.
  securityContext: {}

  # -- Resource requests and limits to apply to the Grafana Agent container.
  resources: {}

serviceAccount:
  # -- Whether to create a service account for the Grafana Agent deployment.
  create: true
  # -- Annotations to add to the created service account.
  annotations: {}

controller:
  type: 'daemonset'
```

Can anyone tell me anything about those logs? It also seems to happen 90% of the time with the same applications, like the grafana or ansible pods, for example.
rfratto commented 5 months ago

As I posted in #5967:

loki.source.kubernetes will regularly refresh connections to the Kubernetes API due to a bug in many versions of Kubernetes where connections can go silent after a log file rolls.

What's likely happening here is that we're improperly logging a warning when we refresh connections instead of logging nothing.
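For illustration only, here is a rough, hypothetical Go sketch of that idea. This is not the actual `loki.source.kubernetes` tailer code; the `tailer` type, `refresh` helper, and `errBodyClosed` sentinel are made up. It just shows the shape of the fix: treat "http2: response body closed" as expected when the tailer itself initiated the refresh, and only warn on genuinely unexpected stream failures.

```go
// Hypothetical sketch (not the real Alloy/Agent tailer): demote the expected
// "http2: response body closed" error to debug when a stop was intentional.
package main

import (
	"errors"
	"log"
	"sync/atomic"
)

// errBodyClosed stands in for the http2 error returned when the response
// body is closed during a deliberate connection refresh.
var errBodyClosed = errors.New("http2: response body closed")

// tailer tracks whether a stop was requested so callers can tell an expected
// refresh apart from an unexpected stream failure.
type tailer struct {
	stopping atomic.Bool
}

// refresh marks the tailer as intentionally stopping and simulates closing
// the log stream, which makes the read loop return errBodyClosed.
func (t *tailer) refresh() error {
	t.stopping.Store(true)
	return errBodyClosed
}

// logTailerError logs at debug level when the error is the expected result
// of an intentional refresh, and at warn level otherwise.
func (t *tailer) logTailerError(err error) {
	if t.stopping.Load() && errors.Is(err, errBodyClosed) {
		log.Printf("level=debug msg=\"tailer restarted after scheduled refresh\" err=%q", err)
		return
	}
	log.Printf("level=warn msg=\"tailer stopped; will retry\" err=%q", err)
}

func main() {
	t := &tailer{}
	// Simulate the periodic refresh loki.source.kubernetes performs to work
	// around connections that go silent after a log file rolls.
	err := t.refresh()
	t.logTailerError(err)
}
```

Until something like that lands, the warning appears to be noise rather than data loss. Setting `level = "error"` in the `logging` block would hide it, though at the cost of hiding other warnings too.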

R-Studio commented 5 months ago

Same issue here, any news?

IRusio commented 4 months ago

It looks like this refresh may in some way be the reason that Windows pods are not sending all of the logs they receive from stdout (we are using the logs explorer to send that data to the output).

github-actions[bot] commented 3 months ago

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it. If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue. The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity. Thank you for your contributions!

rfratto commented 2 months ago

Hi there :wave:

On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only be receiving bug and security fixes until its end-of-life around November 1, 2025.

To make things easier for maintainers, we're in the process of migrating all issues tagged `variant/flow` to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)