grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Distributor/Ingester/Querier error caller=tcp_transport.go:314 msg="TCPTransport: unknown message type" msgType=G #4773

chrism417 opened this issue 2 years ago (status: Open)

chrism417 commented 2 years ago

Since upgrading from 2.3.0 to 2.4.1, I've noticed this new error showing up in the distributor, ingester, and querier logs in two separate clusters. It was not present prior to the upgrade:

level=error ts=2021-11-17T14:06:06.568364079Z caller=tcp_transport.go:314 msg="TCPTransport: unknown message type" msgType=G

Steps to reproduce the behavior:

Upgrade to version 2.4.1

Expected behavior

What message is this referring to? Is this an error that can be ignored? It wasn't seen in 2.3.0.

Environment:

Kubernetes/helm

Screenshots, Promtail config, or terminal output

The following is repeated constantly:

level=error ts=2021-11-17T14:05:54.20841118Z caller=tcp_transport.go:314 msg="TCPTransport: unknown message type" msgType=G
level=error ts=2021-11-17T14:05:59.603325534Z caller=tcp_transport.go:314 msg="TCPTransport: unknown message type" msgType=G
level=error ts=2021-11-17T14:06:24.207889934Z caller=tcp_transport.go:314 msg="TCPTransport: unknown message type" msgType=G

chaudum commented 2 years ago

This log message is emitted when an HTTP request is executed against the memberlist port. Maybe an incorrectly configured readiness check?
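
A probe or scrape pointed at the gossip port would produce exactly this pattern: memberlist's TCP transport reads a single message-type byte from each incoming connection, and the reported G matches the first byte of an HTTP GET request line. A hypothetical example of that kind of misconfiguration (port numbers are illustrative, assuming the usual 3100/7946 split):

# Hypothetical misconfiguration: the kubelet's HTTP GET lands on the raw
# memberlist TCP transport and is logged as "unknown message type" msgType=G.
readinessProbe:
  httpGet:
    path: /ready
    port: 7946        # memberlist/gossip port; wrong target for an HTTP probe
# The probe should instead target the HTTP server port, e.g.:
#   httpGet:
#     path: /ready
#     port: 3100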

Could you share your configuration @chrism417 ?

stale[bot] commented 2 years ago

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We are doing our best to respond, organize, and prioritize all issues, but it can be a challenging task; our sincere apologies if you find yourself at the mercy of the stalebot.

sharathfeb12 commented 2 years ago

I'm seeing the same error in ingester pod logs on v2.5.0. Here's my config:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/component: ingester
    app.kubernetes.io/instance: loki
    app.kubernetes.io/name: ingester
  name: ingester
  namespace: aiops-logs
spec:
  podManagementPolicy: OrderedReady
  replicas: 20
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/component: ingester
      app.kubernetes.io/instance: loki
      app.kubernetes.io/name: ingester
      loki.grafana.com/gossip: "true"
  serviceName: ingester
  template:
    metadata:
      labels:
        app.kubernetes.io/component: ingester
        app.kubernetes.io/instance: loki
        app.kubernetes.io/name: ingester
        loki.grafana.com/gossip: "true"
    spec:
      containers:
      - args:
        - -target=ingester
        - -config.file=/etc/loki/config/config.yaml
        - -limits.per-user-override-config=/etc/loki/config/overrides.yaml
        - -log.level=info
        - -distributor.replication-factor=3
        env:
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: /var/secrets/google/key.json
        image: docker.io/grafana/loki:2.5.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 10
          httpGet:
            path: /metrics
            port: 3100
            scheme: HTTP
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: ingester
        ports:
        - containerPort: 3100
          name: metrics
          protocol: TCP
        - containerPort: 9095
          name: grpc
          protocol: TCP
        - containerPort: 7946
          name: gossip-ring
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ready
            port: 3100
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "2"
            memory: 4Gi
          requests:
            cpu: 250m
            memory: 3Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/secrets/google
          name: google-cloud-key
        - mountPath: /etc/loki/config/
          name: config
        - mountPath: /data
          name: storage
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 10001
        runAsGroup: 10001
        runAsNonRoot: true
        runAsUser: 10001
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: loki
        name: config
      - name: google-cloud-key
        secret:
          defaultMode: 420
          secretName: loki-service-account-key
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: ingester
        app.kubernetes.io/instance: loki
        app.kubernetes.io/name: ingester
      name: storage
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 16Gi
      storageClassName: ssd-class
      volumeMode: Filesystem
    status:
      phase: Pending

sharathfeb12 commented 2 years ago

@chaudum : I checked the readiness port and it's configured correctly on port 3100. Could this be related to annotation/label loki.grafana.com/gossip: "true"?

chaudum commented 2 years ago

@chaudum : I checked the readiness port and it's configured correctly on port 3100. Could this be related to annotation/label loki.grafana.com/gossip: "true"?

Where does the label loki.grafana.com/gossip come from? I assume it is used as a selector in a Service for memberlist, right?

It could be that, for some reason, a component points to a DNS address from the wrong Service.

sharathfeb12 commented 2 years ago

That's right. I'm not able to understand where the DNS could go wrong here.

Below is my K8s service for gossip-ring:

apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/instance: loki
  name: gossip-ring
  namespace: logs
spec:
  clusterIP: None
  ports:
  - name: gossip
    port: 7946
    protocol: TCP
    targetPort: 7946
  selector:
    loki.grafana.com/gossip: "true"
  sessionAffinity: None
  type: ClusterIP

Please find the configmap on the Loki side below:

apiVersion: v1
data:
  config.yaml: |-
    "auth_enabled": true
    "compactor":
      "compaction_interval": "10m"
      "shared_store": "gcs"
      "working_directory": "/data/loki/compactor"
      "retention_enabled": true
    "distributor":
      "ring":
        "kvstore":
          "store": "memberlist"
    "frontend":
      "compress_responses": false
      "max_outstanding_per_tenant": 2048
      "tail_proxy_url": "http://querier.ep-dev-logs.svc.cluster.local:3100"
    "frontend_worker":
      "frontend_address": "queryfrontend.ep-dev-logs.svc.cluster.local:9095"
      "grpc_client_config":
        "max_send_msg_size": 1104857600
      "parallelism": 256
    "ingester":
      "chunk_block_size": 262144
      "chunk_target_size": 1536000
      "chunk_encoding": "snappy"
      "chunk_idle_period": "30m"
      "autoforget_unhealthy": true
      "lifecycler":
        "heartbeat_period": "1m"
        "interface_names":
        - "eth0"
        "num_tokens": 512
        "ring":
          "kvstore":
            "store": "memberlist"
      "max_transfer_retries": 0
      "wal":
        "enabled": true
        "dir": "data"
    "ingester_client":
      "grpc_client_config":
        "max_recv_msg_size": 1104857600
        "max_send_msg_size": 1104857600
        "backoff_on_ratelimits": true
        "backoff_config":
          "min_period": "1s"
          "max_period": "32s"
          "max_retries": 10
      "remote_timeout": "10s"
    "limits_config":
      "enforce_metric_name": false
      "ingestion_burst_size_mb": 512
      "ingestion_rate_mb": 256
      "per_stream_rate_limit": "128MB"
      "per_stream_rate_limit_burst": "64MB"
      "ingestion_rate_strategy": "global"
      "max_cache_freshness_per_query": "10m"
      "max_global_streams_per_user": 100000
      "max_query_length": "12000h"
      "max_query_parallelism": 128
      "max_query_series": 10000
      "max_streams_per_user": 1000000
      "reject_old_samples": false
      "reject_old_samples_max_age": "720h"
      "max_entries_limit_per_query": 100000
      "cardinality_limit": 100000
      "retention_period": 48h
      "split_queries_by_interval": "30m"
    "memberlist":
      "abort_if_cluster_join_fails": false
      "bind_port": 7946
      "join_members":
      - "gossip-ring.ep-dev-logs.svc.cluster.local:7946"
      "max_join_backoff": "1m"
      "max_join_retries": 10
      "min_join_backoff": "1s"
    "querier":
      "engine":
        "timeout": "15m"
      "extra_query_delay": "0s"
      "query_ingesters_within": "3h"
      "query_timeout": "15m"
      "tail_max_duration": "15m"
      "max_concurrent": 256
    "query_range":
      "align_queries_with_step": true
      "cache_results": true
      "max_retries": 5
      "parallelise_shardable_queries": true
      "results_cache":
        "cache":
          "memcached":
            "expiration": "86400s"
            "batch_size": 1024
            "parallelism": 300
          "memcached_client":
            "host": "memcached-frontend.ep-dev-logs.svc"
            "service": "memcached"
    "schema_config":
      "configs":
      - "from": "2020-10-01"
        "index":
          "period": "24h"
          "prefix": "loki_index_"
        "object_store": "gcs"
        "schema": "v11"
        "store": "boltdb-shipper"
    "server":
      "graceful_shutdown_timeout": "5s"
      "grpc_server_max_concurrent_streams": 1000
      "grpc_server_max_recv_msg_size": 1104857600
      "grpc_server_max_send_msg_size": 1104857600
      "http_listen_port": 3100
      "http_server_idle_timeout": "3m"
      "http_server_write_timeout": "1m"
      "http_server_read_timeout": "15m"
    "storage_config":
      "boltdb_shipper":
        "active_index_directory": "/data/loki/index"
        "cache_location": "/data/loki/index_cache"
        "cache_ttl": "24h"
        "resync_interval": "5m"
        "shared_store": "gcs"
        "index_gateway_client":
          "server_address": "dns:///indexgateway:9095"
      "gcs":
        "bucket_name": ep-dev-logs
      "index_queries_cache_config":
        "memcached":
          "expiration": "86400s"
          "batch_size": 1024
          "parallelism": 256
        "memcached_client":
          "host": "memcached-index-queries.ep-dev-logs.svc"
          "service": "memcached"
    "chunk_store_config":
      "chunk_cache_config":
        "memcached":
          "expiration": "86400s"
          "batch_size": 1024
          "parallelism": 256
        "memcached_client":
          "host": "memcached-chunks.ep-dev-logs.svc"
          "service": "memcached"
      "write_dedupe_cache_config":
        "memcached":
            "expiration": "86400s"
            "batch_size": 1024
            "parallelism": 256
        memcached_client:
            "consistent_hash": true
            "host": "memcached-index-writes.ep-dev-logs.svc."
            "service": memcached
  overrides.yaml: '{}'
kind: ConfigMap
metadata:
  labels:
    app.kubernetes.io/team: ep-dev
  name: loki
  namespace: logs

sharathfeb12 commented 2 years ago

@chaudum : Anything in the config that looks wrong?

Here is a graph of the same errors:

[screenshot]
sharathfeb12 commented 2 years ago

Below is my Ring status:

[screenshot: ring status]

When I look at the logs of pods that are in an Unhealthy state, I find these messages:

level=error ts=2022-06-14T05:19:11.302133608Z caller=tcp_transport.go:318 component="memberlist TCPTransport" msg="unknown message type" msgType=G remote=172.20.15.85:55002
level=error ts=2022-06-14T05:19:20.290404752Z caller=tcp_transport.go:318 component="memberlist TCPTransport" msg="unknown message type" msgType=G remote=172.20.8.89:56878
level=error ts=2022-06-14T05:19:41.302077887Z caller=tcp_transport.go:318 component="memberlist TCPTransport" msg="unknown message type" msgType=G remote=172.20.15.85:36272
level=error ts=2022-06-14T05:19:50.290767375Z caller=tcp_transport.go:318 component="memberlist TCPTransport" msg="unknown message type" msgType=G remote=172.20.8.89:38912

When I looked at the IP addresses present in the logs, they are Prometheus pod IPs. Why would the ingester try to contact these pods? It looks like the pod is trying to establish memberlist connections with the wrong pods, hence this error?

The Prometheus pods do not have the annotation/label that is needed for the gossip-ring Service, so I don't understand why the ingester pods are even attempting this connection.

chaudum commented 2 years ago

@chaudum : Anything in the config that looks wrong?

I don't see anything wrong with your config. The only thing I noticed is that the address in join_members has a different namespace than the Service, but I assume that comes from copying the Service from the wrong namespace.

When I looked at the IP addresses present in the logs, they are Prometheus pod IPs. Why would the ingester try to contact these pods? It looks like the pod is trying to establish memberlist connections with the wrong pods, hence this error?

The Prometheus pods do not have the annotation/label that is needed for the gossip-ring Service, so I don't understand why the ingester pods are even attempting this connection.

Thanks for your further investigation. That is indeed odd and I cannot really explain it, tbh. But it's actually the other way round: Prometheus tries to access the ingesters, which then log the warning. The explanation for that would be that the Prometheus scrape config points at the wrong service.

I am a bit worried that this could cause ingesters to become unhealthy, though.

There is a set of metrics prefixed loki_memberlist_tcp_transport_* that you can use to monitor the memberlist transport.

sharathfeb12 commented 2 years ago

I fixed the Prometheus scrape config by adding annotations to the pods and scraping just the required port.

This reduced the rate of these entries on the ingesters, but I do still see these messages.
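
For reference, the change amounts to something like the following on the ingester pod template (values are illustrative and assume the common prometheus.io/* annotation convention used by a kubernetes-pods scrape job):

# Hypothetical pod-template annotations for the ingester StatefulSet above;
# they restrict scraping to the HTTP port instead of every declared container port.
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "3100"       # HTTP/metrics port only
        prometheus.io/path: "/metrics"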

M-JobPixel commented 2 years ago

I just installed loki and prometheus from the helm charts, and am seeing these errors in my loki logs.

As @sharathfeb12 says, it is the prometheus-server pod IP in the error message.

Looking at the scrape config in the prometheus helm chart, I only see

https://github.com/prometheus-community/helm-charts/blob/61798838797098f2add9c35f4798f3edb2a7122f/charts/prometheus/values.yaml#L1452-L1453

Which doesn't seem to me like it should be scraping the ingester ports.

I'll try and dig into this further

stale[bot] commented 2 years ago

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We are doing our best to respond, organize, and prioritize all issues, but it can be a challenging task; our sincere apologies if you find yourself at the mercy of the stalebot.

aug70 commented 2 years ago

I had to explicitly set the correct port in the Helm values (which I thought should be defaulted to 3100 in the first place?!)

loki:
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "3100"

Otherwise the Loki Helm chart with default values creates the pod with the annotations below:

prometheus.io/port: http-metrics
prometheus.io/scrape: true

This causes Prometheus to try to scrape ports 7946 and 9095, and the error comes up.
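
For anyone hitting this with the community Prometheus chart: the kubernetes-pods job uses role: pod service discovery, which creates one target per declared container port, and it only rewrites the target address when prometheus.io/port resolves to a number. A sketch of the relevant relabeling, paraphrased from the chart defaults (see the linked values.yaml for the exact rules):

# Paraphrased sketch of the kubernetes-pods scrape job; not copied verbatim.
- job_name: kubernetes-pods
  kubernetes_sd_configs:
    - role: pod                 # one target per declared container port
  relabel_configs:
    # Keep only pods that opt in via the scrape annotation.
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    # Rewrite the target address to the annotated port. This only matches when
    # prometheus.io/port is numeric; a value like "http-metrics" falls through,
    # so every declared port (3100, 9095, 7946) stays a scrape target.
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__

With a numeric prometheus.io/port, only port 3100 gets scraped and the memberlist port is left alone.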