chrism417 opened this issue 2 years ago
This log message is emitted when an HTTP request is executed against the memberlist port. Maybe an incorrectly configured readiness check?
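For illustration, a minimal sketch of the kind of misconfiguration I mean (hypothetical, not taken from any manifest in this thread): an HTTP probe aimed at the gossip port instead of Loki's HTTP port would hit the memberlist listener and produce exactly this log line.

```yaml
# Hypothetical misconfigured probe: port 7946 is the memberlist/gossip
# listener, so the kubelet's HTTP GET arrives there as an unknown message.
readinessProbe:
  httpGet:
    path: /ready
    port: 7946  # should be Loki's HTTP port, 3100
```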
Could you share your configuration, @chrism417?
Seeing the same error in ingester pod logs on v2.5.0. Here's my config:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/component: ingester
    app.kubernetes.io/instance: loki
    app.kubernetes.io/name: ingester
  name: ingester
  namespace: aiops-logs
spec:
  podManagementPolicy: OrderedReady
  replicas: 20
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/component: ingester
      app.kubernetes.io/instance: loki
      app.kubernetes.io/name: ingester
      loki.grafana.com/gossip: "true"
  serviceName: ingester
  template:
    metadata:
      labels:
        app.kubernetes.io/component: ingester
        app.kubernetes.io/instance: loki
        app.kubernetes.io/name: ingester
        loki.grafana.com/gossip: "true"
    spec:
      containers:
      - args:
        - -target=ingester
        - -config.file=/etc/loki/config/config.yaml
        - -limits.per-user-override-config=/etc/loki/config/overrides.yaml
        - -log.level=info
        - -distributor.replication-factor=3
        env:
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: /var/secrets/google/key.json
        image: docker.io/grafana/loki:2.5.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 10
          httpGet:
            path: /metrics
            port: 3100
            scheme: HTTP
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: ingester
        ports:
        - containerPort: 3100
          name: metrics
          protocol: TCP
        - containerPort: 9095
          name: grpc
          protocol: TCP
        - containerPort: 7946
          name: gossip-ring
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ready
            port: 3100
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "2"
            memory: 4Gi
          requests:
            cpu: 250m
            memory: 3Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/secrets/google
          name: google-cloud-key
        - mountPath: /etc/loki/config/
          name: config
        - mountPath: /data
          name: storage
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 10001
        runAsGroup: 10001
        runAsNonRoot: true
        runAsUser: 10001
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: loki
        name: config
      - name: google-cloud-key
        secret:
          defaultMode: 420
          secretName: loki-service-account-key
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: ingester
        app.kubernetes.io/instance: loki
        app.kubernetes.io/name: ingester
      name: storage
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 16Gi
      storageClassName: ssd-class
      volumeMode: Filesystem
    status:
      phase: Pending
```
@chaudum: I checked the readiness port and it is configured correctly on port 3100. Could this be related to the annotation/label `loki.grafana.com/gossip: "true"`?
Where does the label `loki.grafana.com/gossip` come from? I assume it is used as a selector in a `Service` for memberlist, right?
It could be that, for some reason, a component points to a DNS address from a wrong `Service`.
That's right. I'm not able to understand where DNS could go wrong here, though.
Below is my K8s Service for the gossip ring:
```yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/instance: loki
  name: gossip-ring
  namespace: logs
spec:
  clusterIP: None
  ports:
  - name: gossip
    port: 7946
    protocol: TCP
    targetPort: 7946
  selector:
    loki.grafana.com/gossip: "true"
  sessionAffinity: None
  type: ClusterIP
```
Please find the configmap on the Loki side below:
```yaml
apiVersion: v1
data:
  config.yaml: |-
    "auth_enabled": true
    "compactor":
      "compaction_interval": "10m"
      "shared_store": "gcs"
      "working_directory": "/data/loki/compactor"
      "retention_enabled": true
    "distributor":
      "ring":
        "kvstore":
          "store": "memberlist"
    "frontend":
      "compress_responses": false
      "max_outstanding_per_tenant": 2048
      "tail_proxy_url": "http://querier.ep-dev-logs.svc.cluster.local:3100"
    "frontend_worker":
      "frontend_address": "queryfrontend.ep-dev-logs.svc.cluster.local:9095"
      "grpc_client_config":
        "max_send_msg_size": 1104857600
      "parallelism": 256
    "ingester":
      "chunk_block_size": 262144
      "chunk_target_size": 1536000
      "chunk_encoding": "snappy"
      "chunk_idle_period": "30m"
      "autoforget_unhealthy": true
      "lifecycler":
        "heartbeat_period": "1m"
        "interface_names":
        - "eth0"
        "num_tokens": 512
        "ring":
          "kvstore":
            "store": "memberlist"
      "max_transfer_retries": 0
      "wal":
        "enabled": true
        "dir": "data"
    "ingester_client":
      "grpc_client_config":
        "max_recv_msg_size": 1104857600
        "max_send_msg_size": 1104857600
        "backoff_on_ratelimits": true
        "backoff_config":
          "min_period": "1s"
          "max_period": "32s"
          "max_retries": 10
      "remote_timeout": "10s"
    "limits_config":
      "enforce_metric_name": false
      "ingestion_burst_size_mb": 512
      "ingestion_rate_mb": 256
      "per_stream_rate_limit": "128MB"
      "per_stream_rate_limit_burst": "64MB"
      "ingestion_rate_strategy": "global"
      "max_cache_freshness_per_query": "10m"
      "max_global_streams_per_user": 100000
      "max_query_length": "12000h"
      "max_query_parallelism": 128
      "max_query_series": 10000
      "max_streams_per_user": 1000000
      "reject_old_samples": false
      "reject_old_samples_max_age": "720h"
      "max_entries_limit_per_query": 100000
      "cardinality_limit": 100000
      "retention_period": 48h
      "split_queries_by_interval": "30m"
    "memberlist":
      "abort_if_cluster_join_fails": false
      "bind_port": 7946
      "join_members":
      - "gossip-ring.ep-dev-logs.svc.cluster.local:7946"
      "max_join_backoff": "1m"
      "max_join_retries": 10
      "min_join_backoff": "1s"
    "querier":
      "engine":
        "timeout": "15m"
      "extra_query_delay": "0s"
      "query_ingesters_within": "3h"
      "query_timeout": "15m"
      "tail_max_duration": "15m"
      "max_concurrent": 256
    "query_range":
      "align_queries_with_step": true
      "cache_results": true
      "max_retries": 5
      "parallelise_shardable_queries": true
      "results_cache":
        "cache":
          "memcached":
            "expiration": "86400s"
            "batch_size": 1024
            "parallelism": 300
          "memcached_client":
            "host": "memcached-frontend.ep-dev-logs.svc"
            "service": "memcached"
    "schema_config":
      "configs":
      - "from": "2020-10-01"
        "index":
          "period": "24h"
          "prefix": "loki_index_"
        "object_store": "gcs"
        "schema": "v11"
        "store": "boltdb-shipper"
    "server":
      "graceful_shutdown_timeout": "5s"
      "grpc_server_max_concurrent_streams": 1000
      "grpc_server_max_recv_msg_size": 1104857600
      "grpc_server_max_send_msg_size": 1104857600
      "http_listen_port": 3100
      "http_server_idle_timeout": "3m"
      "http_server_write_timeout": "1m"
      "http_server_read_timeout": "15m"
    "storage_config":
      "boltdb_shipper":
        "active_index_directory": "/data/loki/index"
        "cache_location": "/data/loki/index_cache"
        "cache_ttl": "24h"
        "resync_interval": "5m"
        "shared_store": "gcs"
        "index_gateway_client":
          "server_address": "dns:///indexgateway:9095"
      "gcs":
        "bucket_name": ep-dev-logs
      "index_queries_cache_config":
        "memcached":
          "expiration": "86400s"
          "batch_size": 1024
          "parallelism": 256
        "memcached_client":
          "host": "memcached-index-queries.ep-dev-logs.svc"
          "service": "memcached"
    "chunk_store_config":
      "chunk_cache_config":
        "memcached":
          "expiration": "86400s"
          "batch_size": 1024
          "parallelism": 256
        "memcached_client":
          "host": "memcached-chunks.ep-dev-logs.svc"
          "service": "memcached"
      "write_dedupe_cache_config":
        "memcached":
          "expiration": "86400s"
          "batch_size": 1024
          "parallelism": 256
        "memcached_client":
          "consistent_hash": true
          "host": "memcached-index-writes.ep-dev-logs.svc."
          "service": "memcached"
  overrides.yaml: '{}'
kind: ConfigMap
metadata:
  labels:
    app.kubernetes.io/team: ep-dev
  name: loki
  namespace: logs
```
@chaudum: Anything in the config that looks wrong?
Here is my graph of the same, and below is my ring status.
When I look at the logs of pods that are in the Unhealthy state, I find these messages:
```
level=error ts=2022-06-14T05:19:11.302133608Z caller=tcp_transport.go:318 component="memberlist TCPTransport" msg="unknown message type" msgType=G remote=172.20.15.85:55002
level=error ts=2022-06-14T05:19:20.290404752Z caller=tcp_transport.go:318 component="memberlist TCPTransport" msg="unknown message type" msgType=G remote=172.20.8.89:56878
level=error ts=2022-06-14T05:19:41.302077887Z caller=tcp_transport.go:318 component="memberlist TCPTransport" msg="unknown message type" msgType=G remote=172.20.15.85:36272
level=error ts=2022-06-14T05:19:50.290767375Z caller=tcp_transport.go:318 component="memberlist TCPTransport" msg="unknown message type" msgType=G remote=172.20.8.89:38912
```
When I looked at the IP addresses present in the logs, they are Prometheus pod IPs. Why would the ingester try to contact these pods? It looks like the pod is trying to establish memberlist connections with the wrong pods, hence this error.
The Prometheus pods do not have the label that is needed for the gossip ring, so I don't understand why the ingester pods are even attempting this connection.
> @chaudum: Anything in the config that looks wrong?
I don't see anything wrong with your config. The only thing I noticed is that the address in `join_members` has a different namespace than the `Service`, but I assume that comes from copying the service from the wrong namespace.
> When I looked at the IP addresses present in the logs, they are Prometheus pod IPs. Why would the ingester try to contact these pods? It looks like the pod is trying to establish memberlist connections with the wrong pods, hence this error.
> The Prometheus pods do not have the label that is needed for the gossip ring, so I don't understand why the ingester pods are even attempting this connection.
Thanks for your further investigation. That is indeed odd and I cannot really explain it, tbh. But it's actually the other way round: Prometheus tries to access the ingesters, which then log the warning. The explanation for that would be that the scrape config from Prometheus points to a wrong service.
I am a bit worried that this could cause ingesters to become unhealthy, though.
There is a set of metrics prefixed `loki_memberlist_tcp_transport_*` that you can use to monitor the memberlist transport.
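A minimal alerting sketch built on those metrics; the exact counter name below is an assumption on my part, so confirm it against the `loki_memberlist_tcp_transport_*` series in your /metrics output before using it:

```yaml
# Sketch of a Prometheus rule; the metric name is assumed, not verified.
groups:
- name: loki-memberlist
  rules:
  - alert: LokiMemberlistTransportErrors
    expr: sum(rate(loki_memberlist_tcp_transport_packets_received_errors_total[5m])) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Loki memberlist transport is receiving malformed packets
```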
I fixed the Prometheus scrape config by adding annotations to the pods and scraping just the required port. This reduced the rate of these entries on the ingester, but I do still see these messages.
I just installed Loki and Prometheus from the Helm charts, and am seeing these errors in my Loki logs.
As @sharathfeb12 says, it is the prometheus-server pod IP in the error message.
Looking at the scrape config in the Prometheus Helm chart, I only see [scrape config snippet omitted], which doesn't seem to me like it should be scraping the ingester ports.
I'll try and dig into this further.
I had to explicitly set the correct port in the Helm values (which I thought should default to 3100 in the first place?!):

```yaml
loki:
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "3100"
```
Otherwise the Loki Helm chart with default values creates the pod with the annotations below:

```yaml
prometheus.io/port: http-metrics
prometheus.io/scrape: "true"
```
This causes Prometheus to try to scrape ports 7946 and 9095, and the error comes up.
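For context, the annotation-driven `kubernetes-pods` job in the Prometheus chart rewrites the target address roughly like this (a paraphrased sketch, not the chart verbatim):

```yaml
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  # The replacement below only fires when the annotation value is numeric.
  # A value like "http-metrics" fails the (\d+) capture, __address__ is left
  # untouched, and pod discovery then emits one target per declared
  # containerPort, including 9095 (gRPC) and the gossip port 7946.
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
```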
Since upgrading from 2.3.0 to 2.4.1, I've noticed in two separate clusters this new error showing up in the distributor, ingester, and querier logs. This was not present prior to the upgrade:

```
level=error ts=2021-11-17T14:06:06.568364079Z caller=tcp_transport.go:314 msg="TCPTransport: unknown message type" msgType=G
```
Steps to reproduce the behavior:
1. Upgrade to version 2.4.1

Expected behavior:
What message is this referring to? Is this an error that can be ignored? It wasn't seen in 2.3.0.

Environment:
- Kubernetes/Helm

Screenshots, Promtail config, or terminal output:
The following is repeated constantly:

```
level=error ts=2021-11-17T14:05:54.20841118Z caller=tcp_transport.go:314 msg="TCPTransport: unknown message type" msgType=G
level=error ts=2021-11-17T14:05:59.603325534Z caller=tcp_transport.go:314 msg="TCPTransport: unknown message type" msgType=G
level=error ts=2021-11-17T14:06:24.207889934Z caller=tcp_transport.go:314 msg="TCPTransport: unknown message type" msgType=G
```