Open jlynch93 opened 1 year ago
Created the same issue in Tempo as well: https://github.com/grafana/tempo/issues/2766. You can find the Tempo configs there as well!
We have also faced the same issue, three times so far in the last year.
I've dealt with this intermittently. What I found worked was a fully qualified hostname for the join member plus a cluster_label, like this:
memberlistConfig:
  cluster_label: loki-dev
  join_members:
    - loki-memberlist.loki-dev.svc.cluster.local:7946
On the Mimir side you can do the same thing; pretty sure you can with Tempo as well (a rough Tempo sketch follows the Mimir config below).
memberlist:
  cluster_label: mimir
  join_members:
    - dns+{{ include "mimir.fullname" . }}-gossip-ring.{{ .Release.Namespace }}.svc.{{ .Values.global.clusterDomain }}:{{ include "mimir.memberlistBindPort" . }}
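I haven't verified the Tempo side myself, but since Tempo's memberlist is backed by the same dskit library, a minimal sketch would look like the following (the cluster_label value and join address are illustrative, not taken from a real deployment; check that your Tempo version exposes cluster_label):

memberlist:
  # Use a label unique to the Tempo cluster so gossip packets from the Loki or
  # Mimir rings are rejected instead of being merged into Tempo's ring.
  cluster_label: tempo-dev
  join_members:
    - tempo-gossip-ring.tempo-dev.svc.cluster.local:7946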
Wow, this seems to do the trick, thanks!
But what a mess; how is this gotcha not clearly documented anywhere? Before finding this comment, I saw https://github.com/grafana/loki/issues/10537, which was not very helpful. https://github.com/grafana/mimir/issues/2865 is what pointed me in the right direction to finding this issue.
Edit: Spoke too soon, the problem persists...
Edit 2: Not sure whether it is fixed or not; at least I have not seen the error for a day now. It seemed to take the greater part of the weekend to stabilize.
Describe the bug Tempo ingesters registered to Loki's ingester ring, which caused Loki to go down and stop returning logs.
To Reproduce Steps to reproduce the behavior: Unsure how to reproduce this issue, as it has never happened in our current deployment before.
Expected behavior Loki ingesters should register to Loki's ring and Tempo ingesters should register to Tempo's ring.
Environment: The current deployment uses the tempo-distributed Helm chart on EKS. Attached is the Loki config.
Loki gateway nginx config
Screenshots, Promtail config, or terminal output The only log line that pointed us toward the issue was:
level=warn ts=2023-08-04T14:32:18.386282517Z caller=logging.go:86 traceID=54e1a62fbdffbc09 orgID=fake msg="POST /loki/api/v1/push (500) 4.35479ms Response: \"rpc error: code = Unimplemented desc = unknown service logproto.Pusher\\n\" ws: false; Connection: close; Content-Length: 177219; Content-Type: application/x-protobuf; User-Agent: promtail/2.6.1; "
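For anyone who hits the same "unknown service logproto.Pusher" error outside of Helm: a minimal sketch of how the cluster_label guard from the comments above maps onto a raw Loki config file (the label and join address here are illustrative, not taken from this issue):

memberlist:
  # A label unique to this Loki cluster; gossip packets carrying a different
  # label are dropped, so Tempo or Mimir pods can no longer join this ring.
  cluster_label: loki-dev
  join_members:
    - loki-memberlist.loki-dev.svc.cluster.local:7946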
level=warn ts=2023-08-04T14:32:18.386282517Z caller=logging.go:86 traceID=54e1a62fbdffbc09 orgID=fake msg="POST /loki/api/v1/push (500) 4.35479ms Response: \"rpc error: code = Unimplemented desc = unknown service logproto.Pusher\\n\" ws: false; Connection: close; Content-Length: 177219; Content-Type: application/x-protobuf; User-Agent: promtail/2.6.1; "