grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

Memberlist communication failing constantly #927

Closed: Kampe closed this issue 1 year ago

Kampe commented 3 years ago

Hi, I was told to open another issue since this is persisting; it stems from https://github.com/grafana/tempo/issues/848

Following up: there's no firewall rule blocking 7946; in fact, I've even opened this port directly to the master nodes within GKE explicitly. This doesn't alleviate the issue.

Logs from our ingesters:

level=info ts=2021-09-01T23:41:56.916380772Z caller=main.go:190 msg="initialising OpenTracing tracer"
level=info ts=2021-09-01T23:41:56.941919286Z caller=main.go:109 msg="Starting Tempo" version="(version=, branch=HEAD, revision=e586b3380)"
level=info ts=2021-09-01T23:41:56.943079715Z caller=server.go:239 http=[::]:3200 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2021-09-01T23:41:56.951774106Z caller=module_service.go:59 msg=initialising module=overrides
level=info ts=2021-09-01T23:41:56.952164903Z caller=module_service.go:59 msg=initialising module=server
level=info ts=2021-09-01T23:41:56.952416313Z caller=module_service.go:59 msg=initialising module=memberlist-kv
level=info ts=2021-09-01T23:41:56.951759661Z caller=memberlist_client.go:392 msg="Using memberlist cluster node name" name=tempo-ingester-1-5379fb05
level=info ts=2021-09-01T23:41:56.952075909Z caller=module_service.go:59 msg=initialising module=store
level=info ts=2021-09-01T23:41:56.953824693Z caller=module_service.go:59 msg=initialising module=ingester
level=info ts=2021-09-01T23:41:56.95397702Z caller=ingester.go:321 msg="beginning wal replay"
level=info ts=2021-09-01T23:41:56.954251028Z caller=wal.go:93 msg="beginning replay" file=784eb654-00e1-4b56-a0df-4db0fe81ffd5:single-tenant:v2:none:v1 size=83885
level=info ts=2021-09-01T23:41:56.973922505Z caller=memberlist_client.go:504 msg="joined memberlist cluster" reached_nodes=1
level=info ts=2021-09-01T23:41:56.974403771Z caller=wal.go:120 msg="replay complete" file=784eb654-00e1-4b56-a0df-4db0fe81ffd5:single-tenant:v2:none:v1 duration=20.152642ms
level=info ts=2021-09-01T23:41:56.97548801Z caller=ingester.go:345 msg="wal replay complete"
level=info ts=2021-09-01T23:41:56.98895643Z caller=ingester.go:359 msg="reloading local blocks" tenants=1
level=info ts=2021-09-01T23:41:57.003150026Z caller=instance.go:495 msg="reloaded local block" tenantID=single-tenant block=26c7e1d6-440e-482e-a825-5cd040aa49e6 flushed=2021-09-01T23:38:21Z
level=info ts=2021-09-01T23:41:57.003353917Z caller=app.go:251 msg="Tempo started"
level=info ts=2021-09-01T23:41:57.016440932Z caller=lifecycler.go:591 msg="existing entry found in ring" state=ACTIVE tokens=128 ring=ingester
level=info ts=2021-09-01T23:41:59.060336552Z caller=flush.go:230 msg="completing block" userid=single-tenant blockID=784eb654-00e1-4b56-a0df-4db0fe81ffd5
level=info ts=2021-09-01T23:41:59.069239668Z caller=flush.go:237 msg="block completed" userid=single-tenant blockID=784eb654-00e1-4b56-a0df-4db0fe81ffd5 duration=8.905388ms
level=info ts=2021-09-01T23:41:59.070342999Z caller=flush.go:286 msg="flushing block" userid=single-tenant block=784eb654-00e1-4b56-a0df-4db0fe81ffd5
level=warn ts=2021-09-01T23:43:16.504507812Z caller=tcp_transport.go:417 msg="TCPTransport: WriteTo failed" addr=10.80.9.115:7946 err="dial tcp 10.80.9.115:7946: i/o timeout"
level=warn ts=2021-09-01T23:43:16.956429621Z caller=tcp_transport.go:417 msg="TCPTransport: WriteTo failed" addr=10.80.9.115:7946 err="dial tcp 10.80.9.115:7946: i/o timeout"
ts=2021-09-01T23:43:16.960604947Z caller=memberlist_logger.go:74 level=info msg="Suspect tempo-ingester-0-79089292 has failed, no acks received"
level=warn ts=2021-09-01T23:43:17.956781139Z caller=tcp_transport.go:417 msg="TCPTransport: WriteTo failed" addr=10.80.9.115:7946 err="dial tcp 10.80.9.115:7946: i/o timeout"
ts=2021-09-01T23:43:20.96099367Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to tempo-querier-77d8f74c65-nmfxh-a22e4983 but other probes failed, network may be misconfigured"
level=warn ts=2021-09-01T23:43:21.506394611Z caller=tcp_transport.go:417 msg="TCPTransport: WriteTo failed" addr=10.80.9.115:7946 err="dial tcp 10.80.9.115:7946: i/o timeout"
ts=2021-09-01T23:43:23.961516925Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to tempo-distributor-5fcd49778f-ctmr9-2c25ec86 but other probes failed, network may be misconfigured"
ts=2021-09-01T23:43:25.961862542Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to tempo-distributor-5fcd49778f-k6pbr-015e9cee but other probes failed, network may be misconfigured"
level=warn ts=2021-09-01T23:43:26.507620191Z caller=tcp_transport.go:417 msg="TCPTransport: WriteTo failed" addr=10.80.9.115:7946 err="dial tcp 10.80.9.115:7946: i/o timeout"
ts=2021-09-01T23:43:26.963058279Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to tempo-distributor-5fcd49778f-btzbj-71f3c44e but other probes failed, network may be misconfigured"
ts=2021-09-01T23:43:27.964445179Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to tempo-compactor-8574bdf84-95jj5-8df1f627 but other probes failed, network may be misconfigured"
ts=2021-09-01T23:43:28.966357922Z caller=memberlist_logger.go:74 level=error msg="Failed fallback ping: read tcp 10.80.2.140:46286->10.80.6.58:7946: i/o timeout"
ts=2021-09-01T23:43:28.966527529Z caller=memberlist_logger.go:74 level=info msg="Suspect tempo-compactor-8574bdf84-95jj5-8df1f627 has failed, no acks received"
level=warn ts=2021-09-01T23:43:31.513354974Z caller=tcp_transport.go:417 msg="TCPTransport: WriteTo failed" addr=10.80.9.115:7946 err="dial tcp 10.80.9.115:7946: i/o timeout"
ts=2021-09-01T23:43:31.525419572Z caller=memberlist_logger.go:74 level=info msg="Marking tempo-distributor-5fcd49778f-k6pbr-015e9cee as failed, suspect timeout reached (2 peer confirmations)"
level=warn ts=2021-09-01T23:43:38.956246873Z caller=tcp_transport.go:417 msg="TCPTransport: WriteTo failed" addr=10.80.9.115:7946 err="dial tcp 10.80.9.115:7946: i/o timeout"
level=warn ts=2021-09-01T23:43:47.955989224Z caller=tcp_transport.go:417 msg="TCPTransport: WriteTo failed" addr=10.80.9.115:7946 err="dial tcp 10.80.9.115:7946: i/o timeout"
ts=2021-09-01T23:45:13.95622907Z caller=memberlist_logger.go:74 level=error msg="Failed fallback ping: read tcp 10.80.2.140:53710->10.80.6.67:7946: i/o timeout"
ts=2021-09-01T23:45:13.956311127Z caller=memberlist_logger.go:74 level=info msg="Suspect tempo-distributor-5fcd49778f-k6pbr-015e9cee has failed, no acks received"
ts=2021-09-01T23:49:13.956038122Z caller=memberlist_logger.go:74 level=error msg="Failed fallback ping: read tcp 10.80.2.140:40340->10.80.6.66:7946: i/o timeout"
ts=2021-09-01T23:49:13.956157787Z caller=memberlist_logger.go:74 level=info msg="Suspect tempo-querier-77d8f74c65-8n7kj-f5a55181 has failed, no acks received"
ts=2021-09-01T23:51:29.216400017Z caller=memberlist_logger.go:74 level=warn msg="Refuting a suspect message (from: tempo-querier-77d8f74c65-8n7kj-f5a55181)"
ts=2021-09-01T23:52:13.955623524Z caller=memberlist_logger.go:74 level=error msg="Failed fallback ping: read tcp 10.80.2.140:42724->10.80.6.67:7946: i/o timeout"
ts=2021-09-01T23:53:03.956326595Z caller=memberlist_logger.go:74 level=error msg="Failed fallback ping: read tcp 10.80.2.140:45178->10.80.6.67:7946: i/o timeout"
ts=2021-09-01T23:53:03.95726031Z caller=memberlist_logger.go:74 level=info msg="Suspect tempo-distributor-5fcd49778f-k6pbr-015e9cee has failed, no acks received"

The service for the gossip ring is defined as:

apiVersion: v1
kind: Service
metadata:
  name: gossip-ring
  namespace: tracing
spec:
  sessionAffinity: None
  type: ClusterIP
  ports:
  - name: gossip-ring
    port: 7946
    protocol: TCP
    targetPort: 7946
  selector:
    tempo-gossip-member: "true"

More interestingly, the IPs that memberlist is attempting to contact do not exist. We saw similar behavior with Loki a while back, but that was fixed by allowing the distributors to auto-forget unhealthy nodes via the recently released -ingester.autoforget-unhealthy=true flag.

Could the same fix solve the issue in this situation?

Originally posted by @Kampe in https://github.com/grafana/tempo/issues/848#issuecomment-910885641

Kampe commented 3 years ago

OK, apparently you should not run Tempo's memberlist on the same port that Loki's memberlist runs on.
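
If you do want to move Tempo's gossip onto a different port than Loki's, here is a minimal sketch, assuming the bind_port option of the embedded memberlist config (an assumption; the gossip-ring Service port and any join_members addresses would have to be changed to match):

memberlist:
  bind_port: 7947   # assumed alternative port; the default is 7946
  join_members:
    - gossip-ring.tracing.svc.cluster.local:7947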

Kampe commented 3 years ago

Actually, after setting the service back to clusterIP: None, we see the same behavior. More context: we're running GKE 1.20 on private nodes.
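
For reference, the headless variant being described here differs from the gossip-ring Service shown above only by clusterIP: None; a sketch reusing the names from the manifest earlier in this issue:

apiVersion: v1
kind: Service
metadata:
  name: gossip-ring
  namespace: tracing
spec:
  clusterIP: None          # headless: DNS returns the pod IPs directly instead of a virtual IP
  ports:
  - name: gossip-ring
    port: 7946
    protocol: TCP
    targetPort: 7946
  selector:
    tempo-gossip-member: "true"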

Seeing this in the logs for the querier sidecar that lives on the query-frontend for its headless comms:

"[transport]transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:16685->127.0.0.1:45282: use of closed network connection"
joe-elliott commented 3 years ago

in fact I've even opened this port directly to the master nodes within GKE explicitly

You shouldn't need to open ports on nodes for this to work. The different processes just need to be able to communicate via UDP on the memberlist port. I would recommend closing any ports you've opened on your nodes directly and allowing the pods to communicate over k8s networking.

More interestingly, the IPs that memberlist is attempting to contact do not exist. We saw similar behavior with Loki a while back, but that was fixed by allowing the distributors to auto-forget unhealthy nodes via the recently released -ingester.autoforget-unhealthy=true flag.

Auto-forgetting ingesters shouldn't matter. These are memberlist errors, which are controlled by the memberlist cluster settings. There is a setting somewhere in there to forget dead nodes more aggressively.
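
For illustration, that kind of tuning lives under the memberlist block of the Tempo config. A minimal sketch, assuming the Cortex/dskit memberlist options Tempo embeds (option names and defaults may vary by version):

memberlist:
  join_members:
    - gossip-ring.tracing.svc.cluster.local:7946
  # assumed option names: reclaim the names/addresses of dead nodes sooner
  # and stop gossiping to them earlier
  dead_node_reclaim_time: 30s
  gossip_to_dead_nodes_time: 30s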

Are your rings healthy? Is Tempo working? When are you seeing these errors? We see stuff like this on rollout but everything continues to work fine. These errors may just be normal memberlist chatter as pods drop out on rollout.

Kampe commented 3 years ago

Tempo is not working at all: the rings never reach a healthy state, and vulture receives nothing but 404s from the frontends, unfortunately. We see these errors constantly; Tempo never makes it to a healthy state. :(

joe-elliott commented 3 years ago

If you do a kubectl describe svc, how many endpoints are listed on the svc? It should be all distributors, ingesters, compactors, and queriers.

From within the network namespace of one of the containers, can you connect to all other containers on 7946? It seems you have some intense network issues. Try debugging TCP-level connections between all of the pods.
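
A rough way to check this from the cluster itself (a sketch: the pod name is a placeholder, the peer IP 10.80.9.115 is taken from the logs above, and nc may not exist in the Tempo image, in which case an ephemeral debug container or a separate toolbox pod is needed):

kubectl -n tracing describe svc gossip-ring          # are all Tempo pods listed as endpoints?
kubectl -n tracing get pods -o wide                  # note each pod's IP
kubectl -n tracing exec -it <tempo-pod> -- nc -vz -w 2 10.80.9.115 7946   # repeat for every peer IP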

Kampe commented 3 years ago

Nothing strange in terms of our networking, aside from the fact that we use GKE private nodes. We run fully exploded deployments of Loki and Thanos as well, and both function fine (Loki shows similar timeout issues, just not as often, and it still functions). Not sure what's different here.

Name:              gossip-ring
Namespace:         tracing
Labels:            app.kubernetes.io/instance=tracing
Annotations:       <none>
Selector:          tempo-gossip-member=true
Type:              ClusterIP
IP Families:       <none>
IP:                10.48.10.66
IPs:               10.48.10.66
Port:              gossip-ring  7946/TCP
TargetPort:        7946/TCP
Endpoints:         10.44.0.17:7946,10.44.0.21:7946,10.44.0.22:7946 + 7 more...
Session Affinity:  None
Events:            <none>
mvkrishna86 commented 2 years ago

I am also facing this error on EKS. Some of the traces are missing, which could be because of this error:

level=error ts=2021-09-29T13:19:21.607323606Z caller=tcp_transport.go:313 msg="TCPTransport: unknown message type" msgType=G
ts=2021-09-29T13:19:24.799841435Z caller=memberlist_logger.go:74 level=warn msg="Refuting a suspect message (from: tempo-compactor-595888cf59-fdtv5-77c967f7)"
ts=2021-09-29T13:19:26.999147708Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to tempo-compactor-595888cf59-fdtv5-77c967f7 but other probes failed, network may be misconfigured"
ts=2021-09-29T13:19:28.000058002Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to tempo-querier-847b96df59-tmcr9-b5377673 but other probes failed, network may be misconfigured"
ts=2021-09-29T13:19:29.00087064Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to tempo-ingester-2-c207c4cb but other probes failed, network may be misconfigured"
ts=2021-09-29T13:19:30.001360547Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to tempo-ingester-0-1790077c but other probes failed, network may be misconfigured"
ts=2021-09-29T13:19:30.956657176Z caller=memberlist_logger.go:74 level=warn msg="Refuting a suspect message (from: tempo-compactor-595888cf59-fdtv5-77c967f7)"
ts=2021-09-29T13:19:31.002582643Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to tempo-distributor-6dbd749d86-xfgm4-b79faabc but other probes failed, network may be misconfigured"
ts=2021-09-29T13:19:34.999175631Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to tempo-distributor-6dbd749d86-xfgm4-b79faabc but other probes failed, network may be misconfigured"
ts=2021-09-29T13:19:36.01541191Z caller=memberlist_logger.go:74 level=info msg="Suspect tempo-ingester-2-c207c4cb has failed, no acks received"
ts=2021-09-29T13:19:39.336431026Z caller=memberlist_logger.go:74 level=warn msg="Refuting a suspect message (from: tempo-distributor-6dbd749d86-xfgm4-b79faabc)"
level=error ts=2021-09-29T13:19:45.814285282Z caller=tcp_transport.go:313 msg="TCPTransport: unknown message type" msgType=G
ts=2021-09-29T13:19:47.177893684Z caller=memberlist_logger.go:74 level=warn msg="Refuting a suspect message (from: tempo-compactor-595888cf59-fdtv5-77c967f7)"
ts=2021-09-29T13:19:50.998215926Z caller=memberlist_logger.go:74 level=info msg="Suspect tempo-ingester-2-c207c4cb has failed, no acks received"
level=error ts=2021-09-29T13:19:51.615225245Z caller=tcp_transport.go:313 msg="TCPTransport: unknown message type" msgType=G
ts=2021-09-29T13:20:03.291601219Z caller=memberlist_logger.go:74 level=warn msg="Refuting a suspect message (from: tempo-ingester-1-e33048a2)"
ts=2021-09-29T13:20:10.027464106Z caller=memberlist_logger.go:74 level=info msg="Suspect tempo-distributor-6dbd749d86-xfgm4-b79faabc has failed, no acks received"
ts=2021-09-29T13:20:12.803481918Z caller=memberlist_logger.go:74 level=warn msg="Refuting a suspect message (from: tempo-ingester-2-c207c4cb)"
level=error ts=2021-09-29T13:20:15.814699677Z caller=tcp_transport.go:313 msg="TCPTransport: unknown message type" msgType=G
ts=2021-09-29T13:20:16.014402937Z caller=memberlist_logger.go:74 level=info msg="Suspect tempo-ingester-1-e33048a2 has failed, no acks received"
level=error ts=2021-09-29T13:20:21.608440296Z caller=tcp_transport.go:313 msg="TCPTransport: unknown message type" msgType=G
ts=2021-09-29T13:20:45.998778023Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to tempo-querier-847b96df59-tmcr9-b5377673 but other probes failed, network may be misconfigured"
level=error ts=2021-09-29T13:20:46.830010012Z caller=tcp_transport.go:313 msg="TCPTransport: unknown message type" msgType=G
level=error ts=2021-09-29T13:20:51.60530996Z caller=tcp_transport.go:313 msg="TCPTransport: unknown message type" msgType=G
ts=2021-09-29T13:20:58.800716431Z caller=memberlist_logger.go:74 level=warn msg="Refuting a suspect message (from: tempo-ingester-0-1790077c)"
ts=2021-09-29T13:21:10.998845028Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to tempo-distributor-6dbd749d86-p769f-882ca03e but other probes failed, network may be misconfigured"
ts=2021-09-29T13:21:14.063398896Z caller=memberlist_logger.go:74 level=info msg="Suspect tempo-compactor-595888cf59-fdtv5-77c967f7 has failed, no acks received"
level=error ts=2021-09-29T13:21:15.810340014Z caller=tcp_transport.go:313 msg="TCPTransport: unknown message type" msgType=G
ts=2021-09-29T13:21:19.594754454Z caller=memberlist_logger.go:74 level=warn msg="Refuting a suspect message (from: tempo-ingester-2-c207c4cb)"
level=error ts=2021-09-29T13:21:21.605252769Z caller=tcp_transport.go:313 msg="TCPTransport: unknown message type" msgType=G
ts=2021-09-29T13:21:26.998418309Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to tempo-querier-847b96df59-tmcr9-b5377673 but other probes failed, network may be misconfigured"
ts=2021-09-29T13:21:33.802067923Z caller=memberlist_logger.go:74 level=warn msg="Refuting a suspect message (from: tempo-distributor-6dbd749d86-p769f-882ca03e)"
ts=2021-09-29T13:21:41.019170698Z caller=memberlist_logger.go:74 level=info msg="Suspect tempo-distributor-6dbd749d86-p769f-882ca03e has failed, no acks received"
joe-elliott commented 2 years ago

These errors should not impact trace ingestion, storage or querying. They are transient spam from memberlist that we need to better understand/clean up. We see these errors regularly on rollout or if a member of the Tempo cluster OOMs. They are ignorable.

rafaribe commented 2 years ago

Following because I'm getting the same warnings.

muneeb-dar commented 2 years ago

Was anyone following this thread able to figure out the root of the issue? I am getting the same error, and my traces are getting lost.

joe-elliott commented 2 years ago

Memberlist spams these error messages in our clusters as well. It is a best-effort communication protocol, and the result is a lot of "error" spam. These errors do not necessarily mean something is wrong in Tempo. Some things to check:

This value should generally be 0.

max(tempo_memberlist_client_cluster_node_health_score{})

This metric should be equal to the number of ingesters in your cluster. If one of your queriers is showing a lower value, it might not see every ingester, which would potentially indicate a memberlist/ring issue.

tempo_querier_ingester_clients{container="querier"}

This expression shows the age, in seconds, of the "oldest" active member of a ring. A large value means that memberlist is being slow about propagating the ring state. We regularly see values in the 30s range w/o issue.

max(time() - cortex_ring_oldest_member_timestamp{state="ACTIVE"})

Visit your various ring pages. If you check the /ingester/ring page on the distributor and see all the ingesters listed with a reasonable last heartbeat time, then things are working as intended.
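
For example, to inspect the ingester ring as the distributor sees it (a sketch: the service name is an assumption based on typical tempo-distributed naming, and the port should match your server.http_listen_port — 3200 by default, 3100 in the config later in this thread):

kubectl -n tracing port-forward svc/tempo-distributor 3200:3200
# then open http://localhost:3200/ingester/ring and check that every ingester
# is ACTIVE with a recent last heartbeat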

Although some may be having real memberlist issues, my guess is that for most operators this is a red herring. If all of the above signs indicate your cluster is healthy and you feel you are missing traces, we should probably be considering other possibilities.

EDIT: One final thought: we do see the above error logs commonly, but they are not continuous. We primarily see them during rollouts of the cluster. If you are seeing these errors continuously, it might also indicate a connectivity issue between pods.

alexandre1984rj commented 1 year ago

I'm having the same issue after upgrading the tempo-distributed chart from v0.17.2 to v0.19.3 in order to enable the metrics-generator. After some debugging I realized that it returns 404 page not found when trying to access the ring from the metrics-generator, ingester, and distributor. The only ring available is the compactor's. https://grafana.com/docs/tempo/latest/api_docs/#endpoints

joe-elliott commented 1 year ago

The ring endpoints are not available on all components. A list of ring endpoints and the components they are hosted on can be found here:

https://grafana.com/docs/tempo/latest/operations/consistent_hash_ring/#interacting-with-the-rings

alexandre1984rj commented 1 year ago

Hi @joe-elliott, thank you for your valuable feedback. After port-forwarding the distributor service I checked again and realized that I can access the ring from the metrics-generator and ingester, but I'm still getting a 404 page not found from the distributor. I'm using the tempo-distributed chart v0.19.3, and checking the config I noticed that, unlike the chart's values.yaml, the config from that link has no ring section for the distributor =/ This is how my configuration is set:

multitenancy_enabled: false
search_enabled: true
metrics_generator_enabled: true
compactor:
  compaction:
    block_retention: 48h
  ring:
    kvstore:
      store: memberlist
metrics_generator:
  ring:
    kvstore:
      store: memberlist
  processor:
    service_graphs:
      max_items: 10000
  storage:
    path: /var/tempo/wal
    remote_write:
      - send_exemplars: true
        url: https://<prometheus.url>/api/v1/write
distributor:
  ring:
    kvstore:
      store: memberlist
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
querier:
  frontend_worker:
    frontend_address: observability-tempo-tempo-distributed-query-frontend-discovery:9095
ingester:
  lifecycler:
    ring:
      replication_factor: 1
      kvstore:
        store: memberlist
    tokens_file_path: /var/tempo/tokens.json
memberlist:
  abort_if_cluster_join_fails: false
  join_members:
    - observability-tempo-tempo-distributed-gossip-ring
overrides:
  max_search_bytes_per_trace: 20000
  metrics_generator_processors:
  - service-graphs
  - span-metrics
  per_tenant_override_config: /conf/overrides.yaml
server:
  http_listen_port: 3100
  log_level: info
  log_format: logfmt
  grpc_server_max_recv_msg_size: 4.194304e+06
  grpc_server_max_send_msg_size: 4.194304e+06
storage:
  trace:
    backend: s3
    s3:
      bucket: <s3.bucket>
      endpoint: <s3.endpoint>
    blocklist_poll: 5m
    local:
      path: /var/tempo/traces
    wal:
      path: /var/tempo/wal
    cache: memcached
    memcached:
      consistent_hash: true
      host: observability-tempo-tempo-distributed-memcached
      service: memcached-client
      timeout: 500ms
alexandre1984rj commented 1 year ago

Hi @joe-elliott, just to let you know that my problem was solved after updating the ingestion_rate_strategy: https://grafana.com/docs/tempo/latest/operations/consistent_hash_ring/#:~:text=This%20ring%20is%20only%20used%20when%20%E2%80%9Cglobal%E2%80%9D%20rate%20limits%20are%20used Really appreciate your time and effort to help ;)
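
For anyone hitting the same 404: per the linked page, the distributor ring only exists when the "global" rate-limit strategy is in use, so a sketch of the relevant setting (key name as in the Tempo overrides block; whether it is set via the chart's values or a config override depends on your setup):

overrides:
  # the distributor ring (and its ring page) is only used with the global strategy
  ingestion_rate_strategy: global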

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.