thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

receive: wasting CPU on GC #7016

dctrwatson opened this issue 10 months ago (status: Open)

Thanos, Prometheus and Golang version used:

thanos, version 0.32.5 (branch: HEAD, revision: 750e8a94eed5226cd4562117295d540a968c163c)
  build user:       root@053ebc7b5322
  build date:       20231019-04:13:41
  go version:       go1.21.3
  platform:         linux/amd64
  tags:             netgo

Object Storage Provider: s3

What happened: After running for some time, receive pegs the CPU.

What you expected to happen: CPU usage proportional to the write load.

How to reproduce it (as minimally and precisely as possible):

Full logs to relevant components:

Anything else we need to know: https://pprof.me/aa81313c5472bcec2e81765384e11748
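
For anyone wanting to reproduce a profile like the one linked above, a minimal sketch, assuming the Go pprof endpoints are served on the receiver's HTTP port (10902, per the flags shared below):

  # 30s CPU profile; in GC-bound cases most samples land in runtime
  # functions such as runtime.gcBgMarkWorker / runtime.gcDrain
  go tool pprof -seconds 30 http://<receiver>:10902/debug/pprof/profile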

GiedriusS commented 10 months ago

What options do you have enabled? Have you tried --writer.intern?

fpetkovski commented 10 months ago

To add to that, it would be great to see the full configuration of the receiver, including flags and the hashring. It looks like it's stuck in GC, so I wonder if there is a routing loop.

dctrwatson commented 10 months ago

Ingester flags:

receive
--log.level=info
--log.format=logfmt
--grpc-address=0.0.0.0:10901
--http-address=0.0.0.0:10902
--remote-write.address=0.0.0.0:19291
--objstore.config=$(OBJSTORE_CONFIG)
--tsdb.path=/var/thanos/receive
--label=thanos_receive_replica="$(NAME)"
--label=receive="true"
--tsdb.retention=26h
--receive.local-endpoint=$(NAME).thanos-receive-headless.$(NAMESPACE).svc.cluster.local.:10901
--grpc-server-tls-cert=/cert/tls.crt
--grpc-server-tls-key=/cert/tls.key
--grpc-server-tls-client-ca=/cert/ca.crt
--label=metrics_namespace="global"
--receive.tenant-label-name=cluster
--receive.default-tenant-id=unknown
--receive.hashrings-file-refresh-interval=1m
--remote-write.server-tls-cert=/cert/tls.crt
--remote-write.server-tls-client-ca=/cert/ca.crt
--remote-write.server-tls-key=/cert/tls.key
--tsdb.memory-snapshot-on-shutdown
--tsdb.max-block-duration=1h
--tsdb.min-block-duration=1h
--writer.intern

We're also running a separate distributor (router) tier with these flags:

receive
--log.level=info
--log.format=logfmt
--grpc-address=0.0.0.0:10901
--http-address=0.0.0.0:10902
--remote-write.address=0.0.0.0:19291
--label=replica="$(NAME)"
--label=receive="true"
--receive.hashrings-file=/var/lib/thanos-receive/hashrings.json
--receive.replication-factor=1
--grpc-server-tls-cert=/cert/tls.crt
--grpc-server-tls-key=/cert/tls.key
--grpc-server-tls-client-ca=/cert/ca.crt
--receive.grpc-compression=snappy
--receive-forward-timeout=30s
--receive.hashrings-algorithm=ketama
--receive.hashrings-file-refresh-interval=1m
--receive.relabel-config=$(RECEIVE_RELABEL_CONFIG)
--receive.tenant-label-name=cluster
--receive.default-tenant-id=unknown
--remote-write.client-tls-ca=/cert/ca.crt
--remote-write.client-tls-cert=/cert/tls.crt
--remote-write.client-tls-key=/cert/tls.key
--remote-write.client-server-name=thanos-receive-headless.monitoring.svc.cluster.local
--remote-write.server-tls-cert=/cert/tls.crt
--remote-write.server-tls-client-ca=/cert/ca.crt
--remote-write.server-tls-key=/cert/tls.key

The hashring is managed by https://github.com/observatorium/thanos-receive-controller
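
For reference, a sketch of the hashrings.json shape that controller typically generates; the hostnames here are placeholders, not the real file. The point relevant to the routing-loop question above is that the endpoints list only the ingester pods, never the routers themselves:

  [
    {
      "hashring": "default",
      "tenants": [],
      "endpoints": [
        "thanos-receive-0.thanos-receive-headless.monitoring.svc.cluster.local:10901",
        "thanos-receive-1.thanos-receive-headless.monitoring.svc.cluster.local:10901"
      ]
    }
  ]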

fpetkovski commented 10 months ago

I cannot see anything wrong in the configuration. Maybe you can take a look at an allocation profile to see where objects are being allocated.
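
A minimal sketch of how to capture that, assuming the pprof endpoints are exposed on the ingester's HTTP port (10902):

  # cumulative allocations since process start (where the churn comes from)
  go tool pprof http://<ingester>:10902/debug/pprof/allocs

  # live heap, for comparison against the allocation churn
  go tool pprof http://<ingester>:10902/debug/pprof/heap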

MichaHoffmann commented 9 months ago

Is the problematic one the router or the ingester?

dctrwatson commented 9 months ago

> Is the problematic one the router or the ingester?

Ingester

GiedriusS commented 1 week ago

Please try out the capnproto replication available on main.
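
If I'm reading main correctly, this is opt-in on the routing receivers via a replication-protocol flag, roughly as below; the exact flag name should be confirmed against thanos receive --help built from main:

  receive
  ...
  --receive.replication-protocol=capnproto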