thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Max and min pointed at Sidecars not working on 0.35 #7368

Closed AlexDCraig closed 1 month ago

AlexDCraig commented 2 months ago

Thanos, Prometheus and Golang version used: docker.io/bitnami/thanos:0.35.0-debian-12-r4, sidecar version v0.34.1, prometheus version v2.50.1

Object Storage Provider: Azure

What happened:

Using Thanos Query, I can no longer use the max() or min() operators and have them work with my sidecars, because the query that goes from Query -> Sidecar has fundamentally changed. For instance, when I run:

max(jvm_gc_pause_seconds_max{cluster="dev", pod=~"podname.*"}) by (pod)

It yields this query on the sidecar:

[prometheus-k-prom-prometheus-operator-prometheus-0 thanos-sidecar] ts=2024-05-16T22:43:15.966821191Z caller=promclient.go:547 level=debug msg="range query" url="http://127.0.0.1:9090/api/v1/query_range?analyze=false&dedup=false&end=1715899348&engine=&explain=false&partial_response=true&query=max+by+%28pod%29+%28%7Bcluster%3D%22dev%22%2C+pod%3D~%22podname.%2A%22%2C+__name__%3D%22jvm_gc_pause_seconds_max%22%7D%29&start=1715877462&step=86"
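
Decoded, the query parameter in that URL is:

max by (pod) ({cluster="dev", pod=~"podname.*", __name__="jvm_gc_pause_seconds_max"})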

This is new behavior; with Thanos 0.34 this log doesn't appear at all. The logged query above won't load anything on the Sidecar, because the local Prometheus has no "cluster" label on its series; that's an external label added in transit.

This seems to happen with max() and min(), which now end up as range queries against the sidecar; it doesn't happen with avg() or sum().

What you expected to happen:

Query can reach the Sidecar and, just like in past versions, load recent data from it and aggregate using max() or min().

How to reproduce it (as minimally and precisely as possible):

Use the versions above. Ship Prometheus data to object storage via a sidecar every 2 hours, applying external labels such as cluster on the way out. Notice that the most recent 2 hours of data are missing when running a max() query, while data from object storage still loads.
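
A minimal sketch of this check from the command line, assuming a reachable Query endpoint at thanos-query.example.com (a placeholder; adjust host, metric, and labels to your environment):

curl -sG 'http://thanos-query.example.com:9090/api/v1/query_range' \
  --data-urlencode 'query=max(jvm_gc_pause_seconds_max{cluster="dev"}) by (pod)' \
  --data-urlencode "start=$(date -d '-3 hours' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=60'

On 0.35 with 0.34.x sidecars, the result is missing roughly the most recent 2 hours, while older data served from object storage still comes back.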

Full logs to relevant components: The interesting log is shared above.

Anything else we need to know:

MichaHoffmann commented 2 months ago

Hey,

can you please share your configuration of the related components? Something is odd; Thanos sidecars usually don't issue range queries.

AlexDCraig commented 2 months ago

@MichaHoffmann Just want to highlight that 0.34 with the exact same config doesn't have this problem. Here's what I'm supplying to the various components.

Query:

- args:
        - query
        - --log.level=info
        - --log.format=logfmt
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --query.replica-label=prometheus_replica
        - --endpoint=dnssrv+_grpc._tcp.thanos-bitnami-storegateway-headless.thanos-bitnami.svc.cluster.local
        - --endpoint=t-1-0.thanos.mydomain.com:443
        - --endpoint=t-1-1.thanos.mydomain.com:443
        - --endpoint=s2-0.thanos.mydomain.com:443
        - --endpoint=s2-1.thanos.mydomain.com:443
        - --endpoint=p2-0.thanos.mydomain.com:443
        - --endpoint=p2-1.thanos.mydomain.com:443
        - --endpoint=ci-0.thanos.mydomain.com:443
        - --endpoint=d3-0.thanos.mydomain.com:443
        - --endpoint=d3-1.thanos.mydomain.com:443
        - --endpoint=pt-0.thanos.mydomain.com:443
        - --endpoint=pt-1.thanos.mydomain.com:443
        - --endpoint=ps-0.thanos.mydomain.com:443
        - --endpoint=ps-1.thanos.mydomain.com:443
        - --endpoint=ss-0.thanos.mydomain.com:443
        - --endpoint=ss-1.thanos.mydomain.com:443
        - --endpoint=i0.thanos.mydomain.com:443
        - --endpoint=i1.thanos.mydomain.com:443
        - --endpoint=cu-0.thanos.mydomain.com:443
        - --endpoint=cu-1.thanos.mydomain.com:443
        - --endpoint=ee-0.thanos.mydomain.com:443
        - --endpoint=ee-1.thanos.mydomain.com:443
        - --endpoint=l2-0.thanos.mydomain.com:443
        - --endpoint=l2-1.thanos.mydomain.com:443
        - --endpoint=dnssrv+_grpc._tcp.thanos-receiver-headless.thanos-receiver.svc.cluster.local
        - --alert.query-url=https://thanos-query-frontend-bitnami.mydomain.com
        - --query.auto-downsampling
        - --grpc-client-tls-secure
        - --grpc-client-tls-skip-verify
        - --grpc-client-tls-cert=/etc/certs/client.crt
        - --grpc-client-tls-key=/etc/certs/client.key
        - --grpc-client-tls-ca=/etc/certs/ca.crt

Query Frontend:

- args:
        - query-frontend
        - --log.level=info
        - --log.format=logfmt
        - --http-address=0.0.0.0:9090
        - --query-frontend.downstream-url=http://thanos-bitnami-query:9090
        - --query-range.split-interval=12h
        - --query-frontend.compress-responses
        - |
          --query-range.response-cache-config=
          type: IN-MEMORY
          config:
            max_size: 2GB

Store:

- args:
        - store
        - --log.level=info
        - --log.format=logfmt
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --data-dir=/data
        - --objstore.config-file=/conf/objstore.yml
        - --sync-block-duration=3m
        - --grpc-server-tls-cert=/etc/certs/server.crt
        - --grpc-server-tls-key=/etc/certs/server.key
        - --grpc-server-tls-client-ca=/etc/certs/ca.crt

Let me know if this is sufficient, or if there's more config you'd like to see. Thanks!

MichaHoffmann commented 2 months ago

Can you also please share the configuration of the sidecar that is logging the error?

AlexDCraig commented 2 months ago

Sidecar:

- args:
        - sidecar
        - --prometheus.url=http://127.0.0.1:9090/
        - '--prometheus.http-client={"tls_config": {"insecure_skip_verify":true}}'
        - --grpc-address=:10901
        - --http-address=:10902
        - --objstore.config=$(OBJSTORE_CONFIG)
        - --tsdb.path=/prometheus
        - --log.level=debug
        - --log.format=logfmt

MichaHoffmann commented 2 months ago

The thing that is really weird to me is that the only component that actually runs that code (QueryRange from promclient.go) is the Thanos Ruler, but the log statement you shared indicates that it's from a container named "thanos-sidecar". Do you by chance run a ruler too?
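
A generic way to rule out a stray ruler on a Kubernetes cluster, not specific to this setup, is to look for any ruler-like pods:

kubectl get pods -A | grep -i rule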

MichaHoffmann commented 2 months ago

Could it be that some sidecars are on a version before 0.34.0 and use the query pushdown feature? We removed all raw PromQL queries from sidecars in https://github.com/thanos-io/thanos/pull/7014/commits/f29b338cd9d885c17944a448419e3a58d5a573a7
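
A simple way to spot-check the Thanos versions in use across a cluster, assuming kubectl access, is to list every container image and filter for Thanos:

kubectl get pods -A -o jsonpath='{range .items[*].spec.containers[*]}{.image}{"\n"}{end}' | sort | uniq -c | grep -i thanos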

AlexDCraig commented 2 months ago

@MichaHoffmann No, all Thanos sidecars are version 0.34.1:

- --thanos-default-base-image=quay.io/thanos/thanos:v0.34.1

Also, we don't use Thanos Ruler; or at least, we don't have a Thanos Ruler deployment running, nor do we intend to. The Thanos sidecars on the remote clusters are configured via the Prometheus Operator.

MichaHoffmann commented 2 months ago

Sorry, it changes nothing, but just to correct myself: that change was released in 0.34.1. The only way I could understand this is if you were running a sidecar with a version before 0.34.1. Something is pretty weird here; can you spot-check the Thanos version of the sidecar that logs that line, just to be extra sure?

AlexDCraig commented 2 months ago
k get pod prometheus-k-prom-prometheus-operator-prometheus-0 -n monitoring -o yaml

apiVersion: v1
kind: Pod
metadata:
  name: prometheus-k-prom-prometheus-operator-prometheus-0
  namespace: monitoring
spec:
  containers:
 ...
  - args:
    - sidecar
    - --prometheus.url=http://127.0.0.1:9090/
    - '--prometheus.http-client={"tls_config": {"insecure_skip_verify":true}}'
    - --grpc-address=:10901
    - --http-address=:10902
    - --objstore.config=$(OBJSTORE_CONFIG)
    - --tsdb.path=/prometheus
    - --log.level=debug
    - --log.format=logfmt
    image: quay.io/thanos/thanos:v0.34.1
    imagePullPolicy: IfNotPresent
    name: thanos-sidecar

MichaHoffmann commented 2 months ago

I mean, can you run something like "thanos --version" inside the container?

AlexDCraig commented 2 months ago

Certainly:

k exec -it prometheus-k-prom-prometheus-operator-prometheus-0 -c thanos-sidecar -n monitoring -- /bin/sh

~ $ thanos --version
thanos, version 0.34.1 (branch: HEAD, revision: 4cf1559998bf6d8db3f9ca0fde2a00d217d4e23e)
  build user:       root@61db75277a55
  build date:       20240219-17:13:48
  go version:       go1.21.7
  platform:         linux/amd64
  tags:             netgo

pvlltvk commented 2 months ago

Hi guys! I can confirm that we have the same issue in our environment. We use 0.34.0 for the sidecars, and after upgrading Thanos Query to 0.35.0 the min/max operators stopped working, just as @AlexDCraig described.

MichaHoffmann commented 2 months ago

@pvlltvk does it work again if you upgrade sidecars?

pvlltvk commented 1 month ago

@MichaHoffmann Yes, I can confirm that after upgrading the sidecar to 0.35.0 it works again.

MichaHoffmann commented 1 month ago

> @MichaHoffmann Yes, I can confirm that after upgrading the sidecar to 0.35.0 it works again.

Awesome, thanks for confirming
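
For readers who manage sidecars through the Prometheus Operator as above, the confirmed fix amounts to bumping the sidecar image to 0.35.0 (or later), e.g. via the operator flag shown earlier:

- --thanos-default-base-image=quay.io/thanos/thanos:v0.35.0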