thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Query via Thanos causes Prometheus to OOM #455

Closed: tdabasinskas closed this issue 5 years ago

tdabasinskas commented 6 years ago

Thanos, Prometheus and Golang version used

What happened

The query provided below times out (after the default 2-minute query timeout), but before it does, Prometheus gets OOM-killed.

What you expected to happen

Either the query completing successfully, or it timing out (assuming it's too complex) without bringing Prometheus down.

How to reproduce it (as minimally and precisely as possible):

We are using the following command to perform the query:

curl "https://thanosquery/api/v1/query_range?query=label_replace(%0A%20%20histogram_quantile(%0A%20%20%20%200.999%2C%0A%20%20%20%20sum(%20%20%20%20%20%20rate(join_joined_left_vs_right_timestamp_diff_bucket%7Bpartition%3D~%220%7C1%7C2%7C3%7C4%7C5%7C6%7C7%7C8%7C18%7C19%7C20%7C21%7C22%7C23%7C24%7C25%7C26%7C27%7C28%7C29%7C30%7C31%7C32%7C33%7C34%7C35%22%2Cmarathon_app%3D~%22%2Fsome%2Fapp%2Fregion%2Fjoin.*%22%7D%5B5m%5D)%0A%20%20%20%20)%20by%20(le%2C%20marathon_app%2C%20partition)%0A%20%20)%2C%20%0A%20%20%22region%22%2C%20%22%241%22%2C%20%22marathon_app%22%2C%20%22%2Fsome%2Fapp%2F(.*)%2Fjoin.*%22%0A)&start=1532343000&end=1532948400&step=600"

Running this once or twice always brings our nodes down. If we run the same query directly against Prometheus (not via Thanos Query), it completes quickly and successfully (we store 24h of data on Prometheus).

Anything else we need to know

We are running independent Docker hosts with the following containers:

Thanos Compactor runs independently on a different host. We have also tried running Thanos Store on a different host than the other containers.

One thing to note: if we run multiple nodes with this configuration (let's say 4) and put Thanos Query behind an Nginx-based load balancer, a query to one of the Thanos Query instances brings all 4 hosts down (because all 4 Prometheus instances get OOM-killed).

Environment:

Storage:

We are not sure whether the issue is caused by Prometheus or Thanos. Maybe it's somehow directly related to our setup (for example, the join_joined_left_vs_right_timestamp_diff_bucket metric having a huge number of different labels, which results in a large number of time series that cannot be handled when running the mentioned query). Anyway, any guidance or tips would be really appreciated.
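
A quick way to sanity-check that cardinality theory on our side is to count the series behind the metric directly on one Prometheus node; a minimal sketch, assuming an instance is reachable on localhost:9090:

# Count how many time series the bucket metric currently expands to
# (localhost:9090 is an assumed address; point it at one of your Prometheus instances):
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=count(join_joined_left_vs_right_timestamp_diff_bucket)'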

bwplotka commented 6 years ago

OK, this is very interesting, especially the part about querying Prometheus directly vs. through thanos-query and only seeing the OOM for the Thanos query.

So you are saying that Prometheus OOMs. Can you isolate that setup (turn off the store and all sidecars + Prometheus instances except one) and try to reproduce?

Questions:

One fact that matters here: the Sidecar (so queries via Thanos Query) uses Prometheus remote_read, which uses protobufs. For direct Prometheus queries you use just the HTTP endpoint, so maybe there is a bug in remote_read that causes a huge memory spike? It's worth looking at the Prometheus pprof heap as well.
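
For reference, a minimal way to capture such a heap profile (assuming Prometheus listens on localhost:9090 and the Go toolchain is available):

# Prometheus exposes the standard Go pprof endpoints under /debug/pprof/:
curl -s http://localhost:9090/debug/pprof/heap > prom-heap.pb.gz

# Render the profile as an SVG call graph:
go tool pprof -svg prom-heap.pb.gz > prom-heap.svg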

tdabasinskas commented 6 years ago

Hi, @Bplotka,

I'm attaching SVG heap traces from the pprof tool, hoping there's something suspicious visible there.

I'm also including docker stats output below to show what memory limits we had and how memory usage increased:

With Thanos
Startup:

CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
d219ebcc6a85        thanos-store        0.24%               3.288GiB / 40GiB      8.22%               0B / 0B             0B / 0B             20
703d0f88ac95        thanos-query        0.23%               19.98MiB / 20GiB      0.10%               0B / 0B             0B / 0B             18
52de1166d541        thanos-sidecar      0.26%               13.05MiB / 12GiB      0.11%               0B / 0B             0B / 18.4kB         19
2ad19964cf43        prometheus          39.91%              4.554GiB / 12GiB      37.95%              0B / 0B             0B / 32.4MB         31

1st query:
CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
d219ebcc6a85        thanos-store        0.29%               7.399GiB / 40GiB      18.50%              0B / 0B             0B / 0B             22
703d0f88ac95        thanos-query        0.25%               4.722GiB / 20GiB      23.61%              0B / 0B             0B / 0B             19
52de1166d541        thanos-sidecar      0.17%               5.414GiB / 12GiB      45.12%              0B / 0B             0B / 36.9kB         20
2ad19964cf43        prometheus          750.40%             8.011GiB / 12GiB      66.76%              0B / 0B             0B / 146MB          31

2nd query:
CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
d219ebcc6a85        thanos-store        0.18%               7.906GiB / 40GiB      19.77%              0B / 0B             0B / 0B             22
703d0f88ac95        thanos-query        0.26%               5.801GiB / 20GiB      29.00%              0B / 0B             0B / 0B             19
52de1166d541        thanos-sidecar      0.23%               5.419GiB / 12GiB      45.16%              0B / 0B             0B / 55.3kB         20
2ad19964cf43        prometheus          19.83%              8.782GiB / 12GiB      73.18%              0B / 0B             0B / 260MB          31

3rd query:
CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
d219ebcc6a85        thanos-store        0.23%               8.053GiB / 40GiB      20.13%              0B / 0B             0B / 0B             22
703d0f88ac95        thanos-query        0.20%               5.82GiB / 20GiB       29.10%              0B / 0B             0B / 0B             19
52de1166d541        thanos-sidecar      0.24%               5.529GiB / 12GiB      46.07%              0B / 0B             0B / 73.7kB         21
2ad19964cf43        prometheus          38.65%              8.785GiB / 12GiB      73.21%              0B / 0B             0B / 374MB          31

10x queries:
CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
d219ebcc6a85        thanos-store        0.33%               10.07GiB / 40GiB      25.17%              0B / 0B             0B / 0B             22
703d0f88ac95        thanos-query        0.31%               5.819GiB / 20GiB      29.10%              0B / 0B             0B / 0B             19
52de1166d541        thanos-sidecar      0.20%               5.533GiB / 12GiB      46.11%              0B / 0B             0B / 79.9kB         21
2ad19964cf43        prometheus          33.81%              12GiB / 12GiB         100.00%             0B / 0B             8.88MB / 758MB      31
Without Thanos
Startup:

CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
d3412c1c8ec4        prometheus          6.67%               3.572GiB / 12GiB    29.76%              0B / 0B             0B / 1.71MB         31

1st query:
d3412c1c8ec4        prometheus          21.56%              4.804GiB / 12GiB    40.04%              0B / 0B             0B / 31.5MB         31

2nd query:
CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
d3412c1c8ec4        prometheus          22.96%              5.403GiB / 12GiB    45.02%              0B / 0B             0B / 49.6MB         31

3rd query:
d3412c1c8ec4        prometheus          20.74%              5.432GiB / 12GiB    45.26%              0B / 0B             0B / 67.9MB         31

We even tried running the setup on a physical machine with 700GB of RAM, and we managed to reach ~300GB of memory usage by the Prometheus container when running 5 simultaneous {__name__=~".+"} queries (with the timeout increased to 10 minutes) against a 1400GB S3 bucket via Thanos Query, which we believe is still too much.

Again, even though the memory usage for queries via Thanos looks much bigger, maybe we are just trying to overcome the speed of light here 🙂 Any guidance on how to properly size a Thanos (and Prometheus) deployment (e.g. maybe there's a way to see how many samples Thanos Store sees?) would also help a lot.
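
One rough way we can get at the "how many samples does Thanos Store see" question is to look at its own bucket-store metrics; a sketch assuming the default HTTP port 10902 (metric names are from memory and may differ between Thanos versions):

# Thanos Store exposes Prometheus metrics on its HTTP port (default 10902).
# The bucket-store series give a feel for how much data each StoreAPI Series() call touches:
curl -s http://thanos-store:10902/metrics \
  | grep -E 'thanos_bucket_store_(series_data_touched|series_data_fetched|blocks_loaded)'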

bwplotka commented 6 years ago

I don't get one thing: why do you think Thanos Store (or anything in the bucket) matters here if it's the Prometheus container that OOMs, not any Thanos component?

tdabasinskas commented 6 years ago

That happens when the query goes through Thanos Query, and does not seem to happen when it's issued directly to Prometheus.

bwplotka commented 6 years ago

I doubt Thanos Store matters here. Can you disable Thanos Store for now and try to reproduce without it? (:

bwplotka commented 6 years ago

How many series do you have in Prometheus? Since query evaluation happens in Thanos Query, maybe the data is simply too big to be sent between Prometheus and the sidecar? But that sounds unrealistic. We might want to see exactly what the remote_read request looks like.

tdabasinskas commented 6 years ago

Hi @Bplotka,

We have around 8.5 million series (sum(prometheus_tsdb_head_series{fqdn=~"$node"})) across 4 Prometheus nodes (2 replica sets, with a single Thanos cluster on top).

You mentioned that queries via Thanos use remote_read (instead of the HTTP API used when querying Prometheus directly), right? If so, it seems people are reporting that Prometheus memory usage is indeed much higher when querying via remote_read (in this case, from Thanos Query):

"I think the main issue here is what executing queries through remote-read interface takes at least 10x more memory than executing them inside single Prometheus, whenever 1.8 or 2.x"

Of course, in that case, that would mean it's an issue with Prometheus, not with Thanos.

xjewer commented 6 years ago

https://github.com/prometheus/prometheus/pull/4532

xjewer commented 6 years ago

https://github.com/prometheus/prometheus/pull/4591

bwplotka commented 6 years ago

Plus this: https://github.com/improbable-eng/thanos/issues/488 (:

alexdepalex commented 5 years ago

Does anybody have an idea what the state of this is? We currently have something that looks like the same issue. When Grafana queries Prometheus directly, the dashboard finishes in just a couple of seconds. When querying through Thanos Query, memory usage of both the sidecar and Prometheus blows up, which eventually leads to OOM.

containerpope commented 5 years ago

Having the same issue: queries issued from Grafana through Thanos Querier blow away memory and crash an 8 GB Prometheus instance.

Prometheus Spec

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  annotations:
    flux.weave.works/antecedent: base-monitoring:helmrelease/prom-operator
  creationTimestamp: "2019-06-27T07:52:44Z"
  generation: 15
  labels:
    app: prometheus-operator-prometheus
    chart: prometheus-operator-5.15.0
    heritage: Tiller
    release: prom-operator
  name: prom-operator-prometheus-o-prometheus
  namespace: base-monitoring
  resourceVersion: "4214847"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/base-monitoring/prometheuses/prom-operator-prometheus-o-prometheus
  uid: 8c33fb10-98b0-11e9-8c8c-4e7d547fa7d0
spec:
  additionalScrapeConfigs:
    key: additional-scrape-configs.yaml
    name: prom-operator-prometheus-o-prometheus-scrape-confg
  alerting:
    alertmanagers:
    - name: prom-operator-prometheus-o-alertmanager
      namespace: base-monitoring
      pathPrefix: /
      port: web
  baseImage: quay.io/prometheus/prometheus
  enableAdminAPI: false
  externalLabels:
    cluster: prod001
    geo: eu
    region: euw
  externalUrl: http://prom-operator-prometheus-o-prometheus.base-monitoring:9090
  listenLocal: false
  logFormat: logfmt
  logLevel: info
  paused: false
  replicas: 2
  resources:
    limits:
      cpu: "2"
      memory: 8Gi
    requests:
      cpu: 100m
      memory: 3Gi
  retention: 30d
  routePrefix: /
  ruleNamespaceSelector: {}
  ruleSelector:
    matchLabels:
      release: prom-operator
  secrets:
  - istio.default
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prom-operator-prometheus-o-prometheus
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector:
    matchLabels:
      release: prom-operator
  storage:
    volumeClaimTemplate:
      selector: {}
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
        storageClassName: default
  thanos:
    image: quay.io/thanos/thanos:v0.6.0
    objectStorageConfig:
      key: thanos.yaml
      name: thanos-objstore-config
    resources:
      limits:
        cpu: "2"
        memory: 6Gi
      requests:
        cpu: 100m
        memory: 2Gi
    tag: v0.6.0
  version: v2.10.0

The nodes themselves have sufficient resources available, and in a direct comparison, Grafana queries issued directly don't cause any load on the Prometheus instances. What makes this devastating is that both Prometheus instances get killed, because the query then continues to the next one.

We have two clusters in this scenario: one with the Prometheus instance, one with Thanos Query and Grafana. The query only touches metrics stored in Prometheus.

Maybe it would be a good idea to decrease the retention time to a few hours instead of 30d?

There are 236,834 time series in the server right now.

Currently the ingestion rate is about 15k (see attached screenshot).

This is the query I was running over a duration of 7 days.

histogram_quantile(0.50, sum(rate(istio_request_duration_seconds_bucket{cluster="prod001", destination_workload_namespace=~"mdr-.*", source_app="istio-ingressgateway", destination_workload!="carstreamingservice"}[6h])) by (le, destination_workload_namespace))
bradleybluebean commented 5 years ago

@alexdepalex, @mkjoerg, not sure if you're both using Kubernetes, but I had a similar issue and realized the k8s Service my Query pod was configured to look at for Store API servers (via the --store=dnssrv+ flag) inadvertently contained the Query pod itself, so it was causing a loop when handling query requests.
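
In case it helps anyone checking for the same loop, a minimal sketch (service name below is hypothetical): the SRV record behind --store=dnssrv+ should only cover Store API servers (sidecars, store gateways, rulers), never the querier's own gRPC address.

# Resolve the SRV record the querier uses (service name is illustrative):
dig +short SRV _grpc._tcp.thanos-store-gateway.monitoring.svc.cluster.local

# The querier flag in question; the endpoints it resolves to must not include the querier itself:
thanos query \
  --store=dnssrv+_grpc._tcp.thanos-store-gateway.monitoring.svc.cluster.local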

alexdepalex commented 5 years ago

Unfortunately, we're not running on k8s yet. I'll check our config tomorrow, but it seems more related to the issues with the remote read endpoint. Anyway, I'm still waiting for a "release" version of v2.12.0-rc.0-rr-streaming in order to test it.

krasi-georgiev commented 5 years ago

This is fixed with remote read streaming. Memory usage is now a few times lower: https://github.com/thanos-io/thanos/pull/1268

Upgrade to the latest Thanos version and enjoy this great improvement :smiley_cat:

alexdepalex commented 5 years ago

Still waiting for the next Prometheus release, which includes streaming, right?

GiedriusS commented 5 years ago

Yep, with the new Sidecar the RAM usage is constant and Prometheus uses noticeably less RAM due to how the new remote read interface works. @krasi-georgiev thanks for bumping this ticket! Closing this.

krasi-georgiev commented 5 years ago

@alexdepalex, ah yeah, the PR got merged just after 2.12 was cut, so you need to wait for 2.13 or just use the master image. https://github.com/prometheus/prometheus/commits/master?after=26e8d25e0b0d3459e5901805de992acf1d5eeeaa+34

bwplotka commented 5 years ago

Wait, or use the image we prepared, which is essentially 2.12 + the extended remote read protocol: quay.io/thanos/prometheus:v2.12.0-rc.0-rr-streaming. It's already used in production. (:

vermaabhay commented 4 years ago

"High load on all prometheus nodes due to query fanout from thanos querier"

We have implemented just the Querier and Sidecar components of Thanos to enable HA (fill gaps in Grafana dashboards) for a 2-node Prometheus setup. Since we are executing a very heavy query, the load was already considerably high on one of the Prometheus nodes before implementing Thanos; now the Querier fans the query out to both nodes and creates similarly high load on both of them instead of just one. We have also increased the CPU cores to tackle the problem, but it isn't helping. Any leads would be highly appreciated.

Prometheus version 2.13.0, Thanos version 0.8.1