Ok, this is very interesting, especially the part about querying Prometheus directly vs. through thanos-query and getting OOM only on the thanos-query path.
So you are saying that Prometheus OOMs.. Can you isolate that setup, turn off the store and all sidecars + Proms but one, and try to repro?
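Something along these lines could serve as the isolated repro. This is a sketch only: --http-address and --store are real Querier flags, but the addresses are placeholders for the single remaining sidecar:

    # Standalone Querier pointed at just one sidecar's gRPC endpoint
    thanos query \
      --http-address=0.0.0.0:10902 \
      --store=127.0.0.1:10901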
Questions:
One fact that matters here: the Sidecar (so queries via Thanos-query) uses Prometheus remote_read, which uses protobufs. For direct Prometheus queries you use just the HTTP endpoint, so maybe there is some bug in remote_read that causes a huge mem spike? Worth looking at the Prometheus pprof heap as well.
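For reference, a minimal sketch of grabbing such a heap profile, assuming Prometheus listens on localhost:9090 (the /debug/pprof/heap endpoint is built into Prometheus; the output filename is arbitrary):

    # Fetch a heap profile and render it as an SVG call graph
    # (requires the Go toolchain and graphviz)
    go tool pprof -svg http://localhost:9090/debug/pprof/heap > prometheus-heap.svg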
Hi @Bplotka,
I'm attaching SVG heap traces from the pprof tool, hoping there's something suspicious visible there. I'm also including docker stats output below, to better show what memory limits we had and how memory increased:
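The snapshots below were captured roughly like this (a sketch, assuming the stock Docker CLI):

    # One-shot snapshot of per-container resource usage, no live refresh
    docker stats --no-stream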
Startup (queries via Thanos Query):
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
d219ebcc6a85 thanos-store 0.24% 3.288GiB / 40GiB 8.22% 0B / 0B 0B / 0B 20
703d0f88ac95 thanos-query 0.23% 19.98MiB / 20GiB 0.10% 0B / 0B 0B / 0B 18
52de1166d541 thanos-sidecar 0.26% 13.05MiB / 12GiB 0.11% 0B / 0B 0B / 18.4kB 19
2ad19964cf43 prometheus 39.91% 4.554GiB / 12GiB 37.95% 0B / 0B 0B / 32.4MB 31
1st query:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
d219ebcc6a85 thanos-store 0.29% 7.399GiB / 40GiB 18.50% 0B / 0B 0B / 0B 22
703d0f88ac95 thanos-query 0.25% 4.722GiB / 20GiB 23.61% 0B / 0B 0B / 0B 19
52de1166d541 thanos-sidecar 0.17% 5.414GiB / 12GiB 45.12% 0B / 0B 0B / 36.9kB 20
2ad19964cf43 prometheus 750.40% 8.011GiB / 12GiB 66.76% 0B / 0B 0B / 146MB 31
2nd query:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
d219ebcc6a85 thanos-store 0.18% 7.906GiB / 40GiB 19.77% 0B / 0B 0B / 0B 22
703d0f88ac95 thanos-query 0.26% 5.801GiB / 20GiB 29.00% 0B / 0B 0B / 0B 19
52de1166d541 thanos-sidecar 0.23% 5.419GiB / 12GiB 45.16% 0B / 0B 0B / 55.3kB 20
2ad19964cf43 prometheus 19.83% 8.782GiB / 12GiB 73.18% 0B / 0B 0B / 260MB 31
3rd query:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
d219ebcc6a85 thanos-store 0.23% 8.053GiB / 40GiB 20.13% 0B / 0B 0B / 0B 22
703d0f88ac95 thanos-query 0.20% 5.82GiB / 20GiB 29.10% 0B / 0B 0B / 0B 19
52de1166d541 thanos-sidecar 0.24% 5.529GiB / 12GiB 46.07% 0B / 0B 0B / 73.7kB 21
2ad19964cf43 prometheus 38.65% 8.785GiB / 12GiB 73.21% 0B / 0B 0B / 374MB 31
10x queries:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
d219ebcc6a85 thanos-store 0.33% 10.07GiB / 40GiB 25.17% 0B / 0B 0B / 0B 22
703d0f88ac95 thanos-query 0.31% 5.819GiB / 20GiB 29.10% 0B / 0B 0B / 0B 19
52de1166d541 thanos-sidecar 0.20% 5.533GiB / 12GiB 46.11% 0B / 0B 0B / 79.9kB 21
2ad19964cf43 prometheus 33.81% 12GiB / 12GiB 100.00% 0B / 0B 8.88MB / 758MB 31
Startup (same query issued directly against Prometheus):
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
d3412c1c8ec4 prometheus 6.67% 3.572GiB / 12GiB 29.76% 0B / 0B 0B / 1.71MB 31
1st query:
d3412c1c8ec4 prometheus 21.56% 4.804GiB / 12GiB 40.04% 0B / 0B 0B / 31.5MB 31
2nd query:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
d3412c1c8ec4 prometheus 22.96% 5.403GiB / 12GiB 45.02% 0B / 0B 0B / 49.6MB 31
3rd query:
d3412c1c8ec4 prometheus 20.74% 5.432GiB / 12GiB 45.26% 0B / 0B 0B / 67.9MB 31
We even tried running the setup on a physical machine with 700GB of RAM, and we managed to reach ~300GB of memory usage by the Prometheus container when running 5 simultaneous {__name__=~".+"} queries (with the timeout increased to 10 minutes) against the 1.4TB S3 bucket via Thanos Query, which we believe is still too much.
Again, even though the memory usage for queries via Thanos looks much bigger, maybe we are just trying to overcome the speed of light here 🙂 Any guidance on how to properly size a Thanos (and Prometheus) deployment (e.g. maybe there's a way to see how many samples Thanos Store sees?) would help a lot as well.
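One hedged way to answer that sizing question ourselves would be to scrape Thanos Store's own metrics endpoint; this assumes the default HTTP port 10902 and a reachable thanos-store hostname:

    # Thanos Store exports Prometheus metrics about the data it touches;
    # the thanos_bucket_store_* family gives a feel for per-query volume.
    curl -s http://thanos-store:10902/metrics | grep thanos_bucket_store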
I don't get one thing.. Why do you think Thanos Store matters here (or anything in the bucket) if it is the Prometheus container that OOMs, not any Thanos component?
That happens when the query goes through Thanos Query, and does not seem to happen when it's issued directly to Prometheus.
I doubt thanos store matters here. Can you disable thanos store for now? (: And try to repro without it?
How many series do you have in Prometheus? So basically, what can happen is that since query evaluation happens on Thanos Query, maybe the data size is just too big to be sent between Prometheus and the sidecar? But that sounds unrealistic. We might want to see exactly how the remote_read request looks.
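A crude but hedged way to look at the raw remote_read exchange, assuming the sidecar reaches Prometheus on port 9090 (the body is snappy-compressed protobuf, so the capture is mainly useful for sizing the request and response):

    # Capture sidecar <-> Prometheus traffic for offline inspection
    tcpdump -i any -w remote_read.pcap 'tcp port 9090'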
Hi @Bplotka,
We have around 8.5 million series (sum(prometheus_tsdb_head_series{fqdn=~"$node"})) across 4 Prometheus nodes (2 replica sets, with a single Thanos cluster on top).
You mentioned that queries via Thanos use remote_read (instead of the HTTP API used when querying Prometheus directly), right? If so, it seems people are reporting that Prometheus memory usage is indeed greatly increased when querying via remote_read (in this case, from Thanos Query):
"I think the main issue here is what executing queries through remote-read interface takes at least 10x more memory than executing them inside single Prometheus, whenever 1.8 or 2.x"
Of course, in that case, that would mean it's an issue with Prometheus, not with Thanos.
Plus this: https://github.com/improbable-eng/thanos/issues/488 (:
Does anybody have an idea what the state of this is? We currently have what looks like the same issue. When Grafana queries Prometheus directly, the dashboard finishes in just a couple of seconds. When querying through Thanos Query, memory usage of both the sidecar and Prometheus blows up, which eventually leads to OOM.
Having the same issues: queries through Grafana crash an 8 GB Prometheus instance, while the Thanos Querier is blowing away memory.
Prometheus Spec
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  annotations:
    flux.weave.works/antecedent: base-monitoring:helmrelease/prom-operator
  creationTimestamp: "2019-06-27T07:52:44Z"
  generation: 15
  labels:
    app: prometheus-operator-prometheus
    chart: prometheus-operator-5.15.0
    heritage: Tiller
    release: prom-operator
  name: prom-operator-prometheus-o-prometheus
  namespace: base-monitoring
  resourceVersion: "4214847"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/base-monitoring/prometheuses/prom-operator-prometheus-o-prometheus
  uid: 8c33fb10-98b0-11e9-8c8c-4e7d547fa7d0
spec:
  additionalScrapeConfigs:
    key: additional-scrape-configs.yaml
    name: prom-operator-prometheus-o-prometheus-scrape-confg
  alerting:
    alertmanagers:
    - name: prom-operator-prometheus-o-alertmanager
      namespace: base-monitoring
      pathPrefix: /
      port: web
  baseImage: quay.io/prometheus/prometheus
  enableAdminAPI: false
  externalLabels:
    cluster: prod001
    geo: eu
    region: euw
  externalUrl: http://prom-operator-prometheus-o-prometheus.base-monitoring:9090
  listenLocal: false
  logFormat: logfmt
  logLevel: info
  paused: false
  replicas: 2
  resources:
    limits:
      cpu: "2"
      memory: 8Gi
    requests:
      cpu: 100m
      memory: 3Gi
  retention: 30d
  routePrefix: /
  ruleNamespaceSelector: {}
  ruleSelector:
    matchLabels:
      release: prom-operator
  secrets:
  - istio.default
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prom-operator-prometheus-o-prometheus
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector:
    matchLabels:
      release: prom-operator
  storage:
    volumeClaimTemplate:
      selector: {}
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
        storageClassName: default
  thanos:
    image: quay.io/thanos/thanos:v0.6.0
    objectStorageConfig:
      key: thanos.yaml
      name: thanos-objstore-config
    resources:
      limits:
        cpu: "2"
        memory: 6Gi
      requests:
        cpu: 100m
        memory: 2Gi
    tag: v0.6.0
  version: v2.10.0
The nodes themselves have sufficient resources available, and in a direct comparison, Grafana queries don't cause any load on the Prometheus instances. What makes this devastating is that both Prometheus instances get killed, because the query continues on to the next one.
We have two clusters in this scenario, one with the Prometheus instance, one with Thanos Query and Grafana. The query goes only to metrics stored in Prometheus.
Maybe it would be a good idea to decrease the retention time to some hours instead of 30d?
There are 236,834 time series on the server right now.
Currently the ingestion rate is about 15k.
This is the query I was running over a duration of 7 days.
histogram_quantile(0.50, sum(rate(istio_request_duration_seconds_bucket{cluster="prod001", destination_workload_namespace=~"mdr-.*", source_app="istio-ingressgateway", destination_workload!="carstreamingservice"}[6h])) by (le, destination_workload_namespace))
@alexdepalex, @mkjoerg not sure if you're both using Kubernetes, but I had a similar issue and realized that the k8s Service my Query pod was configured to look at for Store API servers (via the --store=dnssrv+ flag) inadvertently contained the Query pod itself, so it was causing a loop when trying to handle query requests.
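A quick hedged check for that kind of loop; the namespace, Service name, and label here are placeholders for your own:

    # List the endpoints behind the Store API Service and verify the
    # Querier pod's own IP is not among them
    kubectl -n monitoring get endpoints thanos-store-api -o yaml
    kubectl -n monitoring get pod -l app=thanos-query -o wide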
Unfortunately, we're not running on k8s yet. I'll check our config tomorrow, but it seems it's more related to the issues with the remote read endpoint. Anyway, I'm still waiting for a "release" version of v2.12.0-rc.0-rr-streaming in order to test it.
This is fixed with remote read streaming. Mem usage is now a few times lower: https://github.com/thanos-io/thanos/pull/1268
Upgrade to the latest Thanos version and enjoy this great improvement :smiley_cat:
Still waiting for the next prometheus release which includes streaming, right?
Yep, with the new Sidecar the RAM usage is constant and Prometheus uses noticeably less RAM due to how the new remote read interface works. @krasi-georgiev thanks for bumping this ticket! Closing this.
@alexdepalex aah yeah the PR got merged just after 2.12 was cut so yeah need to wait for 2.13 or just use the master image. https://github.com/prometheus/prometheus/commits/master?after=26e8d25e0b0d3459e5901805de992acf1d5eeeaa+34
Wait, or use the image we prepared, which is essentially 2.12 + the extended remote read protocol: quay.io/thanos/prometheus:v2.12.0-rc.0-rr-streaming
It's used in production already. (:
"High load on all prometheus nodes due to query fanout from thanos querier"
We have implemented just the Querier and Sidecar components of Thanos, to enable HA (fill gaps in Grafana dashboards) for a 2-node Prometheus setup. Since we are executing a very heavy query, the load was considerably high on one of the Prometheus nodes before implementing Thanos; now the Querier fans the query out to both nodes, creating a similarly high load on both compared to what was previously on just one. We have also increased the CPU cores to tackle the problem, but it isn't helping. Any leads will be highly appreciated.
prometheus, version 2.13.0; thanos, version 0.8.1
Thanos, Prometheus and Golang version used
What happened
The query provided below times out (after the default 2-minute query timeout), but before it does, Prometheus gets OOM-killed.
What you expected to happen
Either for the query to complete successfully, or to time out (assuming it's too complex) without bringing Prometheus down.
How to reproduce it (as minimally and precisely as possible):
We are using the following command to perform the query:
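The exact command was not preserved here; as a hedged reconstruction, such a query against the Thanos Query HTTP API would look roughly like this, with host and port as placeholders:

    # Instant query matching every series, the pattern mentioned above
    curl -G 'http://thanos-query:10902/api/v1/query' \
      --data-urlencode 'query={__name__=~".+"}'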
Running this once or twice always brings our nodes down. If we run the same query directly against Prometheus (not via Thanos Query), it completes successfully and quite fast (we store 24h of data on Prometheus).
Anything else we need to know
We are running independent Docker hosts with the following containers:
Thanos Compactor is running independently on a different host. We also tried running Thanos Store on a different host than the other containers.
One thing to note: if we have multiple nodes (let's say 4) with this configuration and put Thanos Query behind an Nginx-based load balancer, a query to one of the Thanos Query instances brings all 4 hosts down (due to all 4 Prometheuses getting OOM-killed).
Environment:
Storage:
We are not sure whether the issue is caused by Prometheus or Thanos. Maybe it's somehow directly related to our setup (for example, the join_joined_left_vs_right_timestamp_diff_bucket metric having a huge number of different labels, which results in a large number of different time series that cannot be handled when running the mentioned query). Anyhow, any guidance or tips would be really appreciated.