To add to this ticket:
When I perform a sum(sum(up{job="node_exporter"}) by (instance)), I can force the full list of results back to Promxy and then get it to correctly calculate the sum. Still not ideal, since it breaks a ton of current alerting/dashboards.
The report here is definitely concerning, as this is the primary/basic functionality of promxy. I haven't seen these sorts of issues in the tests or in my own deployment, so chances are something is off in the config or one of the downstreams is doing something screwy. Would you be able to provide a repro case? Ideally some containers or something I can stand up locally to reproduce the behavior. Alternatively, if you could provide the trace-level output with your config, we can attempt to figure out what happened from there.
@jacksontj I'll work on getting a reproducible example. It was concerning to see, but all of the downstream configs tend to be standard.
@jacksontj Here is a reproducible example. I ran my Promxy on the same docker host, outside of a container. repro-example.tar.gz. There are two separate Promxy configs included in here (one with both in the same ServerGroup and one with each in its own ServerGroup) in order to rule out any configuration issue on my end.
@jacksontj can we get some type of clarification on this? It would be helpful to know your thoughts on the above repro'd example :)
Sorry about the huge delay (it's been a busy time). So I got a few minutes to look at this today and I can reproduce the issue with the config you provided. I also understand what the issue is.
This "change in behavior" was introduced in https://github.com/jacksontj/promxy/pull/134 -- in that PR we moved from having the "server_group" be the aggregation/dedupe unit to having the labelset be the unique unit. This actually enables some very interesting sharded setups (e.g. https://medium.com/wish-engineering/horizontally-scaling-prometheus-at-wish-ea4b694318dd) but does come with the caveat that servergroups with no labels are treated as if they are the same (which makes sense, but was an unintended side-effect).
So, a fix here would be to have some "internal labels" that you can opt out of (if you need cross-SG aggregation). I'll have to look into how to do that in a not-so-painful way, but as a workaround, if you add any labels to make the different SGs unique, it'll solve your problem in the short term.
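For example, something along these lines (a minimal sketch of that workaround using the `labels` option on each server group; the `sg` label name and target addresses are placeholders, not from this thread):

```yaml
promxy:
  server_groups:
    # First SG gets a unique label so its series can't merge with the other SG
    - static_configs:
        - targets: ['prometheus-a:9090']
      labels:
        sg: a
    # Second SG gets a different value for the same label
    - static_configs:
        - targets: ['prometheus-b:9090']
      labels:
        sg: b
```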
cc @nfrush @jlarger
k, I have spent some more time thinking about this and I think I have an idea.
Unfortunately the option I had suggested initially (internal labels that get removed on return) won't work properly, since you'd have the possibility of label collisions in the end result. As an example, if 2 SGs both had `foo` (no labels) and they each added some internal label which was removed before returning, the return would have 2 `foo` series in it -- which is an invalid result.
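To illustrate (the `__sg__` label name here is just a hypothetical internal label):

```
# Both SGs return a series `foo` with no labels of its own.
# SG 1: foo{__sg__="1"}  -- strip __sg__ before returning -->  foo{}
# SG 2: foo{__sg__="2"}  -- strip __sg__ before returning -->  foo{}
# The merged response would contain two identical foo{} series, which is invalid.
```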
So at this point I think the best we can do in promxy is add some checks across the SGs for collisions/matching labels to warn/error to the user. As an example, if we added a flag `allowCrossServerGroupAggregation=false` (terrible name, but it would enforce that the servergroups don't aggregate across each other), we could return an error if the labels within the SG nodes match across SGs (meaning they could aggregate across). This check could cover both the `labels` and `relabel_config` options -- which would actually allow you to get some feedback on those `relabel_config` aggregations without the mostly guess-and-check we do today.
TL;DR: we need some labels to separate which SGs shouldn't be merged; best case, I can add some checks to warn/error in a situation like this one (2 separate SGs which are getting combined together).
Hi, @jacksontj. I'm running into a similar problem. The difference is that there is only one `server_group`, but it uses `dns_sd_configs` to discover all the instances behind a k8s headless service. The result of `sum(up{...})` is also not the same as that of `sum(sum by (instance) (up{...}))`.
Here is my configuration:
```yaml
server_groups:
  - scheme: http
    dns_sd_configs:
      - names:
          - "the-name-of-headless-svc"
        type: A
        port: 9090
```
I tried to add `relabel_configs` on each of the Prometheus instances behind it:

```yaml
scrape_configs:
  - relabel_configs:
      - source_labels: ["__address__"]
        target_label: "prometheus_shard"
        replacement: ${SHARD_NO}
```
However, it still does not work. Is this the correct workaround to make the different SGs unique, as you described in the previous comment? Or could this be a different case?
The relabel config would need to be added to the `server_group` section in promxy. If you put that same relabel config in the server group config, it should work.
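For example, something along these lines (a sketch assuming promxy applies a server group's `relabel_configs` to the discovered target labels, and reusing the headless-service discovery from the config above; with no explicit `replacement`, each instance's `__address__` is copied into `prometheus_shard`, keeping the instances distinct):

```yaml
server_groups:
  - scheme: http
    dns_sd_configs:
      - names:
          - "the-name-of-headless-svc"
        type: A
        port: 9090
    relabel_configs:
      - source_labels: ["__address__"]
        target_label: "prometheus_shard"
```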
Could you please tell me what I'm doing wrong? I have 2 servers with identical config files (prometheus.yml), and I collect metrics from the same servers. In Grafana I added two datasources - Prometheus (for testing) and Promxy (http://ip:8082) - but when I select the second one, I see data duplication. Here is my Prometheus config (the same on both servers); the only difference is in the label:
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'grafana'
    static_configs:
      - targets: ['192.168.3.2:3000']
        labels:
          host: ${HOSTNAME}
```
My promxy config:
```yaml
global:
  evaluation_interval: 15s
  external_labels:
    source: promxy

promxy:
  server_groups:
    - static_configs:
        - targets:
            - 192.168.3.3:9090
            - 192.168.3.4:9090
      anti_affinity: 10s
      scheme: http
      http_client:
        dial_timeout: 5s
        tls_config:
          insecure_skip_verify: true
```
I am also getting a similar issue. In my environment, I have two clusters, and two Prometheus instances are running for each cluster. Now, when I hit these two clusters' Prometheus endpoints with the query below, I get the values for each cluster correctly.
Query:
sum(rate(nginx_ingress_controller_requests{environment=~"abc-prd-w2|def-prd-w2"}[5m])) by (environment)
Result:
{environment="abc-prd-w2"} 183
{environment="def-prd-w2"} 2255
However, the query below returns the value of only one of the clusters instead of summing the values from both clusters.
Query:
sum(rate(nginx_ingress_controller_requests{environment=~"abc-prd-w2|def-prd-w2"}[5m]))
Result:
{} 183.13364550781847
In this case, to get the sum across all the clusters, I need to alter the query as below:
sum(sum(rate(nginx_ingress_controller_requests{environment=~"abc-prd-w2|def-prd-w2"}[5m])) by (environment))
Result:
{} 2438
Here is my configuration:
```yaml
apiVersion: v1
data:
  config.yaml: |
    global:
      evaluation_interval: 5s
      external_labels:
        source: promxy
    promxy:
      server_groups:
        - static_configs:
            - targets:
                - abc-prd-w2.prometheus.test.com
          scheme: http
        - static_configs:
            - targets:
                - def-prd-w2.prometheus.test.com
          scheme: http
```
@anoop2503 I believe this issue has been resolved (as I am unable to reproduce it), so I'll close out this issue. If you can still reproduce it, please re-open and we can dig in!
Per documentation, I have 4 servers (2 sets of 2), each scraping 10 targets. Let's say these targets are "target1" through "target10". When I try to do a sum query (i.e. "sum(up{job="node_exporter"})"), I get a value of 5 returned (although all 10 are up). When I do "sum(up{job="node_exporter"}) by (instance)", all targets are correctly returned.
I am running version 0.0.49 if that matters.