asPercent producing weird results

mzealey commented 4 years ago

This query works fine in graphite-web with go-carbon backend but not in carbonapi (0.13 from rpm):

asPercent(a.b.*.apache.main.apache_scoreboard.waiting, groupByNodes(a.b.*.apache.main.apache_scoreboard.*, "sum", 2))

It produces many values above 100, even though waiting is always in the 0-150 range and the result of groupByNodes, plotted by itself, has all points at 150.

I first thought it was to do with ordering not being preserved but it doesn't appear to be that. Any ideas?

Civil commented 4 years ago

I need to reproduce that on my side to understand what's wrong. It would be helpful if you can provide a bit more information about metrics:

How many metrics match those queries?
For one example metric that produces more than 100, could you provide some of the raw values for the waiting metric?
What's the time range for the query?
What's the retention schema for those metrics?

mzealey commented 4 years ago

If I replace node 2's * with a single point there is no difference.

There are about 10 properties under the final *. If I set the final * to waiting then it correctly returns 100 for each metric. If I change the final * to waiting and one other metric it returns crazy values.

So basically we can reduce this case to:

asPercent(a.b.c.apache.main.apache_scoreboard.waiting, groupByNodes(a.b.c.apache.main.apache_scoreboard.{waiting,open}, "sum", 2))

where open is 0-5 and waiting is 100-150

This applies over all time ranges (30min - 7 days)

retention for these files 10s:6h,5m:30d,30m:1y,60m:3y

mzealey commented 4 years ago

Hey, any idea why this is happening? It's really annoying to have to revert to graphite for some queries even though I can use CarbonAPI for 95% of them...

deniszh commented 4 years ago

@mzealey : btw, it's a bit buried in the documentation, but carbonapi has the ability to forward function calls to real graphite-web, see https://github.com/go-graphite/carbonapi/blob/main/cmd/carbonapi/graphiteWeb.example.yaml It's still not really convenient, but it may help with your migration.

Civil commented 4 years ago

@mzealey the problem is that I can't reproduce this behavior at all. For me graphite-web and carbonapi return the same results.

graphite-web:

$ wget -q -O- 'http://localhost:8080/render/?target=asPercent(a.waiting, groupByNodes(a.{open,waiting}, "sum", 0))&format=json'; echo
[{"target": "asPercent(a.waiting,a)", "tags": {"name": "asPercent(a.waiting,a)"}, "datapoints": [[100.0, 1], [99.09909909909909, 2], [98.21428571428571, 3], [98.21428571428571, 4], [98.0392156862745, 5]]}]

carbonapi:

$ wget -q -O- 'http://localhost:8081/render/?target=asPercent(a.waiting, groupByNodes(a.{open,waiting}, "sum", 0))&format=json'; echo
[{"target":"asPercent(a.waiting,groupByNodes(a.{open,waiting}, \"sum\", 0))","datapoints":[[100,1],[99.09909909909909,2],[98.21428571428571,3],[98.21428571428571,4],[98.0392156862745,5]],"tags":{"name":"a.waiting"}}]

Test data I'm using:

$ wget -q -O- 'http://localhost:8080/render/?target=a.{open,waiting}&format=json'; echo
[{"target": "a.open", "tags": {"name": "a.open"}, "datapoints": [[0.0, 1], [1.0, 2], [2.0, 3], [2.0, 4], [3.0, 5]]}, {"target": "a.waiting", "tags": {"name": "a.waiting"}, "datapoints": [[100.0, 1], [110.0, 2], [110.0, 3], [110.0, 4], [150.0, 5]]}]
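As a sanity check, the percentages in both responses can be recomputed by hand from this test data. The sketch below assumes that asPercent with a total series divides each point by the matching total point and multiplies by 100; the helper name as_percent is made up for illustration and is not code from graphite-web or carbonapi.

```python
# Test data copied from the render output above.
open_values = [0.0, 1.0, 2.0, 2.0, 3.0]
waiting_values = [100.0, 110.0, 110.0, 110.0, 150.0]


def as_percent(series, total):
    """Point-wise series / total * 100; None where a point is missing or the total is zero."""
    return [
        None if t in (None, 0) or s is None else s / t * 100.0
        for s, t in zip(series, total)
    ]


# groupByNodes(a.{open,waiting}, "sum", 0) collapses both series into one sum.
total_values = [o + w for o, w in zip(open_values, waiting_values)]
percent = as_percent(waiting_values, total_values)

for ts, value in enumerate(percent, start=1):
    print(ts, round(value, 4))
```

Because waiting is one of the summed series, every computed value stays at or below 100, which is what both responses above show.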

In case it is helpful: I'm running a fake backend (see cmd/mockbackend) with the following config:

$ cat asPercent.yaml 
listeners:
  - address: ":9070"
    expressions:
      "a.open":
        pathExpression: "a.open"
        data:
            - metricName: "a.open"
              values: [0,1,2,2,3]
      "a.waiting":
        pathExpression: "a.waiting"
        data:
            - metricName: "a.waiting"
              values: [100,110,110,110,150]
      "a.*":
        pathExpression: "a.*"
        data:
            - metricName: "a.waiting"
              values: [100,110,110,110,150]
            - metricName: "a.open"
              values: [0,1,2,2,3]
      "a.{open,waiting}":
        pathExpression: "a.{open,waiting}"
        data:
            - metricName: "a.waiting"
              values: [100,110,110,110,150]
            - metricName: "a.open"
              values: [0,1,2,2,3]

It can answer carbonapi_v2_pb, and pickle to graphite-web, so I'm pointing both of them at the same datasource and making the following request: /render/?target=a.{open,waiting}&format=json

Could you please verify if groupByNodes actually returns same results for your query?
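One way to do such a comparison without eyeballing JSON is to diff the datapoints of the two render responses. This is only a sketch: the function name datapoints_match is hypothetical, it assumes both responses list their series in the same order, and the sample bodies are shortened versions of the responses above with simplified target names.

```python
import json
import math


def datapoints_match(body_a, body_b, rel_tol=1e-9):
    """Compare two /render JSON bodies series by series, ignoring targets and tags."""
    series_a = json.loads(body_a)
    series_b = json.loads(body_b)
    if len(series_a) != len(series_b):
        return False
    for a, b in zip(series_a, series_b):
        points_a, points_b = a["datapoints"], b["datapoints"]
        if len(points_a) != len(points_b):
            return False
        for (value_a, ts_a), (value_b, ts_b) in zip(points_a, points_b):
            if ts_a != ts_b:
                return False
            if value_a is None or value_b is None:
                # A null point only matches another null point.
                if value_a is not value_b:
                    return False
            elif not math.isclose(value_a, value_b, rel_tol=rel_tol):
                return False
    return True


# Shortened sample bodies; target names are simplified for readability.
graphite_web = '[{"target": "asPercent(a.waiting,a)", "datapoints": [[100.0, 1], [99.09909909909909, 2]]}]'
carbonapi = '[{"target": "a.waiting", "datapoints": [[100, 1], [99.09909909909909, 2]]}]'

print(datapoints_match(graphite_web, carbonapi))
```

Running the same check on the plain groupByNodes target first would show whether the two stacks already disagree before asPercent is applied.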

mzealey commented 4 years ago

OK, this is the standard graphite backend (from the docker image, using go-carbon to pull data in, but that shouldn't affect anything). Graphite's graph (last 7 days):

CarbonAPI's graph:

Query for both (although everything to the first | should be enough to show the differences):

asPercent(a.b.xxx.apache.main.apache_scoreboard.waiting, groupByNodes(a.b.xxx.apache.main.apache_scoreboard.*, "sum", 2)) | scale(-1) | offset(100) | aliasByNode(2) | highestAverage(20)
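Reading the pipe chain left to right, scale(-1) followed by offset(100) turns a percentage p into 100 - p, so the graph plots the share of the total that is not waiting. A minimal sketch of those two steps, using made-up percentages rather than data from this issue:

```python
def scale(series, factor):
    """Multiply every non-null point by factor, mirroring what scale() does."""
    return [None if v is None else v * factor for v in series]


def offset(series, amount):
    """Add amount to every non-null point, mirroring what offset() does."""
    return [None if v is None else v + amount for v in series]


# Hypothetical asPercent output for one host; not data from this issue.
waiting_percent = [100.0, 97.5, None, 80.0]

not_waiting_percent = offset(scale(waiting_percent, -1), 100)
print(not_waiting_percent)
```

With this chain, any asPercent value above 100 shows up as a negative number on the final graph.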

Interestingly, I am switching the backend to clickhouse-graphite, and the carbonapi integration there returns the correct graphs. Switching graphite -> carbon to pb2 doesn't change anything though.

Here is a tgz with the wsp data sources used for this graph: test.tar.gz

mzealey commented 4 years ago

Also, not sure if it is the same or a different issue, but in another graph that uses divideSeriesLists the graphs look significantly different between graphite and carbonapi. However, when I select just a handful of metrics they look correct. Happy to raise a different ticket for that, but it's a bit trickier to reproduce.

Civil commented 4 years ago

So to clarify: you are currently using go-carbon as a backend and there you get wrong data, but once you switch to carbon-clickhouse it is correct?

In that case, could you please answer following questions:

Could you please share the configs for go-carbon (please also specify its version), including storage schemas and aggregation schemas?
How is graphite-web configured? I mean, how does it talk to go-carbon in that case? Or are you using the official Python carbon daemon for graphite-web?
I've noticed that you are querying data for multiple days. By any chance, do you cross any retention periods there? E.g. if you query data not for 7 days but for 6 days (or even for 1 day), would it change the results?

mzealey commented 4 years ago

I'm using docker image graphiteapp/graphite-statsd:master with GOCARBON=1. I didn't change any graphite-web config so believe it is just accessing the whisper files directly for querying.

Wrt (3): even if I only query the last 3 hours (per the schema config, the first retention period change is at 6h), it still shows wrong data.

Go-carbon config:

[whisper]
data-dir = "/opt/graphite/storage/whisper"
schemas-file = "/opt/graphite/conf/storage-schemas.conf"
aggregation-file = "/opt/graphite/conf/storage-aggregation.conf"
workers = 12
max-updates-per-second = 0
max-creates-per-second = 0
hard-max-creates-per-second = false
sparse-create = false
flock = true
enabled = true
hash-filenames = true
compressed = false
remove-empty-file = true

[cache]
max-size = 100000000 
write-strategy = "noop"

[udp]
listen = ":2003"
enabled = true
buffer-size = 0

[tcp]
listen = ":2003"
enabled = true
buffer-size = 0

[pickle]
listen = ":2004"
max-message-size = 67108864
enabled = true
buffer-size = 0

[carbonlink]
listen = "127.0.0.1:7002"
enabled = true
read-timeout = "30s"

Default agg:

[min]
pattern = \.lower$
xFilesFactor = 0.1
aggregationMethod = min

[max]
pattern = \.upper(_\d+)?$
xFilesFactor = 0.1
aggregationMethod = max

[sum]
pattern = \.sum$
xFilesFactor = 0
aggregationMethod = sum

[count]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = sum

[count_legacy]
pattern = ^stats_counts.*
xFilesFactor = 0
aggregationMethod = sum

[default_average]
pattern = .*
xFilesFactor = 0.3
aggregationMethod = average

...

[carbonserver]
listen = "0.0.0.0:8000"
enabled = true
buckets = 10
metrics-as-counters = true
read-timeout = "60s"
write-timeout = "60s"
query-cache-enabled = false
query-cache-size-mb = 0
find-cache-enabled = true
trigram-index = true
scan-frequency = "5m0s"
trie-index = false

max-globs = 100
fail-on-max-globs = false

max-metrics-globbed  = 30000
max-metrics-rendered = 1000

graphite-web-10-strict-mode = true
internal-stats-dir = ""
stats-percentiles = [99, 98, 95, 75, 50]

The matching storage schema which those whisper files should have been created with is:

[default]
pattern = .*
retentions = 10s:6h,5m:30d,30m:1y,60m:3y
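To see which archive a given time range would read from under this schema, the retention string can be expanded into step and duration in seconds. The sketch below assumes whisper-style selection (the finest archive whose duration covers the lookback window) and only handles the unit-suffixed form used here; the function names are hypothetical and this is not how carbonapi or go-carbon is implemented.

```python
# Seconds per unit; a year is counted as 365 days.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def to_seconds(token):
    """Convert a token such as '10s' or '30d' to seconds."""
    return int(token[:-1]) * UNIT_SECONDS[token[-1]]


def parse_retentions(spec):
    """Turn 'step:duration,...' into a list of (step_seconds, duration_seconds)."""
    archives = []
    for part in spec.split(","):
        step, duration = part.strip().split(":")
        archives.append((to_seconds(step), to_seconds(duration)))
    return archives


def archive_for_lookback(archives, lookback_seconds):
    """Return the finest archive whose duration covers the lookback window."""
    for step, duration in archives:
        if duration >= lookback_seconds:
            return step, duration
    # Nothing covers the window: fall back to the coarsest archive.
    return archives[-1]


archives = parse_retentions("10s:6h,5m:30d,30m:1y,60m:3y")
for label, lookback in [("3h", 3 * 3600), ("1d", 86400), ("7d", 7 * 86400)]:
    step, _ = archive_for_lookback(archives, lookback)
    print(label, "-> step", step, "seconds")
```

Under these assumptions a 3-hour query stays entirely in the 10s archive, while 1-day and 7-day queries both read the 5m archive, so the wrong values at 3 hours would not be explained by crossing a retention boundary.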

Civil commented 4 years ago

Would graphite-web's behavior change if you point to carbonserver, instead of reading wsp files directly?

According to the docs for the image, you can do that by setting GRAPHITE_CLUSTER_SERVERS="127.0.0.1:8000" (that's mentioned in https://hub.docker.com/r/graphiteapp/graphite-statsd/ in "Experimental Features").

And also could you please share carbonapi's config as well?

deniszh commented 4 years ago

Also, carbonlink should be disabled in such test with GRAPHITE_CARBONLINK_HOSTS=""

Civil commented 4 years ago

And by any chance, is this a single docker container, or do you have multiple go-carbon docker containers configured in carbonapi?

Civil commented 4 years ago

I've tried to reproduce your issue with the files you've provided, but also no luck:

graphite-web (current master) with CLUSTER_SERVER - returns exactly the same data
graphite-web (current master) pointed directly to whisper files - returns the same result as carbonapi does

I've used go-carbon from the current master.

So if I can't reproduce the issue I can't fix it. But overall it makes me think that it could be related either to the Docker image or to the software versions those Docker images are using, and not to carbonapi.

However, I found a small issue with how auto worked; in some cases it could have caused carbonapi to fail to start.

nikobearrr commented 4 years ago

I have also noticed some weird behaviour of asPercent().

In my case I have 2 metrics:

data.*.valid
data.*.invalid

I would like to find the % of the valid values from the total. So what I am doing is first sumSeries(), then asPercent(). I use grafana, so we have 3 series:

A: sumSeries(data.*.valid)
B: sumSeries(data.*.*valid)
C: asPercent(#A, #B)

I hide #A and #B and I get value of 400-500%, which surely is impossible. If I unhide one of the main series (#A or #B) I see the correct value.
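For reference, this is how the two globs split the metrics, which is why #B acts as the total. The sketch matches node by node so that * never crosses a dot; it uses Python's fnmatch as a stand-in for Graphite's glob rules (brace lists such as {a,b} are not handled), and the metric names are hypothetical.

```python
from fnmatch import fnmatchcase


def graphite_match(pattern, metric):
    """Match node by node so that '*' never crosses a '.' boundary."""
    pattern_nodes = pattern.split(".")
    metric_nodes = metric.split(".")
    if len(pattern_nodes) != len(metric_nodes):
        return False
    return all(fnmatchcase(m, p) for p, m in zip(pattern_nodes, metric_nodes))


# Hypothetical metric names following the data.*.valid / data.*.invalid layout.
metrics = [
    "data.host1.valid",
    "data.host1.invalid",
    "data.host2.valid",
    "data.host2.invalid",
]

series_a = [m for m in metrics if graphite_match("data.*.valid", m)]
series_b = [m for m in metrics if graphite_match("data.*.*valid", m)]
print(series_a)
print(series_b)
```

Since every metric in #A is also in #B, a correct asPercent(#A, #B) cannot exceed 100 as long as the values are non-negative.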

I would like to try to reproduce this with mock data. @Civil can you please give a guide on how to use the fake backend? Then I will try to provide you with an example of the case I have.

Civil commented 4 years ago

Example for the test: https://github.com/go-graphite/carbonapi/blob/main/cmd/mockbackend/testcases/i484/i484.yaml

Structure:

version: "v1" - config version in case I'll want to change something in future.

Query part of the test: test - the main section that describes how to perform the test

test:
    apps:
        - name: "carbonapi"
          binary: "./carbonapi"
          args:
              - "-config"
              - "./cmd/mockbackend/testcases/i484/carbonapi.yaml"

What to run before starting the test. This example uses its own test config; however, most of the tests I hope to have should use "./cmd/mockbackend/carbonapi_singlebackend.yaml" as a config, if that's possible.

    queries:
            - endpoint: "http://127.0.0.1:8081"
              delay: 1
              type: "GET"
              URL: "/render/?target=a.open&format=json"

queries define what will be sent to carbonapi (endpoint says where to look for it). URL is just a text field that contains the url-decoded version of the URL (it will be encoded anyway).

delay is a delay in seconds after previous query was finished (or since beginning of the test, in case it's first query).

For JSON

Theoretically it could have more than 1 query and to different endpoint, but I haven't tested that yet.

              expectedResponse:
                  httpCode: 200
                  contentType: "application/json"
                  expectedResults:
                          - metrics:
                                  - target: "a.open"
                                    datapoints: [[0,1],[1,2],[2,3],[2,4],[3,5]]

expectedResponse has all the characteristics of the response: content type, HTTP code and the metrics themselves.

target is how metric will be named

datapoints - actual data that will be returned. Format is value, timestamp.

Currently I have no support for checking tag values.

For graphs

If your case is related to png/svg rendering, the only way to verify the result that I came up with is to check the sha256 checksum:

              expectedResponse:
                  httpCode: 200
                  contentType: "image/svg+xml"
                  expectedResults:
                    - sha256:
                            - "6d9b18d1fe7264cc0ceb1aa319bf735d346f264bae058e0918d1e41437834aa7" # sha256(nodata svg) on Gentoo stable
                            - "33d0b579778e2e0bfdb7cf85cbddafe08f5f97b720e1e717d046262ded23cdf2" # sha256(nodata svg) on Ubuntu Xenial (travis-ci)

Unfortunately it heavily depends on fontconfig, and the sha256 might be different on different machines. So for a PR that contains such a test, I would ask you to provide an example png of the expected result with a short description of what is currently wrong, as those results will likely be different on my test system.
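The check itself amounts to hashing the response body and testing membership in the list of accepted digests. A minimal sketch of that idea follows; it is not mockbackend's actual code, the function name is hypothetical, and the SVG bytes are a placeholder rather than a real carbonapi response.

```python
import hashlib


def body_matches_any(body, accepted_digests):
    """Return True if the SHA-256 hex digest of body is one of the accepted digests."""
    return hashlib.sha256(body).hexdigest() in accepted_digests


# Placeholder bytes standing in for a rendered SVG; not a real carbonapi response.
rendered = b"<svg xmlns='http://www.w3.org/2000/svg'></svg>"
accepted = {hashlib.sha256(rendered).hexdigest()}

print(body_matches_any(rendered, accepted))
print(body_matches_any(rendered + b" ", accepted))
```

A single changed byte, for example from a different font, produces a different digest, which is why the example above lists one digest per platform.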

Example of test that checks svg image: https://github.com/go-graphite/carbonapi/blob/main/cmd/mockbackend/testcases/i503/i503.yaml

I have some plans to ignore some fields inside SVG, but I haven't implemented that yet.

Data

listeners:
  - address: ":9070"
    expressions:
      "a.open":
        pathExpression: "a.open"
        data:
            - metricName: "a.open"
              values: [0,1,2,2,3]

This defines what mockbackend will be able to return.

What is important here:

In expressions you need to list all possible queries that will be made towards the backend. "a.open" in this case is what will be specified in target, pathExpression: "a.open" is the pathExpression field in the response (most of the time it should match what was passed in target, so I will likely remove that field in the future), and data is the actual list of metrics that will be returned.

For the metrics format:
metricName - name of the metric in reply
values - values. Timestamp will be automatically calculated and by default will start from 1. For NaN you should use yaml's way to specify it which is .NaN.
startTime - override timestamp of first value.
step - override step (otherwise will be 1).
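Put together, values, startTime and step map onto the [value, timestamp] pairs seen in expectedResults. The sketch below assumes the timestamp of the i-th value is startTime + i * step, which matches the a.open example in this comment; expand_values is a hypothetical helper, not a mockbackend function.

```python
def expand_values(values, start_time=1, step=1):
    """Pair each value with its timestamp as [value, timestamp]."""
    return [[value, start_time + i * step] for i, value in enumerate(values)]


# Defaults: timestamps start at 1 and advance by 1, as in the a.open test case.
print(expand_values([0, 1, 2, 2, 3]))

# Overriding both fields, with made-up numbers.
print(expand_values([0, 1, 2], start_time=60, step=10))
```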

Another example: https://github.com/go-graphite/carbonapi/blob/main/cmd/mockbackend/testcases/pr500/pr500.yaml

How to run tests

There are several ways to do that:

e2e_test.sh - runs all of them
make mockbackend will compile mockbackend. You can run it with mockbackend -test -config ./cmd/mockbackend/testcases/i487/i487.yaml. If you want to get logs from carbonapi, you can start it manually and run mockbackend -test -noapp -config .... If you omit the -test flag, mockbackend will only reply to requests.

Current limitations

Metric reply support is incomplete as it cannot expand globs automatically
Supported protocols: carbonapi_v2_pb and pickle (you can run graphite-web against it). carbonapi_v3_pb is implemented but I haven't tried a lot of queries.

reyjrar commented 1 year ago

I have also noticed some weird behaviour of asPercent().

In my case I have 2 metrics:

* data.*.valid
* data.*.invalid

I would like to find the % of the valid values from the total. So what I am doing is first sumSeries() then asPercent() I use grafana, so we have 3 series

* #A: `sumSeries(data.*.valid)`
* #B: `sumSeries(data.*.*valid)`
* #C: `asPercent(#A, #B)`

I hide #A and #B and I get value of 400-500%, which surely is impossible. If I unhide one of the main series (#A or #B) I see the correct value.

I would like to try to reproduce this with mock data. @Civil can you please give a guide on how to use the fake backend? Then I will try to provide you with an example of the case I have.

I mentioned in #526 I am seeing the same thing. Carbonapi reporting 3-4 times the value as graphite-web produces for the same queries.
