go-graphite / carbonapi

Implementation of graphite API (graphite-web) in golang

[BUG] Alerts in grafana do not work when aliasByNode is used with divideSeries. #533

Open jdblack opened 3 years ago

jdblack commented 3 years ago

Describe the bug: Grafana alerts whose queries include aliasByNode and divideSeries worked with graphite-web but fail with carbonapi.

We are in the process of migrating our stack from
  grafana -> graphite-web -> (graphite-web & go-carbon) <- carbon-c-relay
to
  grafana -> carbonapi -> (go-carbon w/ carbonserver) <- carbon-c-relay

The following three series render without error in grafana with both graphite-web and carbonapi+carbonserver:

aliasByNode(divideSeries(kafka.*.cs-clouddetections.lag, #B), 1, 2)
sumSeries(cs-clouddetections.*.meter.clouddetections.kafka.events.received.one-minute)
aliasByNode(removeEmptySeries(kafka.*.cs-clouddetections.status.value), 1, 2, 3)

The following alert conditions, which worked with graphite-web, stop working when we switch to carbonapi & carbonserver:

WHEN last() of query(A,5m, now) is above 500
OR last() of query(C,5m,now) is above 1

Grafana logs show the following error:

t=2020-10-20T15:38:42+0000 lvl=info msg="Request failed" logger=tsdb.graphite status="500 Internal Server Error" body="Internal Server Error: error or no response: function=aliasByNode\n"

These alerts no longer fail if we remove aliasByNode from query A.
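
One way to reproduce this outside Grafana is to expand the `#B` reference by hand and build the render request for the alert window. The following is a minimal sketch, assuming Grafana substitutes `#B` with the text of query B before sending the request; `expand_refs`, `render_url`, and the host name are hypothetical names used only for illustration.

```python
import re
from urllib.parse import urlencode

# Queries A-C as listed above, keyed by their Grafana reference letter.
QUERIES = {
    "A": "aliasByNode(divideSeries(kafka.*.cs-clouddetections.lag, #B), 1, 2)",
    "B": "sumSeries(cs-clouddetections.*.meter.clouddetections.kafka.events.received.one-minute)",
    "C": "aliasByNode(removeEmptySeries(kafka.*.cs-clouddetections.status.value), 1, 2, 3)",
}


def expand_refs(target, queries):
    """Replace each #X reference with the text of query X (assumed Grafana behaviour)."""
    return re.sub(r"#([A-Z])", lambda m: queries[m.group(1)], target)


def render_url(base, target, window="5min"):
    """Build a Graphite render API URL covering the last `window` up to now."""
    params = urlencode(
        {"target": target, "from": "-" + window, "until": "now", "format": "json"}
    )
    return base.rstrip("/") + "/render?" + params


expanded = expand_refs(QUERIES["A"], QUERIES)
url = render_url("http://carbonapi.example:8080", expanded)  # placeholder host
print(expanded)
print(url)
```

Requesting the printed URL directly from carbonapi (for example with curl) should show whether the 500 response also occurs without Grafana in the path.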

Versions: Grafana 7.1.5, carbonapi 0.14.1-1, go-carbon 0.15.0, OS Ubuntu 16.04

Civil commented 3 years ago

Could you please provide the following information:

  1. Does this happen with the current master of carbonapi as well?
  2. Logs from carbonapi (unfortunately, Grafana logs are not usable for debugging carbonapi bugs).
  3. Could you please try to simplify the query? One way to do that is to send it to both graphite-web (with format=json) and carbonapi and find the step that produces different results.
  4. Is there anything specific about the data itself? The backend reply for the metrics themselves would help; if you cannot share it, any specifics are useful, such as whether it contains NaNs, Infs, or other weird values. Do all the metrics have the same aggregation schema? Does the query cross a retention boundary?
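
The comparison suggested in point 3 can be scripted. The sketch below is only an illustration: it assumes both backends expose the standard `/render` endpoint with `format=json`, and the helper names (`fetch`, `summarize`, `diff`, `first_difference`) and the host names in the usage note are made up for this example. It queries each sub-expression from the innermost outwards and reports the first one where the two replies disagree, counting missing or non-finite values along the way.

```python
import json
import math
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch(base, target, frm="-5min", until="now"):
    """Fetch one target from a render endpoint and return the parsed JSON."""
    params = urlencode({"target": target, "from": frm, "until": until, "format": "json"})
    with urlopen(base.rstrip("/") + "/render?" + params, timeout=30) as resp:
        return json.load(resp)


def summarize(series_list):
    """Map series name -> (point count, missing or non-finite count, step)."""
    out = {}
    for s in series_list:
        pts = s["datapoints"]  # list of [value, timestamp] pairs
        bad = sum(1 for v, _ in pts if v is None or not math.isfinite(v))
        step = pts[1][1] - pts[0][1] if len(pts) > 1 else None
        out[s["target"]] = (len(pts), bad, step)
    return out


def diff(a, b):
    """Return human-readable differences between two render replies."""
    sa, sb = summarize(a), summarize(b)
    problems = []
    for name in sorted(set(sa) | set(sb)):
        if name not in sa or name not in sb:
            side = "second" if name not in sa else "first"
            problems.append(f"{name}: only in {side} reply")
        elif sa[name] != sb[name]:
            problems.append(f"{name}: (points, bad, step) {sa[name]} vs {sb[name]}")
    return problems


def first_difference(base_a, base_b, steps, fetch=fetch):
    """Return (target, problems) for the first step that differs, else None."""
    for target in steps:
        try:
            problems = diff(fetch(base_a, target), fetch(base_b, target))
        except HTTPError as err:  # e.g. the 500 seen in the Grafana log
            problems = [f"HTTP {err.code}: {err.reason}"]
        if problems:
            return target, problems
    return None
```

A possible invocation, with placeholder hosts, passes the steps of query A from the inside out: `first_difference("http://graphite-web.example", "http://carbonapi.example:8080", ["kafka.*.cs-clouddetections.lag", "divideSeries(kafka.*.cs-clouddetections.lag, sumSeries(...))", ...])`, with each `...` filled in from the queries listed in the report. Series names may legitimately be formatted differently by the two implementations, so an "only in" line is a hint to look closer rather than proof of a bug.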