leikahing opened this issue 7 years ago
My first suspicion is that carbonserver is returning series with mixed density, as your request is right on a rollup boundary. This would frequently break the logic in the internal aggregateSeries func, which expects all series to be of consistent length.

If you make it `from=-8d`, does the problem persist?
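To illustrate the failure mode (this is a toy sketch, not carbonapi's actual aggregateSeries code), here is what a length-assuming, element-wise aggregation does when handed series of mixed density:

```go
package main

import "fmt"

// toySum is a deliberately naive element-wise sum that, like an aggregator
// assuming every input series has the same length, indexes each series by
// the length of the first one. Illustration only, not carbonapi code.
func toySum(series [][]float64) []float64 {
	out := make([]float64, len(series[0]))
	for _, s := range series {
		for i := range out {
			out[i] += s[i] // panics with "index out of range" when len(s) < len(out)
		}
	}
	return out
}

func main() {
	// One series still at its raw resolution, one already rolled up:
	// mixed densities give different lengths for the same time window.
	raw := []float64{1, 1, 1, 1, 1}
	rolledUp := []float64{5} // shorter, lower-resolution series
	fmt.Println(toySum([][]float64{raw, rolledUp}))
}
```

Running this panics with the same kind of `runtime error: index out of range` reported later in this issue.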
@nnuss - I did suspect that it might be related to the aggregation boundary. If I do `-8d` it works consistently and returns the data every time. Same if I do anything up to the aggregation window (like `-6d` or `-5d`), or if I keep going out further like `-30d`.

Attempting to query data at the rollup periods (`-1d`, `-7d`, `-1y`), I can consistently reproduce the problem of either getting data back or getting back `[]`.
That strongly suggests to me that carbonserver is racing with the whisper writes and/or time skew on the backend(s).
@Civil is there any go-carbon / carbonserver tuning to help here?
@birryree how far beyond `-7d` is required to get a consistent view? `-7d1s`? `-7d1min`?
We would like carbonapi to resolve this but it requires more meta information to be available during render: retention bands, aggregation function, and x-files factor for each series.
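As a rough sketch of the kind of per-series metadata that would have to travel with each render response, the types below are hypothetical (they are not an existing carbonzipper or carbonserver structure):

```go
package meta

// RetentionBand describes one archive of a whisper file.
type RetentionBand struct {
	SecondsPerPoint int // step of this archive, e.g. 60
	Points          int // number of points retained at this step
}

// SeriesMeta is the extra information carbonapi would need per series in
// order to merge mixed-resolution data safely. Field names are illustrative.
type SeriesMeta struct {
	Retentions          []RetentionBand // full retention schedule
	AggregationFunction string          // "average", "sum", "max", ...
	XFilesFactor        float32         // fraction of non-null points required to aggregate
}
```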
First of all, try using go-carbon's carbonserver. It has had support for that since 0.9.0 and it's also covered by some tests, so it will be easier to fix the error there (if it isn't already fixed there). Also, you can run it not as a separate instance, but as part of your main go-carbon.
@nnuss If I modify the first query seen in the stack trace log entry above and use `from=-7d1s`, it consistently returns results. I wrote a shell script to hit the API hundreds of times, and every time a non-empty data list was returned. If I modify my script to use `from=-7d`, the results are back to alternating between `[]` and non-empty again.
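For reference, a rough Go equivalent of that kind of repeated-query check (the URL and target below are placeholders; substitute your own carbonapi endpoint and metric, and switch `from=-7d` to `from=-7d1s` to compare the two behaviours):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Placeholder endpoint; point this at your own carbonapi render URL.
	const url = "http://localhost:8080/render/?format=json&from=-7d&target=some.metric.*"

	empty := 0
	for i := 0; i < 100; i++ {
		resp, err := http.Get(url)
		if err != nil {
			fmt.Println("request failed:", err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		if string(body) == "[]" { // empty JSON list, i.e. no data returned
			empty++
		}
	}
	fmt.Printf("empty responses: %d/100\n", empty)
}
```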
I checked the backend servers - every server involved runs `ntpd` and is synchronized to the same server pool. Checking their dates, they all report the same time.
@Civil I was originally trying to run go-carbon's carbonserver when I set this stack up, but could never get it running - `netstat` didn't show anything listening on the port I configured it to run on. I don't see anything in the logs, even with `debug` logging.
I'm running the latest 0.9.0 release via .deb package - my configuration for go-carbon follows:
```toml
[common]
user = "carbon"
logfile = "/var/log/go-carbon/go-carbon.log"
log-level = "debug"
graph-prefix = "carbon.agents.eu-west-1.graphite-a"
metric-endpoint = "tcp://carbonrelay.internal:2003"
max-cpu = 3
metric-interval = "1m0s"

[whisper]
data-dir = "/mnt/data/whisper/"
schemas-file = "/etc/go-carbon/storage-schemas.conf"
aggregation-file = "/etc/go-carbon/storage-aggregation.conf"
workers = 8
max-updates-per-second = 0
sparse-create = false
enabled = true

[cache]
max-size = 1000000
write-strategy = "max"

[udp]
listen = ":2003"
enabled = true
log-incomplete = false
buffer-size = 0

[tcp]
listen = ":2003"
enabled = true
buffer-size = 0

[pickle]
listen = ":2004"
max-message-size = 67108864
enabled = true
buffer-size = 0

[carbonlink]
listen = "127.0.0.1:7002"
enabled = true
read-timeout = "30s"

[carbonserver]
listen = "127.0.0.1:8080"
enabled = true
buckets = 10
max-globs = 1000
metrics-as-counters = false
read-timeout = "5m0s"
write-timeout = "5m0s"
scan-frequency = "10m0s"

[dump]
enabled = false
path = ""
restore-per-second = 0

[pprof]
listen = "localhost:7007"
enabled = false
```
Interesting. Actually, if carbonserver support is enabled, it should print at least "[carbonserver] carbonserver support is still experimental, use at your own risk" in the logs. If it doesn't, there may be a problem with your deb package and what you've got is not 0.9. It might also be worth building it manually and seeing whether the carbonserver module works then.
@Civil As it turns out, when I execute `sudo service go-carbon stop` or `sudo service go-carbon restart`, it wasn't actually stopping/restarting the go-carbon daemon, so it was never re-reading my configuration and starting the carbonserver. I had to do a manual `kill` and bring the service back up for it to recognize my configuration changes and start carbonserver.
I'm running the current go-carbon off the head of `master`, and I've seen a slight improvement in some queries. I can now do this hardware scan of thousands of whisper files (with some success):
format=json&from=-7d&maxDataPoints=157&target=timeShift(sumSeries(consolidateBy(orp.{prod}.{ap-northeast-1,ap-northeast-1b,eu-west,eu-west-1,us-east,us-east-1}.hw.*.hostup,+'sum')),+'120s')&until=now
However, I'm still running into the issue of queries against aggregation boundaries sometimes returning `[]`.
I just spun this up today to test it, based off the 0.9.0 go-carbon and the latest master of zipper/api. sumSeries seems to break. I've attached the debug output from carbonzipper and the debug line from carbonserver1; I'm not sure how to interpret this.

These queries are for 3 hours, and the data most likely has gaps due to some UDP loss. I wonder if it's the nulls causing the issue? The best feature in the world, `keepLastValue`, does not seem to solve the issue the way it normally does with graphite!
My first aggregation block is at the 6 hour window. So it breaks on raw data for me.
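On the null question, this is roughly the forward-fill that graphite's keepLastValue performs (a minimal sketch ignoring its optional limit argument; illustrative only, not carbonapi's implementation):

```go
package fill

import "math"

// keepLastValue replaces NaN gaps with the most recent non-NaN value.
func keepLastValue(values []float64) []float64 {
	out := make([]float64, len(values))
	last := math.NaN()
	for i, v := range values {
		if math.IsNaN(v) {
			out[i] = last // stays NaN until the first real value is seen
		} else {
			out[i] = v
			last = v
		}
	}
	return out
}
```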
Well, that's interesting - I just realized one of the files has an old aggregator setting (as shown in the debug logs above). Legacy graphite handles this just fine. However, it looks like the problem occurs in carbonapi when the aggregations between two retrieved points don't match. I went in and whisper-resized the collectd tree, and now carbonapi seems consistent for me. But that doesn't explain the other issues listed here, unless in some cases carbonserver is not sending the same number of datapoints?
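A minimal sketch (not carbonapi's code) of what merging series with mismatched steps requires: consolidate the finer series down to the coarser step first, then combine point by point.

```go
package consolidate

// consolidateSum collapses every `factor` consecutive points into one by
// summing them, e.g. turning a 60s-step series into 300s buckets so it can
// be combined with a series that has already been rolled up to 300s.
func consolidateSum(values []float64, factor int) []float64 {
	out := make([]float64, 0, (len(values)+factor-1)/factor)
	for i := 0; i < len(values); i += factor {
		var sum float64
		for j := i; j < i+factor && j < len(values); j++ {
			sum += values[j]
		}
		out = append(out, sum)
	}
	return out
}
```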
This is occurring in the latest `master`, commit 9a17fbb.

I have been noticing some problems, with a number of queries resulting in `runtime error: index out of range` messages in carbonapi and carbonzipper. carbonzipper is configured to connect to a carbonserver on a separate server (a go-carbon box sitting on the same network).

carbonapi daemon parameters:

carbonzipper configuration:

carbonserver daemon parameters:

When I run queries for time periods like the last 15 minutes, the last 30 minutes, or the last 6 hours, things run fine.
When I expand my query out to last 7 days, that's when I start noticing this behavior.
I will get the following panic on certain queries in the `carbonapi` log file:

Retrying the query multiple times sometimes succeeds, but most of the time it fails and returns an empty list.
Another query I'm seeing this with is:
Here, the wildcard `*` actually expands out to 5000+ unique subfolders (server metrics for every server the service has or has ever had running). At the 7-day mark and beyond, I can't get data for that query at all.

All whisper files in question have the following storage schema set on them:
I have another server running an old version of Graphite (0.9.12), where all my metrics are mirrored, that is serving up these queries just fine, so I'm wondering where my configuration might be falling short.