graphite-project / graphite-web

A highly scalable real-time graphing system
http://graphite.readthedocs.org/
Apache License 2.0

datapoints value returned on cache.log #2423

Closed JosephFY closed 5 years ago

JosephFY commented 5 years ago

Hello, I'm using graphite_web-1.1.3-py2.7.egg-info.

I am pushing data every minute. I have a Grafana dashboard with a time range of the last 5 minutes. The query inspector in Grafana shows only 5 datapoints returned when I refresh the dashboard.

Every time I refresh the dashboard for these metrics in Grafana, the datapoint count reported in cache.log increments by one:

CarbonLink cache-query request for xxxx.xxxx.xxx returned **27** datapoints

I am setting up a big query and it's taking a long time (which is expected). My concern is whether more data than necessary is being requested, making the query take longer than it should.

Is this the intended behaviour? It just doesn't make sense to me.

piotr1212 commented 5 years ago

Not sure, but I think carbonlink returns all datapoints in the cache. If the number increases, that would mean your cache size is increasing. Check whether the cache size is growing in graphite's internal carbon metrics. If so, you could try increasing MAX_UPDATES_PER_SECOND in your carbon config.
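For reference, that setting lives in the [cache] section of carbon.conf; the value below is only an example and should be tuned to what your disks can sustain:

```
# carbon.conf (example value only)
[cache]
# Upper bound on whisper file writes per second; raising it lets the
# in-memory cache drain faster at the cost of more disk I/O.
MAX_UPDATES_PER_SECOND = 500
```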

DanCech commented 5 years ago

Is your carbon-cache able to flush data points to disk? CarbonLink is responsible for returning points that have not yet been flushed to disk, so if that number is going up then it would indicate that there is a problem preventing carbon-cache from updating the whisper files.
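One way to check is to graph carbon's own self-instrumentation; assuming the default carbon.agents.* prefix, a render request like the one below shows whether the cache keeps growing:

```
# A steadily growing cache.size means points arrive faster than
# carbon-cache can flush them to the whisper files.
/render?target=carbon.agents.*.cache.size&from=-1h&format=json
```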

JosephFY commented 5 years ago

I am actually doing some testing on this box before going to prod, so let me explain a little. I am intentionally lowering MAX_UPDATES_PER_SECOND = 50 so that the data is served from CarbonLink only, i.e. from memory; I wanted the results to come back as fast as possible. I am sending around 2.2 million metrics/minute to this box and it has been steady with no issues for about 2 weeks.

I have 8 carbon-cache instances [a-h]; each has a cache size of ~12 million and ~100 points per update.

This is the graph: [screenshot]

Query inspector in Grafana: [screenshot]

Results on cache.log

2019-02-04,13:09:10.619 :: CarbonLink cache-query request for xxxxxxxx returned 15 datapoints
2019-02-04,13:09:49.201 :: CarbonLink cache-query request for xxxxxxxx returned 16 datapoints
2019-02-04,13:10:11.215 :: CarbonLink cache-query request for xxxxxxxx returned 16 datapoints
2019-02-04,13:10:15.786 :: CarbonLink cache-query request for xxxxxxxx returned 16 datapoints
2019-02-04,13:10:23.640 :: CarbonLink cache-query request for xxxxxxxx returned 16 datapoints
2019-02-04,13:12:13.741 :: CarbonLink cache-query request for xxxxxxxx returned 18 datapoints
2019-02-04,13:12:31.016 :: CarbonLink cache-query request for xxxxxxxx returned 18 datapoints
2019-02-04,13:12:39.613 :: CarbonLink cache-query request for xxxxxxxx returned 19 datapoints
2019-02-04,13:12:48.084 :: CarbonLink cache-query request for xxxxxxxx returned 19 datapoints
2019-02-04,13:12:54.276 :: CarbonLink cache-query request for xxxxxxxx returned 19 datapoints
2019-02-04,13:13:18.224 :: CarbonLink cache-query request for xxxxxxxx returned 19 datapoints
2019-02-04,13:13:28.463 :: CarbonLink cache-query request for xxxxxxxx returned 19 datapoints
2019-02-04,13:18:28.642 :: CarbonLink cache-query request for xxxxxxxx returned 24 datapoints
2019-02-04,13:20:51.593 :: CarbonLink cache-query request for xxxxxxxx returned 27 datapoints
2019-02-04,13:29:19.423 :: CarbonLink cache-query request for xxxxxxxx returned 35 datapoints
2019-02-04,13:29:27.626 :: CarbonLink cache-query request for xxxxxxxx returned 35 datapoints
2019-02-04,13:29:39.134 :: CarbonLink cache-query request for xxxxxxxx returned 36 datapoints
2019-02-04,14:03:58.352 :: CarbonLink cache-query request for xxxxxxxx returned 70 datapoints
2019-02-04,14:04:08.129 :: CarbonLink cache-query request for xxxxxxxx returned 70 datapoints
2019-02-04,14:05:19.516 :: CarbonLink cache-query request for xxxxxxxx returned 71 datapoints

Multiple duplicate values (like returned 70 datapoints) occur when I refresh multiple times within the same minute and no new metrics have come in yet.

DanCech commented 5 years ago

Ok, so you are seeing what you'd expect: carbonlink returning the points for the requested series that haven't been flushed to disk yet.

JosephFY commented 5 years ago

Yeah, but my concern is: why is it showing 71 datapoints when the graph has only 5?

DanCech commented 5 years ago

Because the carbonlink protocol is quite simple: graphite-web asks carbon-cache for all the datapoints it hasn't flushed to disk yet for series X, and carbon-cache returns them. You're artificially creating a scenario that in production would indicate a problem with your carbon-cache configuration, and in any case the difference between returning 5 or 50 points is going to be lost in the noise of TCP overhead and round-trip latency.
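As a rough illustration only (not the exact graphite-web code, which lives in carbonlink.py and may differ in detail), the exchange is essentially a length-prefixed pickled request and reply, and the reply always carries every buffered point for the series:

```python
# Rough sketch of a CarbonLink-style cache query, for illustration only.
import pickle
import socket
import struct

def _recv_exactly(conn, n):
    """Read exactly n bytes from the socket."""
    buf = b''
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError('connection closed mid-response')
        buf += chunk
    return buf

def cache_query(host, port, metric):
    """Ask one carbon-cache instance for every datapoint it still holds in
    memory for `metric`, i.e. everything not yet flushed to whisper."""
    request = pickle.dumps({'type': 'cache-query', 'metric': metric}, protocol=2)
    with socket.create_connection((host, port)) as conn:
        # Requests and responses are length-prefixed pickles.
        conn.sendall(struct.pack('!L', len(request)) + request)
        (length,) = struct.unpack('!L', _recv_exactly(conn, 4))
        response = pickle.loads(_recv_exactly(conn, length))
    # The cache returns *all* buffered points for the series, regardless of the
    # time range the dashboard asked for; graphite-web trims them afterwards.
    return response['datapoints']  # list of (timestamp, value) tuples
```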

JosephFY commented 5 years ago

I see, thanks @DanCech for clarifying. I was assuming the number should match the exact number of datapoints I'm querying for; it looks like it's returning all the values for series X in the cache, as you described. It's clear now.

I know this should be a separate question, but the reason I dug through cache.log was to improve query times on our metrics. We have ~10K servers and our metrics look like

servers.serverX.status, and it returns either 0 or 1 depending on server availability.

Often we need to apply different functions to those metrics (and others). If we do servers.serverX.status|currentBelow(0.9)|countSeries(), the query takes 10-15 seconds to finish, and I'm almost sure all the data is actually being queried from CarbonLink, i.e. from memory.
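(For what it's worth, the piped form should be equivalent to the nested-function form below; the servers.*.status wildcard is just an assumption about how the 10K series are actually selected:)

```
# Hypothetical nested-function equivalent of the piped query above
target=countSeries(currentBelow(servers.*.status, 0.9))
```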

We have a VM with 16 cores, 32 GB of memory, and SSD storage (5K IO), using WSGIDaemonProcess graphite processes=8 threads=16 display-name='%{GROUP}' inactivity-timeout=120 in graphite-vhost.conf.

I know mileage may vary, but what would you suggest to speed up these types of queries? Especially since we need to list those currentBelow() results in a table in Grafana, or apply aggregate('count') for historical graphs. I turned on memcached, and that helped when someone asks for the exact same query, but it doesn't help when the query has to be computed for the first time on a Grafana dashboard.

Any advice?

piotr1212 commented 5 years ago

How many caches are you running?

JosephFY commented 5 years ago

I have 8 carbon-cache instances [a-h]; each has a cache size of ~12 million and ~100 points per update.

piotr1212 commented 5 years ago

I'd double that, to match the number of cores; they'd have more time to respond to your queries. You've now got 128 graphite-web threads (8 processes × 16 threads) querying 8 carbon-cache processes (each runs more than one thread, but only one can execute at a time).

Or try out go-carbon.
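Roughly, each extra instance means another [cache:X] section in carbon.conf (with its own ports) plus a matching entry in graphite-web's CARBONLINK_HOSTS; the instance names and ports below are just placeholders:

```
# carbon.conf -- one section per additional cache instance (example ports)
[cache:i]
LINE_RECEIVER_PORT = 2103
PICKLE_RECEIVER_PORT = 2104
CACHE_QUERY_PORT = 7102

# graphite-web local_settings.py -- list every instance as host:port:instance
CARBONLINK_HOSTS = ["127.0.0.1:7002:a", "127.0.0.1:7012:b", "127.0.0.1:7102:i"]
```

The relay in front of the caches also needs to be told about the new instances so incoming metrics are routed to them.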

JosephFY commented 5 years ago

I will try doubling and get back with results. So you think the bottleneck is on the carbon side, not the graphite-web side?

Do you think adding more processes/workers for WSGIDaemonProcess would help? Is go-carbon ready for a production environment? I take it it would only replace carbon-cache, not graphite-web and whisper?

piotr1212 commented 5 years ago

> I will try doubling and get back with results. So you think the bottleneck is on the carbon side, not the graphite-web side?

I don't know; just run top and see how much CPU they use. Due to Python's GIL they cannot use more than one core.

> Do you think adding more processes/workers for WSGIDaemonProcess would help?

You only know when you try, but my guess would be no. 128 seems like a lot already.

> Is go-carbon ready for a production environment?

Booking has been using it in production for some years already; they store a few million metrics/second.