graphite-project / carbon

Carbon is one of the components of Graphite, and is responsible for receiving metrics over the network and writing them down to disk using a storage backend.
http://graphite.readthedocs.org/
Apache License 2.0

Bucketmax write strategy #879

Closed · piotr1212 closed this 4 years ago

piotr1212 commented 4 years ago

Bucketmax writes the metric with the most datapoints, just as 'max' does, but uses a different approach to determine which metric that is.

At each write operation, 'max' loops once over the list of metrics and picks the metric with the most datapoints.
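
As an illustrative sketch (the `choose_max` helper and its `cache` argument are hypothetical, not carbon's actual code), that scan amounts to:

```python
# O(n) pass over the whole cache on every write operation.
def choose_max(cache):
    # cache: dict mapping metric name -> list of cached datapoints
    return max(cache.items(), key=lambda item: len(item[1]))[0]
```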

'bucketmax' stores all metrics in a list of buckets, where the bucket index reflects how many datapoints each metric in that bucket holds; for example, buckets[5] contains a list of all metrics that currently have 6 datapoints. At each write operation, 'bucketmax' selects the first metric from the largest bucket.

The time complexity of 'max' is linear; 'bucketmax' is near constant time, but it runs on every ingestion, whereas 'max' only runs on every write. There is also some overhead in resizing the buckets list. I've done some limited testing, and 'bucketmax's CPU usage was almost half that of 'max'. The constraint of bucketmax is that the number of datapoints per metric must be relatively low compared to the total number of unique metrics (which I think is the most common use case in Graphite). I don't know where the tipping point is, but if one has, for example, 10 metrics with 100,000 datapoints each, I would assume 'bucketmax' would perform worse than 'max'.
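
A minimal sketch of that bookkeeping, for illustration (class and method names here are assumptions; the real code is in this PR's diff):

```python
# Minimal sketch of the 'bucketmax' bookkeeping (hypothetical names; see
# the PR diff for the real code). buckets[i] holds every metric that
# currently has i + 1 cached datapoints.

class BucketMaxSketch:
    def __init__(self):
        self.buckets = []  # list of sets of metric names

    def on_store(self, metric, new_size):
        # Runs on every ingested datapoint: move the metric from its old
        # bucket (new_size - 1 points) into the one for new_size points.
        if new_size > 1:
            self.buckets[new_size - 2].discard(metric)
        while len(self.buckets) < new_size:
            self.buckets.append(set())  # the resize overhead mentioned above
        self.buckets[new_size - 1].add(metric)

    def pop_max(self):
        # Runs on every write: take any metric from the largest non-empty
        # bucket in near-constant time, instead of an O(n) scan.
        while self.buckets and not self.buckets[-1]:
            self.buckets.pop()
        return self.buckets[-1].pop() if self.buckets else None
```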

deniszh commented 4 years ago

Great idea, I like that, @piotr1212! To me this looks like a much more proper 'max' strategy, at least for the typical use case.

DanCech commented 4 years ago

I don't see where the metric is removed from the associated bucket when it is flushed?

I'm also concerned about the impact on memory usage of maintaining the buckets, since you're going to have a second copy of all the metric names in the bucket data structure.

piotr1212 commented 4 years ago

> I don't see where the metric is removed from the associated bucket when it is flushed?

It is being popped here: https://github.com/graphite-project/carbon/pull/879/files#diff-b95553eddf4a9a25750a9a1d17f2be0cR161

> I'm also concerned about the impact on memory usage of maintaining the buckets, since you're going to have a second copy of all the metric names in the bucket data structure.

Good point, I'll see if I can store pointers instead of the full string somehow.

piotr1212 commented 4 years ago

It seems that Python does not copy the metric name but just stores a reference to the string object.

See below for a small simulation: the refcount of the metric name increases when I add it to the buckets, and the buckets list grows by less than the size of the string itself.

```
In [62]: a_metric_name = 'a very long string representing a metric name in graphite, foobar, some more text, even more and more and more and more'

In [63]: sys.getsizeof(a_metric_name)
Out[63]: 168

In [64]: sys.getrefcount(a_metric_name)
Out[64]: 2

In [65]: a_metric_cache = dict()

In [66]: sys.getsizeof(a_metric_cache)
Out[66]: 248

In [67]: a_metric_cache[a_metric_name] = [1, 2, 3, 4, 5]

In [68]: sys.getsizeof(a_metric_cache)
Out[68]: 248

In [69]: sys.getrefcount(a_metric_name)
Out[69]: 3

In [70]: buckets = list()

In [71]: sys.getsizeof(buckets)
Out[71]: 72

In [72]: buckets.append(a_metric_name)

In [73]: sys.getsizeof(buckets)
Out[73]: 104

In [74]: sys.getrefcount(a_metric_name)
Out[74]: 4
```

DanCech commented 4 years ago

That's good to know. The only other comment I'd have is about how we're now mingling logic between the strategy and the cache, so the cache contains strategy-specific code.

One thought was that the buckets could be defined within the strategy, with a method on the strategy called each time we add a new point so it can do any housekeeping. That would remove the strategy-specific code from the cache.
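
For illustration, a sketch of that shape (names loosely modeled on carbon's cache module but hypothetical; the final code may differ):

```python
# The cache stays strategy-agnostic: it only notifies the strategy when a
# datapoint is stored, so all bucket housekeeping lives in the strategy.

class DrainStrategy:
    def __init__(self, cache):
        self.cache = cache

    def store(self, metric):
        """Per-datapoint housekeeping hook; a no-op for strategies like
        'max' that only do work at write time."""

    def choose_item(self):
        raise NotImplementedError  # which metric to drain next

class MetricCache(dict):
    def __init__(self, strategy_class):
        super().__init__()
        self.strategy = strategy_class(self)

    def store(self, metric, datapoint):
        self.setdefault(metric, []).append(datapoint)
        self.strategy.store(metric)  # no strategy-specific code in the cache
```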

lgtm-com[bot] commented 4 years ago

This pull request introduces 2 alerts when merging 84f7907157c9ef9a32f9d8a598364ecd0ac9302f into 65127e539ddc219f6cc7eda67fc7806c6a3e4717 - view on LGTM.com


piotr1212 commented 4 years ago

> That's good to know. The only other comment I'd have is about how we're now mingling logic between the strategy and the cache, so the cache contains strategy-specific code.
>
> One thought was that the buckets could be defined within the strategy, with a method on the strategy called each time we add a new point so it can do any housekeeping. That would remove the strategy-specific code from the cache.

Thank you for the suggestion. Updated it, code looks much cleaner now.

piotr1212 commented 4 years ago

Squashed the commits.

deniszh commented 4 years ago

Cool! I'll try to run it on https://github.com/graphite-project/graphite-test-docker this weekend.

deniszh commented 4 years ago

OK, I made the graphs below on my MBP 2017 (Docker had 4 cores / 6 GB RAM) using https://github.com/graphite-project/graphite-test-docker. haggar generated the load with -agents=10 -metrics=10000, i.e. 100K metrics.

```
MAX_CACHE_SIZE = Inf
MAX_UPDATES_PER_SECOND = 5000
MAX_CREATES_PER_MINUTE = 50
```

With `CACHE_WRITE_STRATEGY = max`:

[Screenshots: three monitoring graphs from the 'max' run, 2020-02-08]

With `CACHE_WRITE_STRATEGY = bucketmax`:

[Screenshots: three monitoring graphs from the 'bucketmax' run, 2020-02-08]

I do not see much increased memory consumption or anything unusual. I think it's OK to merge.

piotr1212 commented 4 years ago

But CPU is nearly identical and memory usage is higher, which is quite different from what I measured. I'll do some more testing.

piotr1212 commented 4 years ago

50k metrics every 10s; the metrics already existed.

carbon.conf:

```
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 500
MAX_CREATES_PER_MINUTE = 1000000
```

The load generator is https://github.com/Civil/graphite_perf_test_go:

```
~/go/bin/graphite_perf_test_go -host 127.0.0.1:2003 -connections 50 -interval 10
2020/02/09 16:45:33 Starting...
2020/02/09 16:45:33 ===New iteration===
2020/02/09 16:45:33 Load    :  50 x 1000 = 50000  metrics
2020/02/09 16:45:34 Spent   :  0.043315 seconds
2020/02/09 16:45:34 Speed   :  1154334.5 metrics/second
2020/02/09 16:45:34 Sleeping:  9.956684 seconds
```

The first 30m is bucketmax, then 15m of max, then bucketmax again. You can see significantly lower CPU usage with bucketmax.

[Graph "b_max": CPU usage across the bucketmax / max / bucketmax intervals]

deniszh commented 4 years ago

> But CPU is nearly identical and memory usage is higher, which is quite different from what I measured. I'll do some more testing.

Well, memory is the same, I just waited longer. But you're right, an overflow test makes much more sense here.

piotr1212 commented 4 years ago

Ok, merging this.

Dieterbe commented 4 years ago

Is there an explanation for the significantly reduced pointsPerUpdate? I thought the behavior would be equivalent.

piotr1212 commented 4 years ago

With the 'max' strategy, CPU goes to 100%, so I assume it does fewer write operations, which results in larger queues and thus more points per update. I probably should have added 'write_operations' to the graph and done a test that does not overload the CPU.
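
As rough arithmetic from the 50k-metrics test above: 50k metrics every 10s is 5,000 points/s coming in, against MAX_UPDATES_PER_SECOND = 500, so a writer that keeps up has to flush about 5000 / 500 = 10 points per update on average; once the real write rate drops below that (e.g. at 100% CPU), the queues, and therefore pointsPerUpdate, keep growing.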

Dieterbe commented 4 years ago

Ah, that makes sense and explains why pointsPerUpdate grows linearly. This also means the memory usage with max in a non-overload scenario is likely about the same as bucketmax, judging from the start of this 'max' run.