graphite-project / carbon

Carbon is one of the components of Graphite, and is responsible for receiving metrics over the network and writing them down to disk using a storage backend.
http://graphite.readthedocs.org/
Apache License 2.0

Interesting behaviour regarding MAX_CACHE_SIZE #664

Closed Pheels closed 7 years ago

Pheels commented 7 years ago

Hi, I wanted to share some interesting behaviour I've observed over the past few days since increasing MAX_CACHE_SIZE to inf. First, some information about my Graphite setup:

My setup is clustered, with a top level carbon-relay filtering metrics through to 3 dedicated Carbon Cache machines. The top relay receives around 300k metrics per minute, and ships them to the caches with a replication factor of 2.
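
As a rough back-of-the-envelope check (my own arithmetic, assuming the relay spreads load evenly across the three cache machines):

```python
# Rough per-machine write load implied by the numbers above.
# This is illustrative arithmetic only, assuming a perfectly even spread.
incoming_per_min = 300_000        # metrics/min arriving at the top relay
replication_factor = 2            # each metric is shipped to two cache machines
cache_machines = 3

written_per_min = incoming_per_min * replication_factor    # 600,000 datapoints/min total
per_machine_per_min = written_per_min / cache_machines     # 200,000 datapoints/min each
print(per_machine_per_min / 60)   # ~3,333 datapoints/s arriving at each cache machine
```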

The motivation behind this change was high CPU and low memory utilisation on each box. Previously, MAX_CACHE_SIZE was set to 15000000. Once this was changed to inf, I observed some positive changes in my carbon metrics, and some that I'm struggling to explain. Note that each line in the following diagrams is a carbon process on one of the 4 machines mentioned above.

Firstly, CPU has decreased drastically: [screenshot 2017-06-05 at 11 18 45]

However, this has been coupled with erratic changes in memory usage - note that the top three metrics come from the same machine: [screenshot 2017-06-05 at 11 22 41]

It's also worth adding that the cache sizes have increased from under 1,000 to over 150k on this problematic machine, whilst remaining stable on the other two. This has been accompanied by a drop in update operations and committed points.

Finally, and this is the reason I'm filing this issue, I've observed a visible drop in metrics received (the yellow line being the top relay): [screenshot 2017-06-05 at 11 33 56]

I can say with absolute certainty that I am still shipping exactly the same number of metrics as before the change, so I think it's quite clear that one of my machines has been dropping metrics since the config change, and may still be. Whether this is a software or hardware issue I am unsure, but all three of these cache machines are the same spec.

I guess what I came here to ask is: has anybody seen this issue before? I don't want to spend time swapping the machine out only to realise that it's a software bug or a problem with my config. Any help is greatly appreciated.

Thanks, Oliver.

Pheels commented 7 years ago

For reference, here's my carbon.conf - everything not included is commented out:


ENABLE_LOGROTATION = True

USER = _graphite

MAX_CACHE_SIZE = inf

MAX_UPDATES_PER_SECOND = 1000

MAX_CREATES_PER_MINUTE = inf

LINE_RECEIVER_INTERFACE = 127.0.0.1
#LINE_RECEIVER_PORT = 2003

PICKLE_RECEIVER_INTERFACE = 127.0.0.1
PICKLE_RECEIVER_PORT = 2004

USE_INSECURE_UNPICKLER = False

USE_FLOW_CONTROL = True

LOG_UPDATES = True
LOG_CACHE_HITS = False
LOG_CACHE_QUEUE_SORTS = False

CACHE_WRITE_STRATEGY = naive

WHISPER_AUTOFLUSH = False

WHISPER_FALLOCATE_CREATE = True

[cache:a]
LINE_RECEIVER_PORT = 2013
PICKLE_RECEIVER_PORT = 2014
CACHE_QUERY_PORT = 7012

[cache:b]
LINE_RECEIVER_PORT = 2023
PICKLE_RECEIVER_PORT = 2024
CACHE_QUERY_PORT = 7022

[cache:c]
LINE_RECEIVER_PORT = 2033
PICKLE_RECEIVER_PORT = 2034
CACHE_QUERY_PORT = 7032

[cache:d]
LINE_RECEIVER_PORT = 2043
PICKLE_RECEIVER_PORT = 2044
CACHE_QUERY_PORT = 7042

[relay]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2004

LOG_LISTENER_CONNECTIONS = True

RELAY_METHOD = consistent-hashing
REPLICATION_FACTOR = 1

DESTINATIONS = 127.0.0.1:2014:a, 127.0.0.1:2024:b, 127.0.0.1:2034:c, 127.0.0.1:2044:d

MAX_DATAPOINTS_PER_MESSAGE = 500
MAX_QUEUE_SIZE = 10000

USE_FLOW_CONTROL = True

USE_WHITELIST = True

deniszh commented 7 years ago

Hello @olivermc1. Please note that the carbon_ch hash can distribute metrics quite unevenly - the difference between destinations can be 20-30% (see https://github.com/graphite-project/carbon/issues/485). Also, metrics are sometimes not evenly distributed over time - so maybe the problematic machine is simply receiving more metrics?
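
To illustrate the point, here is a toy simulation of how a consistent-hash ring with only a few destinations can split metric names unevenly. This is a simplified sketch, not carbon's actual ConsistentHashRing implementation; the node names and metric names are made up:

```python
import bisect
import hashlib
import random
import string
from collections import Counter

def ring_position(key):
    # Map a string to a position on the hash ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def build_ring(nodes, replicas=100):
    # Place each node on the ring at several points ("virtual nodes").
    return sorted((ring_position('%s:%d' % (node, i)), node)
                  for node in nodes for i in range(replicas))

def get_node(ring, key):
    # Walk clockwise from the key's position to the next node point.
    positions = [pos for pos, _ in ring]
    idx = bisect.bisect(positions, ring_position(key)) % len(ring)
    return ring[idx][1]

random.seed(42)
ring = build_ring(['cache-a', 'cache-b', 'cache-c'])
names = ['servers.host%03d.%s' % (i, ''.join(random.choices(string.ascii_lowercase, k=8)))
         for i in range(10000)]
print(Counter(get_node(ring, name) for name in names))
# The three counts rarely come out equal; with few nodes the imbalance can be noticeable.
```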

Another issue: I honestly don't know how carbon behaves when the cache size is set to inf but USE_FLOW_CONTROL = True. If someone knows, please explain - I'm very interested. That's why I usually recommend setting MAX_CACHE_SIZE to some big but sane value. 15 000 000 looks too low; try increasing it 10 times, to 150 000 000.

The third issue is very strange; I have no explanation either. That's the number of metrics received on the incoming relay, right? How could that drop?

deniszh commented 7 years ago

PS: MAX_CACHE_SIZE is not measured in bytes but in data points - each datapoint is approximately 12 bytes.
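
As a quick sanity check on those figures (my own arithmetic; the ~12 bytes is a lower bound, and real Python object overhead adds more on top):

```python
# Lower-bound memory estimate for the suggested MAX_CACHE_SIZE of 150,000,000
# datapoints, using the ~12 bytes/datapoint figure above.
max_cache_size = 150_000_000      # datapoints per cache instance
bytes_per_datapoint = 12          # lower bound; actual per-point overhead is higher
print(max_cache_size * bytes_per_datapoint / 1024**3)   # ~1.7 GiB per cache instance
```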

Pheels commented 7 years ago

Hi Deniszh, Thanks for the response.

Prior to this config change, each machine was receiving an even number of metrics (within 5%), which is why the discrepancies seem so unnatural.

Since this morning I've had MAX_CACHE_SIZE set to 30000000, which has resulted in the memory usage and cache queues stabilising and the CPU rising back up - the high CPU being the reason I made this config change in the first place.
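
For what it's worth, applying the ~12 bytes/datapoint figure mentioned above to this new cap gives a rough lower bound on per-machine cache memory (my own arithmetic; actual usage per point is higher):

```python
# Rough lower bound on cache memory per machine with MAX_CACHE_SIZE = 30,000,000
# and the 4 cache instances (a/b/c/d) from the config above.
max_cache_size = 30_000_000       # datapoints per cache instance
bytes_per_datapoint = 12          # lower bound; Python overhead adds more
instances_per_machine = 4
total_bytes = max_cache_size * bytes_per_datapoint * instances_per_machine
print(total_bytes / 1024**3)      # ~1.3 GiB lower bound per machine when caches are full
```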

Purple and blue lines are the problematic machine again: [screenshot 2017-06-05 at 16 20 45]

and as you can see, the memory use has now been capped: [screenshot 2017-06-05 at 16 22 33]

deniszh commented 7 years ago

Then I would say that something is wrong with that server, indeed.

Pheels commented 7 years ago

I imported obfuscurity's extended dashboard and this graph really highlights the problem - my 8 other caches have fewer than 1k datapoints queued, whereas the 4 on this machine have 12-15k. Clearly there is a backlog of metrics stuck in these caches.

[screenshot 2017-06-06 at 16 31 25]

Pheels commented 7 years ago

Closing this - it was indeed an underlying hardware issue. I swapped out the machine and now all three caches are performing similarly.