Closed: Pheels closed this issue 7 years ago.
For reference, here's my carbon.conf (everything not included is commented out):
ENABLE_LOGROTATION = True
USER = _graphite
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 1000
MAX_CREATES_PER_MINUTE = inf
LINE_RECEIVER_INTERFACE = 127.0.0.1
#LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 127.0.0.1
PICKLE_RECEIVER_PORT = 2004
USE_INSECURE_UNPICKLER = False
USE_FLOW_CONTROL = True
LOG_UPDATES = True
LOG_CACHE_HITS = False
LOG_CACHE_QUEUE_SORTS = False
CACHE_WRITE_STRATEGY = naive
WHISPER_AUTOFLUSH = False
WHISPER_FALLOCATE_CREATE = True
[cache:a]
LINE_RECEIVER_PORT = 2013
PICKLE_RECEIVER_PORT = 2014
CACHE_QUERY_PORT = 7012
[cache:b]
LINE_RECEIVER_PORT = 2023
PICKLE_RECEIVER_PORT = 2024
CACHE_QUERY_PORT = 7022
[cache:c]
LINE_RECEIVER_PORT = 2033
PICKLE_RECEIVER_PORT = 2034
CACHE_QUERY_PORT = 7032
[cache:d]
LINE_RECEIVER_PORT = 2043
PICKLE_RECEIVER_PORT = 2044
CACHE_QUERY_PORT = 7042
[relay]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2004
LOG_LISTENER_CONNECTIONS = True
RELAY_METHOD = consistent-hashing
REPLICATION_FACTOR = 1
DESTINATIONS = 127.0.0.1:2014:a, 127.0.0.1:2024:b, 127.0.0.1:2034:c, 127.0.0.1:2044:d
MAX_DATAPOINTS_PER_MESSAGE = 500
MAX_QUEUE_SIZE = 10000
USE_FLOW_CONTROL = True
USE_WHITELIST = True
Hello @olivermc1. Please note that the carbon_ch hash can produce quite uneven metric distribution, with differences of 20-30%: https://github.com/graphite-project/carbon/issues/485. Metrics are also sometimes unevenly distributed over time, so perhaps the problematic machine is simply receiving more metrics?
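The skew can be illustrated with a small simulation of a carbon-style consistent-hash ring. This is a simplified sketch, not carbon's actual ConsistentHashRing implementation; the metric names, replica count, and key format here are illustrative assumptions:

```python
import hashlib
from bisect import bisect
from collections import Counter

# Number of virtual nodes per destination; a value in this ballpark is
# typical for consistent hashing, but carbon's exact internals may differ.
REPLICAS = 100

def hash_key(key: str) -> int:
    # Position on the ring: first 4 hex digits of an MD5 digest (simplified).
    return int(hashlib.md5(key.encode()).hexdigest()[:4], 16)

def build_ring(nodes):
    # Each node gets REPLICAS positions on the ring.
    return sorted((hash_key(f"({node}, {i})"), node)
                  for node in nodes for i in range(REPLICAS))

def get_node(ring, metric):
    # A metric goes to the first ring position at or after its own hash.
    positions = [pos for pos, _ in ring]
    idx = bisect(positions, hash_key(metric)) % len(ring)
    return ring[idx][1]

nodes = ["a", "b", "c", "d"]
ring = build_ring(nodes)
counts = Counter(get_node(ring, f"servers.host{i}.cpu.load")
                 for i in range(10000))
print(counts)  # the per-node counts are typically noticeably uneven
```

Running this with synthetic metric names shows the per-destination counts are rarely equal, which is the same effect described in issue #485.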
Another issue: I honestly don't know how carbon behaves when the cache size is set to inf but USE_FLOW_CONTROL = True. If someone knows, please explain; I'm very interested. That's why I usually recommend setting MAX_CACHE_SIZE to some big but sane value. It looks like 15,000,000 is too low; try increasing it tenfold, to 150,000,000.
The third issue is very strange, and I have no explanation for it either. That's the number of metrics on the incoming relay, right? How can it drop, then?
PS: MAX_CACHE_SIZE is not in bytes but in data points, at approximately 12 bytes each.
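Given the 12-bytes-per-datapoint approximation above, the memory implied by a MAX_CACHE_SIZE value is easy to estimate. A quick back-of-envelope sketch (the function name is mine, and real overhead will be higher because of Python object headers and dict structures):

```python
# Rough memory footprint of a carbon cache at a given MAX_CACHE_SIZE.
# 12 bytes/datapoint is the approximation stated above, not a measured figure.
BYTES_PER_POINT = 12

def cache_memory_gb(max_cache_size: int) -> float:
    return max_cache_size * BYTES_PER_POINT / 1024**3

print(f"{cache_memory_gb(150_000_000):.1f} GB")  # ≈ 1.7 GB per cache process
print(f"{cache_memory_gb(15_000_000):.2f} GB")   # ≈ 0.17 GB at the old setting
```

So the suggested 150,000,000 cap would bound each cache process at roughly 1.7 GB of datapoint storage, which helps explain why 15,000,000 was filling up quickly at 300k metrics per minute.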
Hi Deniszh, Thanks for the response.
Prior to this config change, each machine was receiving an even number of metrics (within 5%), which is why the discrepancies seem so unnatural.
Since this morning I've had MAX_CACHE_SIZE set to 30,000,000, which has stabilised the memory usage and cache queues and brought the CPU back up, which was the reason I made this config change in the first place.
Purple and blue lines are the problematic machine again:
and as you can see, the memory use has now been capped:
Then I would say that something is wrong with that server, indeed.
I imported obfuscurity's extended dashboard and thought this graph really highlights the problem: my 8 other caches have fewer than 1k datapoints, whereas the 4 on this machine have 12-15k. Clearly there is an abundance of metrics stuck in these caches.
Closing this - was indeed an underlying hardware issue. Swapped out the machine and now all three are performing similarly.
Hi, I wanted to share some interesting behaviour I've observed over the past few days since increasing MAX_CACHE_SIZE to inf. First, some information about my Graphite setup:
My setup is clustered, with a top level carbon-relay filtering metrics through to 3 dedicated Carbon Cache machines. The top relay receives around 300k metrics per minute, and ships them to the caches with a replication factor of 2.
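With a replication factor of 2 spread across 3 cache machines, each machine should receive roughly two thirds of the incoming stream. A quick sanity check using the figures above (this assumes an even consistent-hash distribution, which, as noted elsewhere in this thread, is not guaranteed):

```python
# Expected per-machine load from the setup described above.
incoming_per_min = 300_000  # metrics/min arriving at the top-level relay
replication = 2             # REPLICATION_FACTOR on the top relay
machines = 3                # dedicated carbon-cache machines

# Total fan-out divided evenly across the cache machines.
per_machine = incoming_per_min * replication / machines
print(per_machine)  # 200000.0 metrics/min per cache machine if evenly spread
```

Any machine consistently receiving much more than this baseline points to hash skew or a host-level problem rather than a change in what is being shipped.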
The motivation behind this change was high CPU and low memory utilisation on each box. Previously, MAX_CACHE_SIZE was set to 15000000. Once this was changed to inf, I observed some positive changes in my carbon metrics, and some that I'm struggling to explain. Note: each line in the following graphs represents a carbon process on one of the 4 machines mentioned.
Firstly, CPU has decreased drastically:
However, this has been coupled with erratic changes to memory usage - note, the top three metrics come from the same machine:
It's also worth adding that the cache sizes have increased from < 1000 to over 150k on this problematic machine, whilst remaining stable on the other two. This has been met with a drop in update operations and committed points.
Finally, and the reason that I'm filing this issue is I've observed a visible drop in metrics received (the yellow line being the top relay):
I can say with absolute certainty that I am still shipping exactly the same number of metrics as I was before the change, so I think it's quite clear that one of my machines has been dropping metrics since the config change. Whether this is a software or hardware issue I am unsure, but all three of these cache machines have the same spec.
I guess what I came here to ask is: has anybody seen this issue before? I don't want to spend time swapping the machine out only to realise that it's a software bug, or a problem with my config. Any help is greatly appreciated.
Thanks, Oliver.