Closed by maxwell-gregory 8 years ago
Changing CACHE_WRITE_STRATEGY from naive to sorted fixed this. Any idea why? This was only happening on one instance; all other instances and nodes were fine.
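For anyone landing here later, the write strategy is set in the [cache] section of carbon.conf. A minimal sketch (option descriptions paraphrased from the example config shipped with carbon; exact behavior may vary by version):

```
[cache]
# sorted: flush metrics ordered by how many datapoints each has
#         queued, so each whisper update writes a larger batch.
# max:    always flush the metric with the most queued datapoints.
# naive:  flush metrics in no particular order.
CACHE_WRITE_STRATEGY = sorted
```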
@NoMotion there's no reasonable explanation for one host behaving differently. Generally speaking, you want to have MAX_CACHE_SIZE set to a non-inf value anyway to maximize PPU. And of course, to actually increase PPU you need UPDATES_PER_SECOND to be lower than your current incoming volume.
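For context, both knobs live in the [cache] section of carbon.conf. A hedged sketch (note the stock setting is spelled MAX_UPDATES_PER_SECOND there, and the numbers below are placeholders, not recommendations):

```
[cache]
# Cap on queued datapoints held in memory (a count, not bytes);
# prevents unbounded growth if the disk can't keep up.
MAX_CACHE_SIZE = 2000000

# Throttle whisper updates; keeping this below your incoming rate
# lets datapoints pool in the cache, so each update flushes more
# points and PPU rises.
MAX_UPDATES_PER_SECOND = 500
```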
I have to assume that something is different between your configurations to cause the isolated behavior.
@obfuscurity Thanks for the reply. I felt exactly the same way: something must be different between configurations, but we triple-checked them. The weird part is that it's only one instance that's misbehaving, and as you know, all instances on a node share the same config file. We will play with it in the future, since ultimately we want naive mode.
Additionally, I had never heard of MAX_CACHE_SIZE maximizing PPU. Do you have a recommendation for cache size? We push 1.8 million metrics/min, and during high volume that can double. We were under the impression that if the cache size is reached, Graphite will throw away metrics, which is not desirable.
I don't have a hard-and-fast rule for setting your MAX_CACHE_SIZE, but you want to avoid setting it to inf if there's a chance an outage or bottleneck will cause you to hit it (resulting in an OOM condition and a crash). Generally speaking, it should be some value representing the number of metrics you're comfortable storing in memory for some temporary amount of time. Obviously, any metrics that can't eventually flush to disk (whether due to a crash or an unavailable downstream writer) will be lost.
Personally I used to always run it at inf, but I was also running it on bare metal. These days a lot of folks try to run it in VMs or containers, and in those cases it's very important to limit the memory use.
Here is an excerpt on MAX_CACHE_SIZE from the Graphite book:
"This setting determines the amount of memory carbon-cache should be allowed to use for caching. The default value (inf) assumes that your disk subsystem can keep up with the amount of creates and updates performed by carbon-cache. If your write performance is poor, it's a good idea to set this to a finite number representing the number of datapoints to store in memory (not the size in bytes). If the cache hits this limit, datapoints may be lost until the buffer catches up on writes. Each datapoint in Whisper is 12 bytes large, so you should be able to calculate some safe numbers based on your available memory. If you're running Graphite in a heavily utilized virtual host, I strongly encourage you to set MAX_CACHE_SIZE to a non-inf value."
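As a rough worked example of that sizing advice (the 512 MB budget and the resulting figure are illustrative assumptions, not recommendations; the in-memory footprint per point is larger than the 12-byte on-disk size, so treat this as an upper bound):

```
[cache]
# 512 MB budget at ~12 bytes per datapoint:
#   512 * 1024 * 1024 / 12 ≈ 44.7 million datapoints
# At 1.8M datapoints/min (~30k/sec) that's roughly 25 minutes of
# backlog before new datapoints start being dropped.
MAX_CACHE_SIZE = 44000000
```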
@obfuscurity I shall propose this to our architect. Thank you for your valuable time on this issue. If I end up finding more evidence of this being a potential bug, I'll be sure to post it here.
Using the same configuration across our cluster, one carbon-cache instance is stuck at 1 PPU with its memory usage approaching infinity
In our environment we have 3 clustered backend nodes. Data is passed from the front end to one carbon-relay via UDP, then distributed to 4 carbon-cache instances local to the relay via the pickle protocol. Our environment uses consistent hashing.
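For reference, a relay configuration along these lines might look like the sketch below (ports and instance names are placeholders; each DESTINATIONS entry is a host:port:instance triple that feeds the consistent-hash ring):

```
[relay]
RELAY_METHOD = consistent-hashing
REPLICATION_FACTOR = 1
# Four carbon-cache instances local to the relay, reached over
# their pickle receiver ports.
DESTINATIONS = 127.0.0.1:2004:a, 127.0.0.1:2104:b, 127.0.0.1:2204:c, 127.0.0.1:2304:d
```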
The problem arises after upgrading to 0.9.15: one carbon-cache instance has the same number of updateOperations as the others but won't go past 1 point per update (PPU, i.e. committedPoints divided by updateOperations). All other carbon-cache instances average about 5-6 PPU and behave as intended. Looking at the data, updateOperations and metricsReceived are the same across all instances; however, committedPoints and PPU are too low on the problem instance.
[Screenshot: Problem Instance]
[Screenshot: Healthy Instance]
Notice that the blue and red lines are close to the yellow line in the healthy instance.
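For anyone comparing instances the same way, these are carbon's self-reported counters (metric paths as typically emitted by 0.9.x carbon-cache; the host and instance segments are placeholders):

```
carbon.agents.<host>-<instance>.metricsReceived
carbon.agents.<host>-<instance>.updateOperations
carbon.agents.<host>-<instance>.committedPoints
carbon.agents.<host>-<instance>.pointsPerUpdate
carbon.agents.<host>-<instance>.cache.size
```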
As a result, this one instance's memory usage grows without bound (since MAX_CACHE_SIZE is inf). The configuration is the same for all instances on all nodes, and all other instances average 50-70 MB of memory. This instance is also receiving the same number of points as all others in the cluster.
Things we have tried: