graphite-project / carbon

Carbon is one of the components of Graphite, and is responsible for receiving metrics over the network and writing them down to disk using a storage backend.
http://graphite.readthedocs.org/
Apache License 2.0
1.51k stars, 490 forks

Carbon cache instance stuck at one ppu #566

Closed (maxwell-gregory closed this 8 years ago)

maxwell-gregory commented 8 years ago

With the same configuration used across our cluster, one carbon-cache instance is stuck at 1 PPU and its memory usage grows without bound.

In our environment we have 3 clustered backend nodes. Data is passed from the front end to one carbon-relay via UDP, then distributed via the pickle protocol to 4 carbon-cache instances local to the relay. Our environment uses consistent hashing.

backend{1,2,3}
└── carbon-relay-a
    ├── carbon-cache-a
    ├── carbon-cache-b
    ├── carbon-cache-c
    └── carbon-cache-d
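
For readers unfamiliar with the routing step, here is a minimal sketch of how consistent hashing can map metric names onto the four cache instances above. It is a toy ring, not carbon's actual implementation; the `HashRing` class, the replica count, and the example metric name are all assumptions made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring (illustrative only, not carbon's code)."""

    def __init__(self, nodes, replicas=100):
        # Each node is placed on the ring many times to even out the load.
        self._ring = []
        for node in nodes:
            for i in range(replicas):
                self._ring.append((self._position(f"{node}:{i}"), node))
        self._ring.sort()

    @staticmethod
    def _position(key):
        # Map a string to an integer position on the ring.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest()[:8], 16)

    def get_node(self, metric):
        # Pick the first ring entry at or after the metric's position,
        # wrapping around to the start of the ring if necessary.
        pos = self._position(metric)
        idx = bisect.bisect_left(self._ring, (pos, ""))
        return self._ring[idx % len(self._ring)][1]


caches = ["carbon-cache-a", "carbon-cache-b", "carbon-cache-c", "carbon-cache-d"]
ring = HashRing(caches)
print(ring.get_node("servers.web01.cpu.load"))  # hypothetical metric name
```

The point of the sketch is that a given metric name always lands on the same instance, so with a reasonably even ring each cache should see a similar share of the traffic.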

The problem arose after upgrading to 0.9.15: one carbon-cache instance has the same number of updateOperations as the others but won't go past 1 point per update (PPU). All other carbon-cache instances average about 5-6 PPU and behave as intended. Looking at the data, updateOperations and metricsReceived are the same across all instances. However, committedPoints and PPU are too low on the problem instance.
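
To make the symptom concrete, here is a small sketch of the arithmetic, assuming PPU is simply committedPoints divided by updateOperations over the same interval. The function name and the sample counts are hypothetical, chosen only to mirror the 1 PPU versus 5-6 PPU gap described above.

```python
def points_per_update(committed_points, update_operations):
    """Average number of datapoints written per update operation."""
    if update_operations == 0:
        # No updates in the interval, so there is nothing to average.
        return 0.0
    return committed_points / update_operations


# Hypothetical counts for one reporting interval.
healthy = points_per_update(committed_points=60_000, update_operations=10_000)
problem = points_per_update(committed_points=10_000, update_operations=10_000)
print(healthy, problem)
```

With the same number of update operations, the problem instance commits one point per write, so it drains its cache several times more slowly than its peers while receiving the same input.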

Problem instance: [image: screenshot 2016-06-27 12 01 08]

Healthy instance: [image: screenshot 2016-06-27 11 59 47]

Notice the blue and red lines are close to the yellow line in the healthy instance

As a result, this one instance's memory usage grows without bound (since MAX_CACHE_SIZE is inf). Configuration is the same for all instances on all nodes, and all other instances average 50-70MB of memory. This instance is also receiving the same number of points as all the others in the cluster.

Things we have tried:

  1. Restarting the node
  2. Restarting all graphite modules (graphite-web, carbon-relay, carbon-caches)
  3. Altering UPDATES_PER_SECOND
  4. Scouring the logs: no exceptions or irregularities in the kernel logs or in any Graphite module logs

maxwell-gregory commented 8 years ago

Changing CACHE_WRITE_STRATEGY from naive to sorted fixed this. Any idea why? This was only happening on one instance. All other instances and nodes were fine.
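
For context, here is a rough sketch of the difference between the two strategies as I understand them from the comments in carbon.conf: sorted drains the metrics with the most cached datapoints first, while naive drains them in no particular order. This is not carbon's code and it does not explain why only one instance misbehaved; the `flush_order` function and the sample cache are hypothetical.

```python
def flush_order(cache, strategy):
    """Return metric names in the order a writer would drain them."""
    if strategy == "sorted":
        # Largest backlog first, so each write carries as many points as possible.
        return sorted(cache, key=lambda name: len(cache[name]), reverse=True)
    if strategy == "naive":
        # Whatever order the mapping yields; backlog size is ignored.
        return list(cache)
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical cache: metric name -> list of (timestamp, value) datapoints.
cache = {
    "app.requests": [(1, 1.0)],
    "app.errors": [(1, 0.0), (2, 0.0), (3, 1.0)],
    "app.latency": [(1, 12.5), (2, 11.0)],
}
print(flush_order(cache, "sorted"))
print(flush_order(cache, "naive"))
```

Under sorted, metrics with a larger backlog are written first, which tends to raise the number of points per write; under naive, a metric holding a single point can be written just as early as one holding many.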

obfuscurity commented 8 years ago

@NoMotion there's no reasonable explanation for one host behaving differently. Generally speaking you want to have MAX_CACHE_SIZE set to a non-inf value anyways to maximize PPU. And of course to actually increase PPU you need UPDATES_PER_SECOND to be lower than your current volume.

I have to assume that something is different between your configurations to cause the isolated behavior.

maxwell-gregory commented 8 years ago

@obfuscurity Thanks for the reply. I felt exactly the same way: something must be different between the configurations, but we triple-checked them. The weird part is that it's only one instance that's misbehaving, and as you know, all instances on a node share the same config file. We will play with it in the future, since ultimately we want naive mode.

Additionally, I had never heard of MAX_CACHE_SIZE maximizing PPU. Do you have a recommendation for cache size? We push 1.8 million metrics/min, and during high volume that can double. We were under the impression that if the cache size limit is reached, Graphite will throw away metrics, which is not desirable.

obfuscurity commented 8 years ago

I don't have a hard and fast rule for setting your MAX_CACHE_SIZE but you want to avoid setting it to inf if there's a chance an outage or bottleneck will cause you to hit it (resulting in an OOM condition and crash). Generally speaking it should be some value representing the number of metrics you're comfortable storing in memory for some temporary amount of time. Obviously any metrics that can't eventually flush to disk (either due to crash or unavailable downstream writer) will be lost.

Personally I used to always run it at inf, but I was also running it on bare metal. These days a lot of folks try to run it in VMs or containers, and in those cases it's very important to limit the memory use.

Here is an excerpt on MAX_CACHE_SIZE from the Graphite book:

"This setting determines the amount of memory carbon-cache should be allowed to use for caching. The default value (inf) assumes that your disk subsystem can keep up with the amount of creates and updates performed by carbon-cache. If your write performance is poor, it's a good idea to set this to a finite number representing the number of datapoints to store in memory (not the size in bytes). If the cache hits this limit, datapoints may be lost until the buffer catches up on writes. Each datapoint in Whisper is 12 bytes large, so you should be able to calculate some safe numbers based on your available memory. If you're running Graphite in a heavily utilized virtual host, I strongly encourage you to set MAX_CACHE_SIZE to a non-inf value."

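As a rough illustration of the sizing arithmetic in the excerpt, the sketch below estimates how many datapoints would pile up if writes stalled for a while. It assumes the 1.8 million metrics/min figure from this thread means datapoints per minute arriving at a single cache, and the 10-minute window is a made-up example. The 12-byte figure is the on-disk Whisper size quoted above; Python objects held in memory take considerably more, so treat the byte count as a lower bound.

```python
BYTES_PER_WHISPER_POINT = 12  # on-disk size from the excerpt; in-memory cost is higher


def datapoints_for_window(points_per_minute, minutes):
    """Datapoints that accumulate if nothing is flushed for `minutes`."""
    return int(points_per_minute * minutes)


def whisper_bytes(datapoints):
    """On-disk size of that many datapoints, as a lower bound on memory."""
    return datapoints * BYTES_PER_WHISPER_POINT


# Hypothetical 10-minute stall at the normal and doubled ingest rates.
normal = datapoints_for_window(1_800_000, minutes=10)
peak = datapoints_for_window(2 * 1_800_000, minutes=10)
print(normal, peak, whisper_bytes(peak))
```

If the traffic is spread over several carbon-cache instances, scale the per-minute rate down accordingly before picking a MAX_CACHE_SIZE value.
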
maxwell-gregory commented 8 years ago

@obfuscurity I shall propose this to our architect. Thank you for your valuable time on this issue. If I end up finding more evidence of this being a potential bug, I'll be sure to post it here.