graphite-project / carbon

Carbon is one of the components of Graphite, and is responsible for receiving metrics over the network and writing them down to disk using a storage backend.
http://graphite.readthedocs.org/
Apache License 2.0

Please suggest carbon config settings to eliminate metric drops #870

Closed: satish-chef closed this issue 4 years ago

satish-chef commented 4 years ago

I am running a Graphite server in production that receives data from 10 AWS regions via carbon-relay-ng. The whisper version is 0.9.12 and the OS is CentOS 6.10. The Graphite server is hosted on an AWS EC2 instance of type c4.8xlarge (60 GB RAM, 36 CPU cores). The Graphite storage directory "/opt/graphite/storage" is mounted on a 6 TB GP2 EBS volume with 16000 IOPS.

Below is the content of /opt/graphite/conf/carbon.conf on the Graphite server:

[cache]
WHISPER_FALLOCATE_CREATE = True
WHISPER_FADVISE_RANDOM = False
CERES_NODE_CACHING_BEHAVIOR = none

GRAPHITE_ROOT = /opt/graphite
GRAPHITE_CONF_DIR = /opt/graphite/conf
GRAPHITE_STORAGE_DIR = /opt/graphite/storage
LOCAL_DATA_DIR = /opt/graphite/storage/whisper
USER = apache
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 14000
MAX_CREATES_PER_MINUTE = 60000

ENABLE_TCP_LISTENER = True
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2103

ENABLE_UDP_LISTENER = True
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 2103

ENABLE_PICKLE_LISTENER = True
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2104

USE_INSECURE_UNPICKLER = False

CACHE_QUERY_INTERFACE = 0.0.0.0
CACHE_QUERY_PORT = 7002

USE_FLOW_CONTROL = True
LOG_UPDATES = False

WHISPER_AUTOFLUSH = False

ENABLE_MANHOLE = False
# Example: store everything
# BIND_PATTERNS = #

[cache:b]
WHISPER_FALLOCATE_CREATE = True
WHISPER_FADVISE_RANDOM = False
CERES_NODE_CACHING_BEHAVIOR = none

GRAPHITE_ROOT = /opt/graphite
GRAPHITE_CONF_DIR = /opt/graphite/conf
GRAPHITE_STORAGE_DIR = /opt/graphite/storage
LOCAL_DATA_DIR = /opt/graphite/storage/whisper
USER = apache
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 14000
MAX_CREATES_PER_MINUTE = 60000
CACHE_QUERY_PORT = 7102

ENABLE_TCP_LISTENER = True
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2203

ENABLE_UDP_LISTENER = True
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 2203

ENABLE_PICKLE_LISTENER = True
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2204

USE_INSECURE_UNPICKLER = False

CACHE_QUERY_INTERFACE = 0.0.0.0

USE_FLOW_CONTROL = True
LOG_UPDATES = False

WHISPER_AUTOFLUSH = False

ENABLE_MANHOLE = False
# Example: store everything
# BIND_PATTERNS = #

[relay]
USER = apache
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2004
MAX_DATAPOINTS_PER_MESSAGE = 4000
MAX_QUEUE_SIZE = 15000
# Set this to False to drop datapoints when any send queue (sending datapoints
# to a downstream carbon daemon) hits MAX_QUEUE_SIZE. If this is True (the
# default) then sockets over which metrics are received will temporarily stop accepting
# data until the send queues fall below 80% MAX_QUEUE_SIZE.
USE_FLOW_CONTROL = True
# Local relay sharding to multiple carbon-cache on localhost, we only support
# sharding in this config.  Change $targets if needed
RELAY_METHOD = consistent-hashing
REPLICATION_FACTOR = 1
DESTINATIONS = 127.0.0.1:2104,localhost:2204
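
With a relay fanning out to several local carbon-cache instances, one quick sanity check is that every port in the relay's DESTINATIONS is actually a PICKLE_RECEIVER_PORT of some cache section. Below is a minimal Python sketch of that check, assuming the config parses as plain INI. The helper name `unmatched_destinations` and the trimmed `CARBON_CONF` string are illustrative, not part of carbon.

```python
import configparser

# Trimmed copy of the carbon.conf posted above; only the keys this check needs.
CARBON_CONF = """
[cache]
PICKLE_RECEIVER_PORT = 2104

[cache:b]
PICKLE_RECEIVER_PORT = 2204

[relay]
RELAY_METHOD = consistent-hashing
DESTINATIONS = 127.0.0.1:2104,localhost:2204
"""


def unmatched_destinations(conf_text):
    """Return relay destination ports that no [cache*] section listens on.

    Hypothetical helper for illustration; carbon does not ship it.
    """
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the upper-case key names as written
    parser.read_string(conf_text)

    # Collect the pickle port of [cache] and every [cache:<instance>] section.
    cache_ports = {
        parser.getint(section, "PICKLE_RECEIVER_PORT")
        for section in parser.sections()
        if section == "cache" or section.startswith("cache:")
    }
    # Each destination is host:port; only the port is compared here.
    dest_ports = [
        int(dest.strip().split(":")[1])
        for dest in parser.get("relay", "DESTINATIONS").split(",")
    ]
    return [port for port in dest_ports if port not in cache_ports]


print(unmatched_destinations(CARBON_CONF))
```

For the config in this post the list comes back empty, since 2104 and 2204 are the pickle ports of [cache] and [cache:b].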

Below is the Graph of committed points:

image

Below is the Graph of metrics received for both carbon-caches:

image

Below are some other graphs that might help you assess the situation:

image

I am observing drops for all metrics. The frequency is 12-20 drops every 12 hours; for some metrics it is much lower, around 2 every 12 hours. Can someone please suggest configuration settings for carbon-cache and/or carbon-relay that would help me eliminate the metric drops? Thanks in advance!

deniszh commented 4 years ago

Hi @satish-chef ,

MAX_QUEUE_SIZE = 15000 looks very small. Try 100000.
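
To get a rough feel for why 15000 may be small, one can estimate how long a relay send queue could absorb input if its destination stalls. This is a back-of-the-envelope Python sketch, not carbon's actual queueing logic. It assumes the roughly 150K metrics/min figure quoted later in this thread, an even split across the two caches, and that MAX_QUEUE_SIZE applies to each destination's queue separately; all three are assumptions.

```python
def seconds_until_full(max_queue_size, datapoints_per_second):
    """Seconds a send queue can absorb input while its destination is stalled."""
    return max_queue_size / datapoints_per_second


# Assumed: ~150K datapoints/min in total, split evenly across two caches.
per_destination_rate = 150_000 / 60 / 2  # 1250 datapoints per second

for size in (15_000, 100_000):
    print(size, seconds_until_full(size, per_destination_rate))
```

Under these assumptions the queue holds about 12 seconds of traffic at 15000 and about 80 seconds at 100000, before flow control or drops come into play.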

satish-chef commented 4 years ago

Thanks a ton for your response, @deniszh. Your solution worked like a magic bullet and eliminated the metric drop issue for me. I had to tweak the values of MAX_UPDATES_PER_SECOND and MAX_CREATES_PER_MINUTE to suit my traffic. I also want to keep the cache size at a minimum, if not zero, as suggested in this blog (https://nav.uninett.no/doc/4.8/faq/graph_gaps.html). BTW, let me know what you think about this blog. Is it advisable to keep the cache size at zero, or at the minimum value possible?

satish-chef commented 4 years ago

Also, @deniszh, it would be helpful if you could suggest the optimum carbon.conf settings in case the metric traffic increases to 25K. Right now it's 12K for our infrastructure. I need suggestions for the settings below:

[cache]
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 20000
MAX_CREATES_PER_MINUTE = 200000
[cache:b]
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 20000
MAX_CREATES_PER_MINUTE = 200000
[relay]
MAX_DATAPOINTS_PER_MESSAGE = 4000
MAX_QUEUE_SIZE = 150000

Thanks in advance !

satish-chef commented 4 years ago

Hello @deniszh @piotr1212 ,

Can you please suggest config changes to my carbon.conf that would help reduce the carbon cache size? I have tried many values of MAX_DATAPOINTS_PER_MESSAGE and MAX_QUEUE_SIZE in the [relay] section. But the "carbon cache size" and "carbon cache queue" are still high.

Below is the content of /opt/graphite/conf/carbon.conf on the EC2 instance:

[cache]
WHISPER_FALLOCATE_CREATE = True
WHISPER_FADVISE_RANDOM = False

GRAPHITE_ROOT = /opt/graphite
GRAPHITE_CONF_DIR = /opt/graphite/conf
GRAPHITE_STORAGE_DIR = /opt/graphite/storage
LOCAL_DATA_DIR = /opt/graphite/storage/whisper
USER = apache
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 20000
MAX_CREATES_PER_MINUTE = 200000

ENABLE_TCP_LISTENER = True
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2103

ENABLE_UDP_LISTENER = True
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 2103

ENABLE_PICKLE_LISTENER = True
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2104

USE_INSECURE_UNPICKLER = False

CACHE_QUERY_INTERFACE = 0.0.0.0
CACHE_QUERY_PORT = 7002

USE_FLOW_CONTROL = True
LOG_UPDATES = False

WHISPER_AUTOFLUSH = False

ENABLE_MANHOLE = False
# Example: store everything
# BIND_PATTERNS = #

[cache:b]
WHISPER_FALLOCATE_CREATE = True
WHISPER_FADVISE_RANDOM = False

GRAPHITE_ROOT = /opt/graphite
GRAPHITE_CONF_DIR = /opt/graphite/conf
GRAPHITE_STORAGE_DIR = /opt/graphite/storage
LOCAL_DATA_DIR = /opt/graphite/storage/whisper
USER = apache
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 20000
MAX_CREATES_PER_MINUTE = 200000
CACHE_QUERY_PORT = 7102

ENABLE_TCP_LISTENER = True
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2203

ENABLE_UDP_LISTENER = True
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 2203

ENABLE_PICKLE_LISTENER = True
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2204

USE_INSECURE_UNPICKLER = False

CACHE_QUERY_INTERFACE = 0.0.0.0

USE_FLOW_CONTROL = True
LOG_UPDATES = False

WHISPER_AUTOFLUSH = False

ENABLE_MANHOLE = False
# Example: store everything
# BIND_PATTERNS = #

[relay]
USER = apache
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2004
MAX_DATAPOINTS_PER_MESSAGE = 2800
MAX_QUEUE_SIZE = 220000
# Set this to False to drop datapoints when any send queue (sending datapoints
# to a downstream carbon daemon) hits MAX_QUEUE_SIZE. If this is True (the
# default) then sockets over which metrics are received will temporarily stop accepting
# data until the send queues fall below 80% MAX_QUEUE_SIZE.
USE_FLOW_CONTROL = True
# Local relay sharding to multiple carbon-cache on localhost, we only support
# sharding in this config.  Change $targets if needed
RELAY_METHOD = consistent-hashing
REPLICATION_FACTOR = 1
DESTINATIONS = 127.0.0.1:2104,localhost:2204

Below is the graph for carbon cache queue and carbon cache size:

image

I am trying to reduce the carbon cache size to zero, as suggested in the blog linked above.

Also, I would highly appreciate it if you could reply to my earlier query. Thanks in advance!

ploxiln commented 4 years ago

I don't think that document suggests reducing the carbon cache size to zero. I think this is the relevant section:

If the carbon-cache daemon (or daemons, if you have configured multiple) is unable to write data to your storage medium at a fast enough rate, its internal cache will be saturated, and it will start to drop incoming metrics. This will typically happen if the volume and rate of incoming metrics is larger than your I/O subsystem can support writing.

...

This graph shows the relationship between incoming data points, and datapoints committed to disk, while superimposing the size of the internal cache on top. You should be able to quickly identify any capacity issues here: The rate if incoming data points is continuously higher than the rate of committed points, and the cache size is ever-increasing (until it at some points hits the max cache size, configured in carbon.conf).

...

The only way around this is to scale up your Graphite infrastructure. You can add faster drives (solid state drives aren’t a bad idea), or set up a cluster of multiple Graphite servers.

I think that if the carbon cache size is not continuously growing, and it is able to keep up with writing metrics to disk and not dropping any, then all is OK.
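
As a rough illustration of that criterion, here is a small Python sketch that flags a cache whose size rises on every sample. The sample values are made up for the example; in practice they would come from whatever cache-size series you already graph.

```python
def is_continuously_growing(cache_sizes):
    """True if every sample is strictly larger than the one before it."""
    return len(cache_sizes) > 1 and all(
        later > earlier for earlier, later in zip(cache_sizes, cache_sizes[1:])
    )


# Hypothetical per-minute cache size samples, for illustration only.
saturating = [10_000, 14_000, 19_000, 27_000, 40_000]
healthy = [30_000, 28_500, 31_000, 29_000, 30_500]

print(is_continuously_growing(saturating))  # keeps climbing: capacity problem
print(is_continuously_growing(healthy))     # fluctuates: keeping up
```

A strict every-sample test is deliberately simple; a real check would probably tolerate noise by comparing averages over longer windows.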

satish-chef commented 4 years ago

Thanks @ploxiln, I get your point. I have usually observed that the committed metrics increase in the next minute when the carbon cache is high. This causes a delay of 1 minute for a few metrics, which is what I was aiming to eliminate.

Anyway, if you could suggest the carbon configs that I asked about in my comment above, I would really appreciate it.

deniszh commented 4 years ago

@satish-chef :

But the "carbon cache size" and "carbon cache queue" are still high.

I do not see any high usage. You have 150K metrics/min and 30K of them in cache, so about 12 seconds' worth? IMO that's quite a good result. If you want to improve it, try setting MAX_CACHE_SIZE to some sane value (50K, 100K, or 200K; pick one), then increase MAX_UPDATES_PER_SECOND until the cache size stops decreasing, which means you have reached the limit of your disk performance. Of course, you'll need to restart the carbon caches each time to apply the parameters and wait a while for the queue size to stabilize, so I don't know whether the effort is worth it.
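
The "12 seconds" estimate above is just the cached datapoint count divided by the per-second ingest rate. A short Python sketch of that arithmetic, with a function name chosen here for illustration:

```python
def cache_backlog_seconds(cached_datapoints, datapoints_per_minute):
    """How many seconds of incoming traffic the cached datapoints represent."""
    per_second = datapoints_per_minute / 60
    return cached_datapoints / per_second


# Figures from the comment above: 30K datapoints cached, 150K metrics/min.
print(cache_backlog_seconds(30_000, 150_000))  # 12.0
```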

satish-chef commented 4 years ago

Hi @deniszh, I tried changing MAX_CACHE_SIZE to 50K and then, 2 minutes later, to 100K. As a result, the carbon cache dropped to zero, and so did the number of metrics. This really backfired, so I am stopping my optimisation efforts for now and closing this issue, as things look stable on the Graphite server. Big thanks to everyone in this thread who helped me out.