satish-chef closed this issue 4 years ago
Hi @satish-chef ,
MAX_QUEUE_SIZE = 15000 looks very small. Try 100000.
Thanks a ton for your response, @deniszh. Your solution worked like a magic bullet and eliminated the metric drop issue for me. I had to tweak the values of MAX_UPDATES_PER_SECOND and MAX_CREATES_PER_MINUTE to suit my traffic. I also wanted to keep the cache size at a minimum, if not zero, as suggested in this blog (https://nav.uninett.no/doc/4.8/faq/graph_gaps.html). By the way, let me know what you think about this blog. Is it advisable to keep the cache size at zero, or at the minimum value possible?
Also, @deniszh, it would be helpful if you could suggest the optimum settings in carbon.conf in case the metric traffic increases to 25K. Right now it's 12K for our infra. I would need suggestions for the settings below:
[cache]
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 20000
MAX_CREATES_PER_MINUTE = 200000
[cache:b]
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 20000
MAX_CREATES_PER_MINUTE = 200000
[relay]
MAX_DATAPOINTS_PER_MESSAGE = 4000
MAX_QUEUE_SIZE = 150000
Thanks in advance !
Hello @deniszh @piotr1212 ,
Can you please suggest config changes to my carbon.conf that would help me reduce the carbon cache size? I have tried many settings of MAX_DATAPOINTS_PER_MESSAGE and MAX_QUEUE_SIZE in the [relay] section, but the "carbon cache size" and "carbon cache queue" are still high.
Below is the content of /opt/graphite/conf/carbon.conf on the EC2 instance:
[cache]
WHISPER_FALLOCATE_CREATE = True
WHISPER_FADVISE_RANDOM = False
GRAPHITE_ROOT = /opt/graphite
GRAPHITE_CONF_DIR = /opt/graphite/conf
GRAPHITE_STORAGE_DIR = /opt/graphite/storage
LOCAL_DATA_DIR = /opt/graphite/storage/whisper
USER = apache
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 20000
MAX_CREATES_PER_MINUTE = 200000
ENABLE_TCP_LISTENER = True
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2103
ENABLE_UDP_LISTENER = True
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 2103
ENABLE_PICKLE_LISTENER = True
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2104
USE_INSECURE_UNPICKLER = False
CACHE_QUERY_INTERFACE = 0.0.0.0
CACHE_QUERY_PORT = 7002
USE_FLOW_CONTROL = True
LOG_UPDATES = False
WHISPER_AUTOFLUSH = False
ENABLE_MANHOLE = False
# Example: store everything
# BIND_PATTERNS = #
[cache:b]
WHISPER_FALLOCATE_CREATE = True
WHISPER_FADVISE_RANDOM = False
GRAPHITE_ROOT = /opt/graphite
GRAPHITE_CONF_DIR = /opt/graphite/conf
GRAPHITE_STORAGE_DIR = /opt/graphite/storage
LOCAL_DATA_DIR = /opt/graphite/storage/whisper
USER = apache
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 20000
MAX_CREATES_PER_MINUTE = 200000
CACHE_QUERY_PORT = 7102
ENABLE_TCP_LISTENER = True
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2203
ENABLE_UDP_LISTENER = True
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 2203
ENABLE_PICKLE_LISTENER = True
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2204
USE_INSECURE_UNPICKLER = False
CACHE_QUERY_INTERFACE = 0.0.0.0
USE_FLOW_CONTROL = True
LOG_UPDATES = False
WHISPER_AUTOFLUSH = False
ENABLE_MANHOLE = False
# Example: store everything
# BIND_PATTERNS = #
[relay]
USER = apache
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2004
MAX_DATAPOINTS_PER_MESSAGE = 2800
MAX_QUEUE_SIZE = 220000
# Set this to False to drop datapoints when any send queue (sending datapoints
# to a downstream carbon daemon) hits MAX_QUEUE_SIZE. If this is True (the
# default) then sockets over which metrics are received will temporarily stop accepting
# data until the send queues fall below 80% MAX_QUEUE_SIZE.
USE_FLOW_CONTROL = True
# Local relay sharding to multiple carbon-cache on localhost, we only support
# sharding in this config. Change $targets if needed
RELAY_METHOD = consistent-hashing
REPLICATION_FACTOR = 1
DESTINATIONS = 127.0.0.1:2104,localhost:2204
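The USE_FLOW_CONTROL comment in the [relay] section above describes a pause-and-resume behaviour: stop accepting data when a send queue reaches MAX_QUEUE_SIZE, and resume once it falls below 80% of that size. The following Python sketch only illustrates that idea; it is not carbon's actual implementation, and the class and method names (SendQueue, offer, drain) are hypothetical.

```python
from collections import deque


class SendQueue:
    """Toy send queue with pause/resume thresholds (illustrative only)."""

    def __init__(self, max_queue_size, resume_fraction=0.8):
        self.max_queue_size = max_queue_size
        # Receiving resumes only once the queue drops below this size.
        self.resume_threshold = max_queue_size * resume_fraction
        self.items = deque()
        self.paused = False

    def offer(self, datapoint):
        """Try to enqueue a datapoint; return False while receiving is paused."""
        if self.paused:
            return False
        self.items.append(datapoint)
        if len(self.items) >= self.max_queue_size:
            self.paused = True  # queue is full: stop accepting new data
        return True

    def drain(self, count):
        """Send up to `count` datapoints downstream and maybe resume receiving."""
        sent = [self.items.popleft() for _ in range(min(count, len(self.items)))]
        if self.paused and len(self.items) < self.resume_threshold:
            self.paused = False
        return sent
```

In this sketch, a queue created with max_queue_size=220000 would pause at 220000 queued datapoints and resume once fewer than 176000 remain, which is why a larger MAX_QUEUE_SIZE gives the relay more headroom before senders are held back.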
Below is the graph for carbon cache queue and carbon cache size:
I am trying to reduce carbon cache size to zero as suggested in this blog.
Also, I would highly appreciate it if you could reply to my earlier query. Thanks in advance!
I don't think that document suggests reducing the carbon cache size to zero. I think this is the relevant section:
If the carbon-cache daemon (or daemons, if you have configured multiple) is unable to write data to your storage medium at a fast enough rate, its internal cache will be saturated, and it will start to drop incoming metrics. This will typically happen if the volume and rate of incoming metrics is larger than your I/O subsystem can support writing.
...
This graph shows the relationship between incoming data points, and datapoints committed to disk, while superimposing the size of the internal cache on top. You should be able to quickly identify any capacity issues here: The rate of incoming data points is continuously higher than the rate of committed points, and the cache size is ever-increasing (until it at some point hits the max cache size, configured in carbon.conf).
...
The only way around this is to scale up your Graphite infrastructure. You can add faster drives (solid state drives aren’t a bad idea), or set up a cluster of multiple Graphite servers.
I think that if the carbon cache size is not continuously growing, and it is able to keep up with writing metrics to disk and not dropping any, then all is OK.
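As a rough illustration of that check, the sketch below flags saturation only when incoming points exceed committed points in every sample and the cache size rises in every sample. The function name and the three input series are hypothetical; in practice they would come from whatever received, committed, and cache-size series you already graph.

```python
from typing import Sequence


def cache_saturating(
    received: Sequence[float],
    committed: Sequence[float],
    cache_size: Sequence[float],
) -> bool:
    """Return True if the cache looks like it is continuously falling behind."""
    if not (len(received) == len(committed) == len(cache_size)):
        raise ValueError("all series must have the same length")
    if len(cache_size) < 2:
        return False  # not enough samples to see a trend
    # Incoming rate is higher than the committed rate in every interval.
    falling_behind = all(r > c for r, c in zip(received, committed))
    # Cache size is strictly larger than in the previous sample every time.
    growing = all(b > a for a, b in zip(cache_size, cache_size[1:]))
    return falling_behind and growing
```

A cache that fluctuates around a stable level, as in the graphs discussed here, would not be flagged by this check.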
Thanks @ploxiln, I get your point. I have usually observed that when the carbon cache is high, the committed metrics increase in the next minute. This causes a delay of 1 minute for a few metrics, which I was aiming to eliminate.
Anyway, if you could suggest the carbon configs that I asked about in the comment above, I would really appreciate it.
@satish-chef :
But still the "carbon cache size" and "carbon cache queue" is still high.
I do not see any high usage. You have 150K metrics/min and 30K of them in cache, so about 12 seconds' worth of them? IMO that's quite a good result, but if you want to improve it, try setting MAX_CACHE_SIZE to some sane value (50K, 100K, or 200K, pick one), then increase MAX_UPDATES_PER_SECOND until the cache size stops decreasing, which means that you have reached the limit of your disk performance. Of course, you'll need to restart the carbon caches every time to apply the parameters and wait some time for the queue size to stabilize, so I don't know whether all that effort is still worth it.
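The "12 seconds" estimate above is the cache size divided by the per-second incoming rate. A minimal Python sketch of that arithmetic follows; the function name is hypothetical, and the figures are the ones quoted in this thread.

```python
def cache_lag_seconds(cached_points: int, points_per_minute: int) -> float:
    """Seconds of incoming traffic currently held in the cache."""
    if points_per_minute <= 0:
        raise ValueError("points_per_minute must be positive")
    points_per_second = points_per_minute / 60.0
    return cached_points / points_per_second


# 30K cached points at 150K points/min is 30000 / 2500 = 12 seconds.
print(cache_lag_seconds(30_000, 150_000))
```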
Hi @deniszh, I tried changing MAX_CACHE_SIZE to 50K and then to 100K after 2 minutes. The result was that the carbon cache dropped to zero, and so did the number of metrics. This really backfired. So I am stopping my optimisation efforts for now and closing this issue, as things look stable on the Graphite server. Big thanks to everyone in this thread who helped me out.
I am running a Graphite server in production, to which data is sent from 10 AWS regions using carbon-relay-ng. The whisper version is 0.9.12 and the CentOS version is 6.10. The Graphite server is hosted in AWS on an EC2 instance of type c4.8xlarge (60 GB RAM and 36 CPU cores). The Graphite storage "/opt/graphite/storage" is mounted on a 6 TB GP2 EBS volume with 16000 IOPS.
Below is the content of /opt/graphite/conf/carbon.conf on the Graphite server:
Below is the Graph of committed points:
Below is the Graph of metrics received for both carbon-caches:
Below are some other graphs that might help you assess the situation:
I am observing drops across all metrics. The frequency is 12-20 drops every 12 hours; for some metrics it is much lower, around 2 every 12 hours. Can someone please suggest configuration settings for carbon-cache and/or carbon-relay that can help me eliminate the metric drop issue? Thanks in advance!