graphite-project / carbon

Carbon is one of the components of Graphite, and is responsible for receiving metrics over the network and writing them down to disk using a storage backend.
http://graphite.readthedocs.org/
Apache License 2.0

Graphite Crashing #537

Closed aolgin closed 8 years ago

aolgin commented 8 years ago

Our Graphite instance has recently started crashing semi-frequently, and I've been unable to determine the root cause. The system becomes very sluggish and most, if not all, of our metrics stop being written during this period.

After some investigating, I noticed that /var/log/carbon/console.log had ballooned to roughly 85 GB. The majority of the file was filled with: Could not accept new connection (EMFILE)
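
For reference, in case it helps anyone else hitting this: EMFILE means the process has run out of file descriptors. A rough way to confirm is to compare the carbon-cache process's current descriptor count against the limit it actually got, along the lines of the sketch below (assuming a single carbon-cache instance whose command line matches carbon-cache.py):

PID=$(pgrep -f carbon-cache.py | head -n 1)
sudo ls /proc/$PID/fd | wc -l                    # descriptors currently open
sudo grep 'Max open files' /proc/$PID/limits     # the limit the running process actually has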

I initially double-checked that log rotation was set up, and decided to create a separate logrotate rule that is a bit stricter than the existing one:

/var/log/carbon/console.log {
        size 1G
        daily
        missingok
        rotate 5
        compress
        notifempty
        sharedscripts
        nocreate
}
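
To check that a rule like this actually fires, logrotate's debug flag does a dry run without touching any files (a sketch; /etc/logrotate.d/carbon-console is just the file name I'd give it):

# dry run: report what logrotate would do, without rotating anything
sudo logrotate -d /etc/logrotate.d/carbon-console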

As a temporary fix, moving or removing console.log and restarting carbon-cache seemed to work initially, but the issue keeps recurring roughly a day or two after doing that.

I dug more into the connection error showing up in console.log and found that, according to the Graphite docs, I needed to increase the maximum number of open files with ulimit or by editing /etc/security/limits.conf. I tried setting 'nofile' to 8192, as recommended, but carbon-cache proceeded to hang again and no metrics came in.

Reading over console.log again, it now had quite a few instances of

exceptions.IOError: [Errno 24] Too many open files: '/var/lib/graphite/whisper/systems/somehost/something.wsp'

So, as a middle ground, I've temporarily set the nofile limit for the '_graphite' user to 4096. We'll see how that works.
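
For anyone following along, the corresponding entries in /etc/security/limits.conf look roughly like this (a sketch with the 4096 value I'm trying; carbon-cache needs a restart before it takes effect, and depending on how the init script starts the daemon, the limit may need to be raised there instead):

# /etc/security/limits.conf -- raise the open-file limit for the carbon user
_graphite   soft   nofile   4096
_graphite   hard   nofile   4096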

I've considered looking into carbon-relay or aggregation rules, but haven't really been able to figure those out and am not sure they'll be necessary.

Our current setup is an Ubuntu 14.04 AWS instance running graphite-carbon 0.9.12-3. I'll attach my carbon.conf in a follow-up comment. We have roughly 30 machines reporting into this Graphite instance, which also hosts Grafana and Graphite-Beacon. We make use of carbon.conf, storage-schemas.conf, and storage-aggregation.conf, but none of the other config files.

Carbon crashed for the third time in a week this morning, and I'd really like to get this resolved permanently. Constantly moving/removing the console.log file or even symlinking it to /dev/null just seems like a temporary fix or a cop-out to me, and an actual fix would be greatly preferred. Any help would be appreciated!

On what may just be a side note: I've also noticed that, when I shut down or try to restart carbon-cache, this message shows up in console.log:

Warning: No permission to delete pid file

When I researched that, I was told to change the owner of /var/run/carbon-cache.pid to _graphite, but the warning still shows up (although I doubt that's the root problem here). Any ideas?
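
One workaround I've seen suggested (haven't tried it yet) is to give the carbon user a PID_DIR it owns, so it can remove the pid file on shutdown. A sketch, where /var/run/carbon is just a directory name I picked (and since /var/run is a tmpfs on Ubuntu 14.04, it would need to be recreated at boot, e.g. by the init script):

sudo mkdir -p /var/run/carbon
sudo chown _graphite /var/run/carbon
# then point carbon at it in /etc/carbon/carbon.conf:
#   PID_DIR = /var/run/carbon/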

aolgin commented 8 years ago

Here's my carbon.conf (minus extra comments):

[cache]
STORAGE_DIR    = /opt/graphite/storage/whisper/
CONF_DIR       = /etc/carbon/
LOG_DIR        = /var/log/carbon/
PID_DIR        = /var/run/

LOCAL_DATA_DIR = /var/lib/graphite/whisper/
ENABLE_LOGROTATION = True
USER = _graphite
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = inf
MAX_CREATES_PER_MINUTE = inf
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
ENABLE_UDP_LISTENER = False
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2004
LOG_LISTENER_CONNECTIONS = True
USE_INSECURE_UNPICKLER = False
CACHE_QUERY_INTERFACE = 0.0.0.0
CACHE_QUERY_PORT = 7002
USE_FLOW_CONTROL = True
LOG_UPDATES = False
LOG_CACHE_HITS = False
LOG_CACHE_QUEUE_SORTS = True
CACHE_WRITE_STRATEGY = sorted
WHISPER_AUTOFLUSH = False
WHISPER_FALLOCATE_CREATE = True

[relay]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2013
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2014
LOG_LISTENER_CONNECTIONS = True
RELAY_METHOD = rules
REPLICATION_FACTOR = 1
DESTINATIONS = 127.0.0.1:2004
MAX_DATAPOINTS_PER_MESSAGE = 500
MAX_QUEUE_SIZE = 10000
USE_FLOW_CONTROL = True

[aggregator]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2023
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2024
LOG_LISTENER_CONNECTIONS = True
FORWARD_ALL = True
DESTINATIONS = 127.0.0.1:2004
REPLICATION_FACTOR = 1
MAX_QUEUE_SIZE = 100000
USE_FLOW_CONTROL = True
MAX_DATAPOINTS_PER_MESSAGE = 500
MAX_AGGREGATION_INTERVALS = 5

aolgin commented 8 years ago

I believe I've found the issue. The person who originally set up our Graphite instance had a few separate PHP scripts running on the side that queried some of our databases. It turns out he gave them an overly long timeout and had them running very frequently.

I haven't really touched those scripts themselves, so they could perhaps use some fixing up anyway.

Going to make some changes there, wait a while and see how it goes, then update this thread.

aolgin commented 8 years ago

Apologies for the late update. It turns out the PHP scripts were just a red herring and were only an aftereffect of the real cause.

The cause is a small set of the boxes sending metrics to our Graphite host via Statsd. They are configured to send metrics in 20-second bursts, and something is causing those connections to be left open rather than closed, so we're accumulating 3 new connections per minute, per box. That quickly adds up to hitting the open-connection limit on the Graphite host.
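
To figure out which boxes are holding connections open, something like this should show established connections to the line receiver grouped by remote host (a sketch, assuming the Statsd boxes come in over the plaintext port 2003 from my carbon.conf):

# count ESTABLISHED connections to the plaintext listener, grouped by remote address
sudo netstat -tn | awk '$4 ~ /:2003$/ && $6 == "ESTABLISHED" {split($5, a, ":"); print a[1]}' \
    | sort | uniq -c | sort -rn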

I'm going to investigate this further and update once I have a solution.

obfuscurity commented 8 years ago

@aolgin As it's been a while since the last update I'm going to close this now, but feel free to update us if any further action is required.