grobian / carbon-c-relay

Enhanced C implementation of Carbon relay, aggregator and rewriter
Apache License 2.0

Top level relay dropping metrics #196

Closed: druchoo closed this issue 5 months ago

druchoo commented 8 years ago

Graphite cluster setup as follows:

graphite-web-0.9.15-1.el6.noarch
carbon-c-relay-2.1-1.el6.x86_64
python-carbon-0.9.15-1.el6.noarch


/usr/bin/carbon-c-relay -P /var/run/carbon-c-relay/carbon-c-relay.pid -D -B 512 -S 10 -m -p 2003 -w 8 -b 2500 -q 25000 -l /var/log/carbon-c-relay/carbon-c-relay.log -f /etc/carbon-c-relay.conf

Relay1:

cluster my_cluster
  carbon_ch replication 2
    10.1.2.10:2003
    10.1.2.11:2003
    10.1.2.12:2003
  ;
match
    ^BadMetric1\.
    ^BadMetric2\.
    ^BadMetric3\.
    ^BadMetric4\.
    ^BadMetric5\.
  send to blackhole
  stop;
match *
  send to
    my_cluster
  stop
  ;
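For illustration (these metric names are hypothetical), a line such as BadMetric1.requests.count matches the first rule and goes to the blackhole, after which processing stops; any other metric falls through to the catch-all and, because of replication 2, is carbon_ch-hashed onto two of the three graphite hosts:

BadMetric1.requests.count 12 1467400000   -> blackhole, stop
web.host1.cpu.user 3 1467400000           -> two of {10.1.2.10,10.1.2.11,10.1.2.12}:2003, stop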

Graphite{1..3}: carbon-c-relay.conf

cluster local_carbon
  carbon_ch
    127.0.0.1:2103=a
    127.0.0.1:2203=b
    127.0.0.1:2303=c
    127.0.0.1:2403=d
    127.0.0.1:2503=e
    127.0.0.1:2603=f
    127.0.0.1:2703=g
    127.0.0.1:2803=h
  ;
rewrite ^foo\.(.+)\.(.+)\.([a-zA-Z]+)([0-9]+)
    into bar.\1.\2.\3.\3\4
    ;
{54 additional rewrite rules}
match *
  send to
    local_carbon
  stop
  ;
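To illustrate the first rewrite rule above with a hypothetical metric: foo.dc1.web.srv42 would capture \1=dc1, \2=web, \3=srv, \4=42 and come out as bar.dc1.web.srv.srv42.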

carbon.conf

[cache]
STORAGE_DIR    = /var/lib/carbon/
LOCAL_DATA_DIR = /var/lib/carbon/whisper/
WHITELISTS_DIR = /etc/carbon/
CONF_DIR       = /etc/carbon/
LOG_DIR        = /var/log/carbon/
PID_DIR        = /var/run/carbon/
ENABLE_LOGROTATION = True
USER = carbon
MAX_CACHE_SIZE = 1000000
MAX_UPDATES_PER_SECOND = 500
MAX_UPDATES_PER_SECOND_ON_SHUTDOWN = inf
MAX_CREATES_PER_MINUTE = 5000
LINE_RECEIVER_INTERFACE = 127.0.0.1
LINE_RECEIVER_PORT = 2103
ENABLE_UDP_LISTENER = False
PICKLE_RECEIVER_INTERFACE = 127.0.0.1
PICKLE_RECEIVER_PORT = 2104
LOG_LISTENER_CONNECTIONS = True
USE_INSECURE_UNPICKLER = False
CACHE_QUERY_INTERFACE = 127.0.0.1
CACHE_QUERY_PORT = 7102
USE_FLOW_CONTROL = False
LOG_UPDATES = True
LOG_CACHE_HITS = True
LOG_CACHE_QUEUE_SORTS = True
CACHE_WRITE_STRATEGY = naive
WHISPER_AUTOFLUSH = False
WHISPER_FALLOCATE_CREATE = True
USE_WHITELIST = True
CARBON_METRIC_PREFIX = carbon
CARBON_METRIC_INTERVAL = 10
[cache:b]
LINE_RECEIVER_PORT = 2203
PICKLE_RECEIVER_PORT = 2204
CACHE_QUERY_PORT = 7202
[cache:c]
LINE_RECEIVER_PORT = 2303
PICKLE_RECEIVER_PORT = 2304
CACHE_QUERY_PORT = 7302
[cache:d]
LINE_RECEIVER_PORT = 2403
PICKLE_RECEIVER_PORT = 2404
CACHE_QUERY_PORT = 7402
[cache:e]
LINE_RECEIVER_PORT = 2503
PICKLE_RECEIVER_PORT = 2504
CACHE_QUERY_PORT = 7502
[cache:f]
LINE_RECEIVER_PORT = 2603
PICKLE_RECEIVER_PORT = 2604
CACHE_QUERY_PORT = 7602
[cache:g]
LINE_RECEIVER_PORT = 2703
PICKLE_RECEIVER_PORT = 2704
CACHE_QUERY_PORT = 7702
[cache:h]
LINE_RECEIVER_PORT = 2803
PICKLE_RECEIVER_PORT = 2804
CACHE_QUERY_PORT = 7802
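Note the one-to-one mapping between the relay's carbon_ch members and the cache instances, e.g. 127.0.0.1:2203=b in carbon-c-relay.conf feeds [cache:b] (LINE_RECEIVER_PORT = 2203).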

All hosts are 8-core. Graphite{1..3} have 12,000 IOPS each.

The cluster is doing ~110k metricsReceived on Relay1, which is consistently dropping ~40k metrics and stalling ~150. Graphite{1..3} have more than enough CPU and IOPS to spare, so I'm at a loss as to why Relay1 is dropping metrics.

If the queue on Relay1 is increased to something like '-q 5000000', the queue fills to capacity and then starts dropping the same amount of metrics.

I've experimented with the following carbon.conf settings, with little to no change other than an increase/decrease in the Graphite{1..3} carbon-cache queues, updateOperations, and pointsPerUpdates:

MAX_CACHE_SIZE
MAX_UPDATES_PER_SECOND
USE_FLOW_CONTROL
CACHE_WRITE_STRATEGY 

Is this a misconfiguration? Any help would be appreciated.

grobian commented 8 years ago

Just to be sure, do you see the drops on relay1, or on graphite{1..3}?

druchoo commented 8 years ago

Drops are on Relay1.

grobian commented 8 years ago

What are the stats of the relays on graphite{1..3}? How are their queues?

Try setting the maxstalls on the graphite{1..3} relays to 0 and increasing their queue size. You shouldn't see drops on relay1 then, but on the graphite{1..3} relays instead.

druchoo commented 8 years ago

Here are the carbon-c-relay stats. The graphite{1..3} queues average ~700 each.

Last 12h dashboard of carbon.relays.*.{metric,*connect}* (screenshot, 2016-07-02)

Queues for graphite{1..3} were actually set to 65536, but I don't think that matters since no metrics are being dropped there.

I've changed max stalls to 0 on graphite{1..3} as suggested.

/usr/bin/carbon-c-relay -D -L 0 -B 512 -S 10 -m -p 2003 -w 8 -b 2500 -q 65536

I'll let that run for a bit and post back the same dashboard. Let me know if you're interested in any stats other than what was provided.

druchoo commented 8 years ago

@grobian,

There is no change in dropped metrics after setting -L 0 on the graphite{1..3} relays. The graphs are exactly the same.

grobian commented 8 years ago

can you tell me what the network speed of relay1 is?

druchoo commented 8 years ago

Relay1 is an EC2 m3.2xlarge instance and graphite{1..3} are m4.2xlarge. Both instance types are rated for "high" network performance, so anywhere from 100 Mbps to 1.86 Gbps depending on where you look. Here's the network throughput as reported by collectd for the last 24h:

(collectd network throughput screenshot, 2016-07-04)

grobian commented 8 years ago

OK, it could be that a throughput maximum has been reached. I don't know if you can easily test this with the live stream, but if you set up a netcat listener sending to /dev/null on graphite{1..3}, we could see whether relay1 still has issues getting rid of the data.

druchoo commented 8 years ago

Haven't had time to test your suggestion yet. I don't suppose it would be possible to log why metrics are being dropped or stalled? Or is there only one reason, from carbon-c-relay's perspective, why that would occur?

druchoo commented 8 years ago

OK, finally got some time to test with nc.

On graphite{1..3}: nc -k -l 2003 >/dev/null

On relay1, in debug mode: /usr/bin/carbon-c-relay -B 512 -S 10 -m -p 2003 -w 8 -b 2500 -q 25000 -f /etc/carbon-c-relay.conf -d

Observed the same behaviour of dropping metrics, and noticed the following errors:

[2016-07-07 20:17:44] failed to write() to x.x.x.x:2003: Resource temporarily unavailable
[2016-07-07 20:17:50] failed to write() to x.x.x.x:2003: Connection reset by peer
[2016-07-07 20:18:00] failed to write() to x.x.x.x:2003: Resource temporarily unavailable
[2016-07-07 20:18:03] failed to write() to x.x.x.x:2003: Connection reset by peer
[2016-07-07 20:18:05] failed to write() to x.x.x.x:2003: Resource temporarily unavailable
[2016-07-07 20:18:13] failed to write() to x.x.x.x:2003: Connection reset by peer
[2016-07-07 20:18:18] failed to write() to x.x.x.x:2003: Connection reset by peer
[2016-07-07 20:18:24] failed to write() to x.x.x.x:2003: Resource temporarily unavailable

Do these errors indicate anything specific?

druchoo commented 8 years ago

@grobian,

I've tested network throughput with iperf3 between the relay and graphite hosts and was able to reach 1 Gbps.

On a hunch I downgraded to carbon-c-relay v1.10 (2016-03-15). This version seems to be much more stable in my case: I'm seeing at most ~10 drops (down from 40k), though admittedly more stalls (~500-2k).

grobian commented 8 years ago

Resource temporarily unavailable

That's the important bit: it means no new connection can be made. Can you check how many file descriptors are associated with the process? E.g. cat /proc/<pid>/limits will tell you the limit. My hunch: you just need more max open files.
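For example, a minimal check (assuming a single relay process so pidof returns one PID):

PID=$(pidof carbon-c-relay)            # assumes exactly one relay process
grep 'open files' /proc/$PID/limits    # soft/hard limit on open files
ls /proc/$PID/fd | wc -l               # descriptors currently in use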

druchoo commented 8 years ago

Hi @grobian ,

Thanks for the suggestion. Unfortunately those errors were due to me running nc as the wrong user; the graphite user has open files set quite high (500k). I reran the nc test as the correct user and did not get any errors.

As another troubleshooting step I tried go-carbon (https://github.com/lomik/go-carbon), which has resolved the issue. It replaced both carbon-c-relay and python-carbon on graphite{1..3}, so it's unclear whether carbon-c-relay or python-carbon was the issue in my setup.

Thoughts?

grobian commented 8 years ago

OK, so nc showed no errors, which means the relay /can/ push the data out in time. If go-carbon can handle your load (it seems it can handle more), then it looks like the problem is python-carbon's performance. From my personal experience, python-carbon wasn't doing too well, which is why I wrote carbonserver to take the read part away and relieve some of its load.

druchoo commented 8 years ago

You haven't experienced any issues with carbon-c-relay -> carbon-c-relay? If not, I'm happy to use go-carbon and close this issue out :-)

Thanks much again for your help.

grobian commented 8 years ago

Well, I think I did, so it needs some more investigation, although there are signs pointing at the metric-stalling behaviour, hence the flag to reduce/disable it. Just so I understand the scenario: is carbon-c-relay still used on the top-level relay with go-carbon, or did you replace it as well?

druchoo commented 8 years ago

Carbon-c-relay only on the top-level relay (Relay1), and only go-carbon on Graphite{1..3}.

s4z commented 8 years ago

I'm seeing this issue also, but only on the second-level relays. The carbon-c-relay process has plenty of file descriptors left: the limit is currently set to 102400 and carbon-c-relay is using only 40.

The top-level relay is running on an older CentOS host and has no issues with v1.10. There are three second-level relays, all on CentOS 7 with carbon-c-relay v2.1 from EPEL.

Errors are:

failed to write() to ip:port: Resource temporarily unavailable

grobian commented 8 years ago

That would be EAGAIN, which would be because of "The file descriptor is for a socket, is marked O_NONBLOCK, and write would block."

So, that seems to suggest the kernel buffer or something is full, and the write fails because of that. It shouldn't be non-blocking in the first place, so that's interesting.
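One way to check that on the sending relay would be to watch the kernel send queues toward port 2003 while drops occur; a Send-Q that stays non-zero or keeps growing would support it (sketch only, the port comes from the configs above):

ss -tnp | grep ':2003'               # Send-Q column = bytes queued in the local kernel buffer
watch -n1 "ss -tn | grep ':2003'"    # repeat every second to see whether the queue drains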

grobian commented 5 months ago

Closing due to old age. It may still happen with current code; if so, it needs a fresh investigation.