Closed: druchoo closed this issue 5 months ago.
Just to be sure, do you see the drops on relay1, or on graphite{1..3}?
Drops are on Relay1.
what are the stats of the relays on graphite{1..3}? how are their queues?
Try setting maxstalls to 0 on the graphite{1..3} relays and increasing their queue size. You should then see the drops on the graphite{1..3} relays instead of on relay1.
Here are carbon-c-relay stats. graphite{1..3} queues average ~700 each.
Last 12h: carbon.relays.*.{metric,*connect}*
Queues for graphite{1..3} were actually set to 65536, but I don't think that matters since no metrics are being dropped there.
I've changed max stalls to 0 on graphite{1..3} as suggested.
/usr/bin/carbon-c-relay -D -L 0 -B 512 -S 10 -m -p 2003 -w 8 -b 2500 -q 65536
Will let that run for a bit and post back the same dashboard. Let me know if you're interested in any stats other than what was provided.
@grobian,
There is no change in dropped metrics after implementing -L 0 on graphite{1..3} relays. The graphs are exactly the same.
can you tell me what the network speed of relay1 is?
Relay1 is an EC2 m3.2xlarge instance and graphite{1..3} are m4.2xlarge. Both instance types are rated for "high" network performance, which translates to roughly 100 Mbps to 1.86 Gbps depending on where you look. Here's the network throughput as reported by collectd for the last 24h.
ok, it could be that a throughput maximum has been reached. I don't know if you can easily test with the stream, but if you set up a netcat listener sending to /dev/null on graphite{1..3}, we could see if relay1 still has issues getting rid of the data.
Haven't had time to test your suggestion yet. I don't suppose it would be possible to log why metrics are being dropped or stalled? Or is there only one reason, from carbon-c-relay's perspective, why that would occur?
OK finally got some time to test with nc.
graphite{1..3}
nc -k -l 2003 >/dev/null
relay1 in debug
/usr/bin/carbon-c-relay -B 512 -S 10 -m -p 2003 -w 8 -b 2500 -q 25000 -f /etc/carbon-c-relay.conf -d
Observed the same behaviour of dropping metrics, and noticed the following errors:
[2016-07-07 20:17:44] failed to write() to x.x.x.x:2003: Resource temporarily unavailable
[2016-07-07 20:17:50] failed to write() to x.x.x.x:2003: Connection reset by peer
[2016-07-07 20:18:00] failed to write() to x.x.x.x:2003: Resource temporarily unavailable
[2016-07-07 20:18:03] failed to write() to x.x.x.x:2003: Connection reset by peer
[2016-07-07 20:18:05] failed to write() to x.x.x.x:2003: Resource temporarily unavailable
[2016-07-07 20:18:13] failed to write() to x.x.x.x:2003: Connection reset by peer
[2016-07-07 20:18:18] failed to write() to x.x.x.x:2003: Connection reset by peer
[2016-07-07 20:18:24] failed to write() to x.x.x.x:2003: Resource temporarily unavailable
Do these errors indicate anything specific?
@grobian,
I've tested network throughput with iperf3 between relay and graphite hosts and was able to reach 1Gbps.
On a hunch I downgraded to carbon-c-relay v1.10 (2016-03-15). This version seems to be much more stable in my case: I'm seeing at most ~10 drops (down from 40k), but granted, more stalls (~500-2k).
Resource temporarily unavailable
That's quite important: it means no new connection can be made. Can you check how many file descriptors are associated with the process? E.g. with a cat /proc/
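One way to count them (a sketch, assuming a Linux /proc filesystem; `$$`, the current shell, stands in for the relay's PID here):

```shell
# Count open file descriptors for a process via /proc.
# $$ (this shell) stands in for the relay PID; on the relay host you
# would use something like pid=$(pidof carbon-c-relay) instead.
pid=$$
ls "/proc/$pid/fd" | wc -l
# ...and the limit the process is actually running under:
grep 'Max open files' "/proc/$pid/limits"
```

Comparing the count against the "Max open files" line shows how close the process is to fd exhaustion.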
Hi @grobian,
thanks for the suggestion. Unfortunately those errors were due to me running nc as the wrong user. The graphite users have their open-files limit set quite high (500k). I reran the nc test with the correct user and did not get any errors.
As another troubleshooting step I tried go-carbon (https://github.com/lomik/go-carbon), which has resolved the issue. It replaced carbon-c-relay and python-carbon on graphite{1..3}, so it's unclear whether carbon-c-relay or python-carbon was the problem in my setup.
Thoughts?
Ok, so nc showed no errors, which means the relay /can/ push the data out in time. If go-carbon can handle your load (it seems it can handle more), then it looks like the problem is python-carbon's performance. From my personal experience, python-carbon wasn't doing too well, which is why, to relieve some of its load, I wrote carbonserver to take the read path away from it.
You haven't experienced any issues with carbon-c -> carbon-c? If not, I'm happy to use go-carbon and close this issue out :-)
Thanks much again for your help.
well, I think I did, so it needs some more investigation, although there are signs pointing at the metric-stalling behaviour, hence the flag to reduce/disable it. Just so I understand the scenario: is carbon-c-relay still used on the top-level relay with go-carbon, or did you replace it as well?
Carbon-c-relay only on top level relay (Relay1) and only go-carbon on Graphite{1..3}.
I'm seeing this issue as well, but only on the second-level relays. The carbon-c-relay process has plenty of file descriptors left: the limit is currently set to 102400 and carbon-c-relay is using only 40.
The top-level relay is running v1.10 on an older CentOS host with no issues. There are three second-level relays, all on CentOS 7 with carbon-c-relay v2.1 from EPEL.
Errors are:
failed to write() to ip:port: Resource temporarily unavailable
That would be EAGAIN, which would be because "The file descriptor is for a socket, is marked O_NONBLOCK, and write would block."
So, that seems to suggest the kernel buffer or something is full, and the write fails because of that. It shouldn't be non-blocking in the first place, so that's interesting.
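For what it's worth, that EAGAIN condition is easy to reproduce outside the relay (a standalone sketch, not from the thread, assuming GNU dd on Linux): a non-blocking write into a pipe nobody drains fails with the same "Resource temporarily unavailable" once the kernel buffer fills.

```shell
# A non-blocking writer against a full kernel buffer gets EAGAIN,
# i.e. "Resource temporarily unavailable" -- the same errno the relay
# logs when a peer's buffer stops draining.
fifo=$(mktemp -u)
mkfifo "$fifo"
sleep 5 < "$fifo" &      # hold the read end open, but never read from it
sleep 1                  # give the background reader time to open the FIFO
# The first 64 KiB fills the default pipe buffer; the next write fails.
dd if=/dev/zero of="$fifo" oflag=nonblock bs=64k count=4 2>dd.err || true
grep unavailable dd.err  # dd reports the EAGAIN failure
rm -f "$fifo" dd.err
```

The same mechanism applies to a socket whose peer stops reading: the kernel send buffer fills, and a non-blocking write() returns EAGAIN instead of blocking.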
closing due to old age; it may still happen with current code, in which case we need a fresh investigation
Graphite cluster setup as follows:

Relay1:
/usr/bin/carbon-c-relay -P /var/run/carbon-c-relay/carbon-c-relay.pid -D -B 512 -S 10 -m -p 2003 -w 8 -b 2500 -q 25000 -l /var/log/carbon-c-relay/carbon-c-relay.log -f /etc/carbon-c-relay.conf

Graphite{1..3}: carbon-c-relay.conf, carbon.conf
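The attached config contents aren't included in the thread. For readers unfamiliar with the setup, a top-level relay config for fanning out to three stores typically looks something like this minimal sketch (hostnames, ports, and cluster choice are hypothetical, not the reporter's actual files):

```
# Hypothetical carbon-c-relay.conf: consistent-hash metrics
# across the three store hosts.
cluster graphite
    carbon_ch replication 1
        graphite1:2003
        graphite2:2003
        graphite3:2003
    ;

match *
    send to graphite
    stop
    ;
```

carbon_ch keeps each metric pinned to one store (python-carbon-compatible hashing); an any_of cluster would instead load-balance without that affinity.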
All hosts are 8 core. Graphite{1..3} have 12,000 IOPS each.
Cluster is doing ~110k metricsReceived on Relay1. Relay1 is consistently dropping ~40k metrics and stalling ~150. Graphite{1..3} have more than enough CPU and IOPS to spare, so I'm at a loss as to why Relay1 is dropping metrics.
If the queue on Relay1 is increased to something like '-q 5000000', the queue fills to capacity and then starts dropping the same amount of metrics.
I've experimented with the following carbon.conf settings with little to no change, other than an increase/decrease of the Graphite{1..3} carbon-cache queues, updateOperations, and pointsPerUpdate.
Is this misconfiguration? Any help would be appreciated.