Closed ecsumed closed 5 years ago
first thing, is there anything in the log of carbon-c-relay at the point of a missing point?
No, after the startup entries which includes the config and this, there are no logs:
[2019-06-06 14:50:05] (MSG) listening on tcp4 0.0.0.0 port 2003
[2019-06-06 14:50:05] (MSG) listening on tcp6 :: port 2003
[2019-06-06 14:50:05] (MSG) listening on udp4 0.0.0.0 port 2003
[2019-06-06 14:50:05] (MSG) listening on udp6 :: port 2003
[2019-06-06 14:50:05] (MSG) listening on UNIX socket /tmp/.s.carbon-c-relay.2003
[2019-06-06 14:50:05] (MSG) starting 2 workers
[2019-06-06 14:50:05] (MSG) starting statistics collector
[2019-06-06 14:50:05] (MSG) starting servers
[2019-06-06 14:50:05] (MSG) startup sequence complete
I've checked all of the logs, syslog, go-carbon, c-relay, but nothing is out of the ordinary. My first thought was that the oom killer was being activated, but you'll notice I don't have a queue.
Don't know if it helps but I left this setup on overnight and at kept sending points at 150 second intervals. These were the times that the point went missing for all of the hosts:
1559836050 Thu Jun 6 15:47:30 UTC 2019
1559850750 Thu Jun 6 19:52:30 UTC 2019
1559860500 Thu Jun 6 22:35:00 UTC 2019
1559870250 Fri Jun 7 01:17:30 UTC 2019
1559880000 Fri Jun 7 04:00:00 UTC 2019
1559889750 Fri Jun 7 06:42:30 UTC 2019
I'll run in debug mode and see if I encounter anything.
So, every 150 seconds you send a new metric, and your graphite store is also configured to have 150 second buckets, for 1 day.
carbon-c-relay's statistics are sent out once every 60 seconds as per your configuration.
At 15:50 or 15:49 you observe a missing point, but no missing points to the statistics, or so it seems. I'm wondering if your timestamps are just at the clock boundary and occasionally just miss it, hence producing a bucket with two samples (overwriting each other).
Valid point. If that's the case, by increasing the clock boundary (interval), missing points should decrease or go away all together. I'll try it.
One last thing, before I close this, if I may. In my graphs https://imgur.com/a/mZjAroM between 15:30 and 16:30, according to the relay, no points were sent or received. Nothing abnormal in the logs and I can confirm the data ends up in Whisper. What would this be caused by? Would decreasing the statistics interval help here?
well, if your retention is 150 seconds, you best submit statistics at the same rate
statistics
submit every 150 seconds
reset counters after interval
;
@grobian I looked deeper into my timestamps after you suggested that a data point may be overwriting. You were right. It was a problem with my test data producer which was posting metrics (approximately 1-3 seconds), then waiting the 150s interval. The 1-3 seconds delay would eventually manifest in a data point being skipped. After I made the interval dynamic, the missing points went away altogether.
Apologies, for this false flag. Closing this.
I'm happy we (more: you) managed to find the cause and fix it, so we know for sure here the relay is doing nothing weird. Thanks for the update!
I'm running a simple t2.large relay with 3 c5.large go-carbons with 720k points every 150 seconds. This works fine, but I've noticed that every hour or so (this interval is not exact), at the same epoch for all of the hosts, a data point disappears.
The relay sent/received metrics also seem to stop for a brief interval that's longer than the interval for the missing data point. I've verified that the data is indeed coming during this time.
Here are some of the graphs: https://imgur.com/a/mZjAroM
My retention:
Carbon conf:
go-carbon.conf
There's ample cpu and memory on both the relay and the carbon hosts. I've also tried swapping go-carbon with carbon-cache, but the same behavior persists, so the cache doesn't seem to be the problem. What am I missing?