grobian / carbon-c-relay

Enhanced C implementation of Carbon relay, aggregator and rewriter
Apache License 2.0
380 stars 107 forks source link

Missing data point after an hour or so, relay sent/recieved reports nothing for even longer. #366

Closed ecsumed closed 5 years ago

ecsumed commented 5 years ago

I'm running a simple t2.large relay with 3 c5.large go-carbons with 720k points every 150 seconds. This works fine, but I've noticed that every hour or so (this interval is not exact), at the same epoch for all of the hosts, a data point disappears.

The relay sent/received metrics also seem to stop for a brief interval that's longer than the interval for the missing data point. I've verified that the data is indeed coming during this time.

Here are some of the graphs: https://imgur.com/a/mZjAroM

My retention:

[default]
pattern = .*
retentions = 150s:1d

Carbon conf:

cluster graphite
    fnv1a_ch
        <carbon1-ip>:2003=a
        <carbon2-ip>:2003=b
        <carbon3-ip>:2003=c
    ;

match *
    send to graphite
    ;

statistics
    reset counters after interval
    ;

go-carbon.conf

[common]
user = "carbon"
graph-prefix = "carbon.agents.{host}"
metric-endpoint = "local"
metric-interval = "1m0s"
max-cpu = 4

[whisper]
data-dir = "/mnt/whisper"
schemas-file = "/opt/graphite/conf/storage-schemas.conf"
workers = 8
max-updates-per-second = 0
max-creates-per-second = 0
hard-max-creates-per-second = false
sparse-create = false
flock = false
enabled = true
hash-filenames = true

[cache]
max-size = 1000000
write-strategy = "max"

[udp]
listen = "0.0.0.0:2003"
enabled = true
buffer-size = 0

[tcp]
listen = "0.0.0.0:2003"
enabled = true
buffer-size = 0

[pickle]
listen = "0.0.0.0:2004"
max-message-size = 67108864
enabled = true
buffer-size = 0

[carbonlink]
listen = "0.0.0.0:7002"
enabled = true
read-timeout = "30s"

[grpc]
listen = "127.0.0.1:7003"
enabled = true

[[logging]]
logger = ""
file = "/var/log/go-carbon/go-carbon.log"
level = "info"
encoding = "mixed"
encoding-time = "iso8601"
encoding-duration = "seconds"

There's ample cpu and memory on both the relay and the carbon hosts. I've also tried swapping go-carbon with carbon-cache, but the same behavior persists, so the cache doesn't seem to be the problem. What am I missing?

grobian commented 5 years ago

first thing, is there anything in the log of carbon-c-relay at the point of a missing point?

ecsumed commented 5 years ago

No, after the startup entries which includes the config and this, there are no logs:

[2019-06-06 14:50:05] (MSG) listening on tcp4 0.0.0.0 port 2003
[2019-06-06 14:50:05] (MSG) listening on tcp6 :: port 2003
[2019-06-06 14:50:05] (MSG) listening on udp4 0.0.0.0 port 2003
[2019-06-06 14:50:05] (MSG) listening on udp6 :: port 2003
[2019-06-06 14:50:05] (MSG) listening on UNIX socket /tmp/.s.carbon-c-relay.2003
[2019-06-06 14:50:05] (MSG) starting 2 workers
[2019-06-06 14:50:05] (MSG) starting statistics collector
[2019-06-06 14:50:05] (MSG) starting servers
[2019-06-06 14:50:05] (MSG) startup sequence complete

I've checked all of the logs, syslog, go-carbon, c-relay, but nothing is out of the ordinary. My first thought was that the oom killer was being activated, but you'll notice I don't have a queue.

Don't know if it helps but I left this setup on overnight and at kept sending points at 150 second intervals. These were the times that the point went missing for all of the hosts:

1559836050 Thu Jun  6 15:47:30 UTC 2019
1559850750 Thu Jun  6 19:52:30 UTC 2019
1559860500 Thu Jun  6 22:35:00 UTC 2019
1559870250 Fri Jun  7 01:17:30 UTC 2019
1559880000 Fri Jun  7 04:00:00 UTC 2019
1559889750 Fri Jun  7 06:42:30 UTC 2019

I'll run in debug mode and see if I encounter anything.

grobian commented 5 years ago

So, every 150 seconds you send a new metric, and your graphite store is also configured to have 150 second buckets, for 1 day.

carbon-c-relay's statistics are sent out once every 60 seconds as per your configuration.

At 15:50 or 15:49 you observe a missing point, but no missing points to the statistics, or so it seems. I'm wondering if your timestamps are just at the clock boundary and occasionally just miss it, hence producing a bucket with two samples (overwriting each other).

ecsumed commented 5 years ago

Valid point. If that's the case, by increasing the clock boundary (interval), missing points should decrease or go away all together. I'll try it.

One last thing, before I close this, if I may. In my graphs https://imgur.com/a/mZjAroM between 15:30 and 16:30, according to the relay, no points were sent or received. Nothing abnormal in the logs and I can confirm the data ends up in Whisper. What would this be caused by? Would decreasing the statistics interval help here?

grobian commented 5 years ago

well, if your retention is 150 seconds, you best submit statistics at the same rate

statistics
   submit every 150 seconds
   reset counters after interval
   ;
ecsumed commented 5 years ago

@grobian I looked deeper into my timestamps after you suggested that a data point may be overwriting. You were right. It was a problem with my test data producer which was posting metrics (approximately 1-3 seconds), then waiting the 150s interval. The 1-3 seconds delay would eventually manifest in a data point being skipped. After I made the interval dynamic, the missing points went away altogether.

Apologies, for this false flag. Closing this.

grobian commented 5 years ago

I'm happy we (more: you) managed to find the cause and fix it, so we know for sure here the relay is doing nothing weird. Thanks for the update!