grobian / carbon-c-relay

Enhanced C implementation of Carbon relay, aggregator and rewriter
Apache License 2.0

Relay version 3.1 missing datapoints in its metrics #292

Closed deejay1 closed 6 years ago

deejay1 commented 7 years ago

It seems the relay misses or duplicates some datapoints when sending its own statistics, which results in holes in the data. A quick tcpdump showed:

carbon.relays.relay_host.metricsDropped 38777423 1499944086
carbon.relays.relay_host.metricsDropped 38884789 1499944326
carbon.relays.relay_host.metricsDropped 39008357 1499944806
carbon.relays.relay_host.metricsDropped 39008357 1499944866
carbon.relays.relay_host.metricsDropped 39065223 1499945226
carbon.relays.relay_host.metricsDropped 39065223 1499945226
carbon.relays.relay_host.metricsDropped 39078790 1499945646
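
To make the holes and the repeated line in the capture above easier to see, here is a minimal Python sketch that parses these plaintext-protocol lines and reports duplicated timestamps and missing submissions. The 60-second interval is an assumption, inferred from the spacing between the 1499944806 and 1499944866 samples. The names `CAPTURE`, `parse`, `find_duplicates` and `find_gaps` are hypothetical helpers for this illustration, not part of carbon-c-relay.

```python
from collections import Counter

# Lines copied verbatim from the tcpdump capture in this report.
CAPTURE = """\
carbon.relays.relay_host.metricsDropped 38777423 1499944086
carbon.relays.relay_host.metricsDropped 38884789 1499944326
carbon.relays.relay_host.metricsDropped 39008357 1499944806
carbon.relays.relay_host.metricsDropped 39008357 1499944866
carbon.relays.relay_host.metricsDropped 39065223 1499945226
carbon.relays.relay_host.metricsDropped 39065223 1499945226
carbon.relays.relay_host.metricsDropped 39078790 1499945646
"""

# Assumption: statistics are submitted every 60 seconds.
EXPECTED_INTERVAL = 60


def parse(capture):
    """Return (metric, value, timestamp) tuples from 'name value ts' lines."""
    points = []
    for line in capture.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip anything that is not a well-formed datapoint
        metric, value, ts = parts
        points.append((metric, int(value), int(ts)))
    return points


def find_duplicates(points):
    """Return timestamps that occur more than once for the same metric."""
    counts = Counter((metric, ts) for metric, _, ts in points)
    return sorted(ts for (_, ts), n in counts.items() if n > 1)


def find_gaps(points, interval=EXPECTED_INTERVAL):
    """Return (start, end, missing) for each hole wider than one interval."""
    stamps = sorted({ts for _, _, ts in points})
    gaps = []
    for prev, cur in zip(stamps, stamps[1:]):
        if cur - prev > interval:
            gaps.append((prev, cur, (cur - prev) // interval - 1))
    return gaps


POINTS = parse(CAPTURE)
DUPLICATES = find_duplicates(POINTS)
GAPS = find_gaps(POINTS)

print("duplicated timestamps:", DUPLICATES)
for start, end, missing in GAPS:
    print(f"gap {start} -> {end}: {missing} datapoints missing")
```

Under that 60-second assumption, the capture contains one timestamp sent twice (1499945226) and four holes that together account for 21 missing datapoints.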

Stat configuration is standard:

cluster graphite_cluster
    forward
        graphite.example.com:2003 proto tcp
    ;

statistics
    send to graphite_cluster
    stop
    ;

grobian commented 7 years ago

That's really odd behaviour; sending the same datapoint twice is very unexpected.

grobian commented 7 years ago

Is the relay reporting any connection issues? The only cause I can think of right now is that writing the metrics to the remote graphite server fails for some reason.

deejay1 commented 7 years ago

No connection errors were logged during this time; the output above is straight from a tcpdump on the relay. One possibility we're investigating right now is bonding issues (or something similar) on the receiving hosts, because we route the metrics back to the load-balanced relay pool, which then forwards them to two graphite clusters: one with fnv1a_ch replication (where we're seeing the gaps) and a "normal" one, which seems to be fine.

grobian commented 6 years ago

Closing this issue for now; please reopen if the problem persists.