micrometer-metrics / micrometer

An application observability facade for the most popular observability tools. Think SLF4J, but for observability.
https://micrometer.io
Apache License 2.0

metrics stopped flowing after 30 min of moderate use #354

Closed khelenek closed 6 years ago

khelenek commented 6 years ago

Grails (Spring Boot 1.5.7)

Gradle:

compile 'io.micrometer:micrometer-core:1.0.0-rc.6'
compile 'io.micrometer:micrometer-registry-statsd:1.0.0-rc.6'
compile 'io.micrometer:micrometer-spring-legacy:1.0.0-rc.6'

application.yml:

spring.metrics.statsd.enabled: true
spring.metrics.flavor: datadog
spring.metrics.host: localhost
spring.metrics.port: 8125

Code:

compositeMeterRegistry.timer(TIMER_NAME, "tagname", "tagvalue").record(milliseconds, TimeUnit.MILLISECONDS)

Behavior: During testing and QA the data showed up in Datadog. We put a canary server into production temporarily. Metrics flowed fine for about 30 minutes, then completely stopped (it's been 6 hours since any micrometer metrics were seen). I am able to run diagnostic Groovy code on the box and this is as far as I could get:

compositeMeterRegistry.meters.find { it.id.name == TIMER_NAME }.firstChild().publisher.queue.size()

This returned about 1M after 5 hours. So it seems the data just stopped being consumed (sent to the agent). The datadog agent collects other metrics from the machine, and those continue to work fine. There are no errors in the datadog logs or in the output of the datadog-agent status command, so I suspect the agent is simply not getting the data from the Java app.

I'm not sure what else to do to diagnose this, but I'm happy to run any tests. As a side note, I would prefer not to bring Reactor and Netty into the project and instead just use the java-dogstatsd-client behind the Micrometer interfaces, but I'm not sure how hard that would be.

jkschneider commented 6 years ago

Thanks @khelenek.

Sounds like the UDP client broke for some reason and things just piled up in the UnicastProcessor. I'll take a look.

As a side note I would prefer not to bring in reactor and netty to the project and just use the java-dogstatsd-client.

I think it's reasonable for us to go ahead and shade reactor and netty so your runtime classpath is unaffected. java-dogstatsd-client is unfortunately not an option for us, because this implementation serves more than just Datadog; we also publish Etsy- and Telegraf-flavored StatsD.

jkschneider commented 6 years ago

We believe this might be related to https://github.com/reactor/reactor-netty/issues/232. We haven't been able to reproduce it here, so we're going with the theory that this is the cause. I'll be releasing 1.0.0-rc.8 later today with an upgrade to reactor-netty that addresses that issue. If you don't mind testing sometime tomorrow or later this week, that would be very helpful!

maciejwalkowiak commented 6 years ago

The same thing happens for me, even after upgrading to 1.0.0-rc.8. Under low load, everything works fine. In the production environment, on the other hand, metrics are gone after ~30 minutes of moderate load.

I customised StatsdMeterRegistry to log when either of the two subscriptions ends with an error or completes, like this:

public void start() {
    UdpClient.create(statsdConfig.host(), statsdConfig.port())
             .newHandler((in, out) -> out
                                 .options(NettyPipeline.SendOptions::flushOnEach)
                                 .sendString(publisher)
                                 .neverComplete()
                        )
             .subscribe(client -> {
                            this.udpClient.replace(client);

                            // now that we're connected, start polling gauges
                            meterPoller.replace(Flux.interval(statsdConfig.pollingFrequency())
                                                    .doOnEach(n -> {
                                                        synchronized (pollableMeters) {
                                                            pollableMeters.forEach(StatsdPollable::poll);
                                                        }
                                                    })
                                                    .subscribe(__ -> {
                                                               },
                                                               error -> LOG.error("Error during polling metrics", error),
                                                               () -> LOG.error("Polling metrics completed")));
                        }, error -> LOG.error("Error during connecting to UDP", error),
                        () -> LOG.error("UDP connection completed"));
}

Unfortunately, nothing shows up in the logs. The queue size seems fine, and the datadog agent running on the instance continues sending metrics to Datadog, just without those coming from Micrometer.

Any hints on how we could debug this?

checketts commented 6 years ago

If metrics just stop being written, without any logging indicating a failure, do you have a Thread.setDefaultUncaughtExceptionHandler(<handler>) set that could log when that background thread dies, along with the corresponding stack trace?
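
For reference, a minimal sketch of that suggestion (the class and logger names here are placeholders, not part of Micrometer):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class UncaughtExceptionLogging {
    private static final Logger LOG = LoggerFactory.getLogger(UncaughtExceptionLogging.class);

    public static void install() {
        // Log any thread that dies from an uncaught exception, along with its stack trace.
        Thread.setDefaultUncaughtExceptionHandler((thread, error) ->
                LOG.error("Thread {} died with an uncaught exception", thread.getName(), error));
    }
}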

jkschneider commented 6 years ago

@smaldini Any thoughts on a better mechanism to understand why UdpClient stops publishing? It must be some sort of uncaught exception, so I think the question is how best to catch it and record it (and ideally recover from it).

smaldini commented 6 years ago

If we enable debug logging (the SLF4J logger reactor.ipc.netty at DEBUG), it might be very verbose over a long session, but it would be very beneficial for the investigation. Alternatively, we can try to reproduce the behavior. If I understand correctly, it's at most one UdpClient running a single connection, so by "load" you mean many packets written, am I right?
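
If Logback happens to be the SLF4J backend, the level could also be raised programmatically; a sketch under that assumption (not something Micrometer itself provides):

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

public class ReactorNettyDebugLogging {
    public static void enable() {
        // "reactor.ipc.netty" is the logger prefix used by reactor-netty 0.7.x.
        Logger logger = (Logger) LoggerFactory.getLogger("reactor.ipc.netty");
        logger.setLevel(Level.DEBUG);
    }
}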

smaldini commented 6 years ago

Also @maciejwalkowiak, what did you mean by "queue size is fine"? Is the queue size changing? Is it empty? Is it constantly increasing?

jkschneider commented 6 years ago

@maciejwalkowiak Can we also assume that you are directing your Micrometer StatsD connection at the datadog agent running alongside your app on the same host?

maciejwalkowiak commented 6 years ago

@smaldini

If I get it, it's at most one UdpClient running 1 connection, so by load you mean many packet written, am I right ?

I meant "production" grade load (lets say up to 1k requests per second) on the server which reflects way more metrics sent to statsd I believe.

We are not able to turn debug logs on production server.

For the time that we are getting metrics, the queue size goes up and down, so all looks good:

[screenshot, 2018-02-05: queue size fluctuating over time]

@jkschneider

@maciejwalkowiak Can we also assume that you are directing your Micrometer StatsD connection at the datadog agent running alongside your app on the same host?

Yes, the statsd agent runs on the same machine.

There is also a chance that something is wrong with our datadog agent and the scripts around it; we will continue investigating (and maybe switch to sending metrics to Datadog directly).

Thanks for the support!

jkschneider commented 6 years ago

@maciejwalkowiak Another idea: could you add a gauge on the total number of meters? Perhaps metrics stop flowing from dogstatsd when they reach some fixed threshold that we don't know about:

registry.gauge("meter.total", registry, r -> r.getMeters().size());
maciejwalkowiak commented 6 years ago

We migrated to the direct Datadog integration, which seems to work after a few adjustments.

As soon as I have a chance to get back to the statsd integration, I will give it a try.

jkschneider commented 6 years ago

So I created a quick sample which just piles more and more tags onto a gauge called large.cardinality.gauge (adding one more tag every 10 ms), then plotted the results over the course of 30 minutes:
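
A rough sketch of what such a sample might look like (an approximation, not the exact test code; the registry setup and tag name are assumptions):

import io.micrometer.core.instrument.Clock;
import io.micrometer.core.instrument.Tags;
import io.micrometer.statsd.StatsdConfig;
import io.micrometer.statsd.StatsdMeterRegistry;

import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class LargeCardinalityGaugeSample {
    public static void main(String[] args) {
        StatsdMeterRegistry registry = new StatsdMeterRegistry(StatsdConfig.DEFAULT, Clock.SYSTEM);
        AtomicInteger uniqueTags = new AtomicInteger();

        // Every 10 ms, register the same gauge name with one more unique tag value,
        // so the number of distinct meters grows steadily (~6,000 per minute).
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
                () -> registry.gauge("large.cardinality.gauge",
                        Tags.of("index", String.valueOf(uniqueTags.incrementAndGet())), 0),
                0, 10, TimeUnit.MILLISECONDS);
    }
}

At that rate the registry passes 100k unique meters well within the 30-minute window, which matches the plot below.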

[chart: unique meter count growing past 100k over 30 minutes while statsd.queue.size appears blank]

Notice how statsd.queue.size looks blank? It didn't in the first couple minutes of the app running. After an initial period, it seems like the queue stays consistently drained (the client is "keeping up"), even as the number of unique meters continues to grow beyond 100k. So the gauge value is 0. I think that there may be something about how datadog (or dogstatsd) deals with consistently zero-value gauges. Perhaps it can't distinguish between a zero-value and non-existent gauge, so it just drops it from the graphs.

Metrics other than these consistently zero-valued gauges seem to continue reporting fine.

jkschneider commented 6 years ago

After #403, statsd.queue.size continues to be plotted even though the queue is generally drained:

[chart: statsd.queue.size continues to be plotted after #403]

I think we've solved this?