samstarling / finagle-prometheus

Provides a bridge between Finagle and Prometheus metrics
https://samstarling.co.uk/projects/finagle-prometheus/
MIT License
30 stars 18 forks source link

Race condition on PrometheusStatsReceiver.counter #24

Closed seanbrant closed 5 years ago

seanbrant commented 6 years ago

I'm getting the error java.lang.IllegalArgumentException: Collector already registered that provides name: finagle_my_service_443_connect_latency_ms_count.

This is my current theory:

  1. Thread 1 calls counters.getOrElseUpdate which doesn't have the counter so it grabs the lock
  2. Thread 2 calls counters.getOrElseUpdate which doesn't have the counter and can't grab the lock so it blocks
  3. Thread 1 registers a new counter and releases the lock
  4. Thread 2 grabs the lock and registers a new counter which produces the error

I think the lock needs to be around https://github.com/samstarling/finagle-prometheus/blob/master/src/main/scala/com/samstarling/prometheusfinagle/PrometheusStatsReceiver.scala#L39-L41 to prevent this error.

FWIW I was able to reproduce this issue in the PrometheusStatsReceiverRaceTest however since it's a race condition it's not consistent.

samstarling commented 6 years ago

Hey @seanbrant – thanks for the clear report! I don't get much of a chance to use this in production, so it's really useful to get things like this. Sorry you've come across this bug, though.

I just opened PR #25: could you take a look and see if it looks reasonable to you? I'm also curious if you made any changes to the PrometheusStatsReceiverRaceTest: it'd be nice to tighten up the testing there a little, so if you came across a way to replicate it more easily then please let me know.

seanbrant commented 6 years ago

That looks correct to me. Oddly enough adding a print above https://github.com/samstarling/finagle-prometheus/blob/master/src/test/scala/com/samstarling/prometheusfinagle/PrometheusStatsReceiverRaceTest.scala#L19 caused it to happen. Though not consistently. We are seeing it in production consistently when our services get restarted.

Thanks for jumping on this so quickly!

samstarling commented 6 years ago

No problem. I just released 0.0.8, and it's syncing from bintray to Maven Central at the moment – but it can take a little while (hours). I'll leave this issue open for now, but if you get a chance to try it out in production then I'd be really interested to know if this fixes it! 🤞🏼

seanbrant commented 6 years ago

Seeing a new error now when my service starts up.

14:49:41.809 [main] ERROR com.twitter.app.LoadService - LoadService: failed to instantiate 'com.samstarling.prometheusfinagle.PrometheusStatsReceiver' for the requested service 'com.twitter.finagle.stats.StatsReceiver'
java.lang.NullPointerException: null
    at com.twitter.util.ProxyTimer.schedulePeriodically(Timer.scala:148)
    at com.twitter.util.Timer.schedule(Timer.scala:56)
    at com.twitter.util.Timer.schedule$(Timer.scala:53)
    at com.twitter.util.ProxyTimer.schedule(Timer.scala:139)
    at com.twitter.util.Timer.schedule(Timer.scala:66)
    at com.twitter.util.Timer.schedule$(Timer.scala:65)
    at com.twitter.util.ProxyTimer.schedule(Timer.scala:139)
    at com.samstarling.prometheusfinagle.PrometheusStatsReceiver.<init>(PrometheusStatsReceiver.scala:24)
    at com.samstarling.prometheusfinagle.PrometheusStatsReceiver.<init>(PrometheusStatsReceiver.scala:13)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at java.lang.Class.newInstance(Class.java:442)
samstarling commented 6 years ago

Hey @seanbrant – sorry it's taken me so long to reply to this. Can I ask what version of Finagle you're using? Which version of this library were you upgrading from?

coduinix commented 6 years ago

It looks like the synchronized for stat metrics is not in the right place. It should be synchronized outside the getOrElseUpdate call. Otherwise it could still do the side effect twice.

I've added a patch file with a test for stat which fails every now and then: stat-test.txt

seanbrant commented 6 years ago

@samstarling finagle 18.2.0 finagle-prometheus 0.0.7

Fluxx commented 6 years ago

Also wanted to add we're seeing this two on the latest finagle 18.6.0 and finagle-prometheus 0.0.9:

[info] 14:55:18.532 [main] ERROR com.twitter.app.LoadService - LoadService: failed to instantiate 'com.samstarling.prometheusfinagle.PrometheusStatsReceiver' for the requested service 'com.twitter.finagle.stats.StatsReceiver'
[info] java.lang.NullPointerException: null
[info]  at com.twitter.util.ProxyTimer.schedulePeriodically(Timer.scala:148)
[info]  at com.twitter.util.Timer$class.schedule(Timer.scala:55)
[info]  at com.twitter.util.ProxyTimer.schedule(Timer.scala:139)
[info]  at com.twitter.util.Timer$class.schedule(Timer.scala:66)
[info]  at com.twitter.util.ProxyTimer.schedule(Timer.scala:139)
[info]  at com.samstarling.prometheusfinagle.PrometheusStatsReceiver.<init>(PrometheusStatsReceiver.scala:23)
[info]  at com.samstarling.prometheusfinagle.PrometheusStatsReceiver.<init>(PrometheusStatsReceiver.scala:13)
[info]  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
[info]  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
[info]  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
[info]  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
[info]  at java.lang.Class.newInstance(Class.java:442)
[info]  at com.twitter.app.LoadService$$anonfun$loadImpls$2.apply(LoadService.scala:191)
[info]  at com.twitter.app.LoadService$$anonfun$loadImpls$2.apply(LoadService.scala:180)
samstarling commented 6 years ago

@coduinix Sorry for neglecting this issue. I've just raised a pull request (#28) that will fix part of the issue (the NPE), and I'll try and get this released ASAP.

Aside from that, do you think the synchronized for stat metrics is still in the wrong place? If so, I'll raise a separate issue for that.

samstarling commented 6 years ago

Hey @seanbrant: I pushed a variety of fixes recently, but do you still think this race condition exists? I've been hammering the tests today and haven't come across any failures. If you still think it's a problem, let me know – otherwise I'll close this issue. Thanks!