Closed: @seanbrant closed this issue 5 years ago
Hey @seanbrant – thanks for the clear report! I don't get much of a chance to use this in production, so it's really useful to get things like this. Sorry you've come across this bug, though.
I just opened PR #25: could you take a look and see if it looks reasonable to you? I'm also curious whether you made any changes to the PrometheusStatsReceiverRaceTest: it'd be nice to tighten up the testing there a little, so if you came across a way to replicate it more easily then please let me know.
That looks correct to me. Oddly enough, adding a print above https://github.com/samstarling/finagle-prometheus/blob/master/src/test/scala/com/samstarling/prometheusfinagle/PrometheusStatsReceiverRaceTest.scala#L19 caused it to happen, though not consistently. We are seeing it consistently in production when our services get restarted.
Thanks for jumping on this so quickly!
No problem. I just released 0.0.8, and it's syncing from Bintray to Maven Central at the moment – but it can take a little while (hours). I'll leave this issue open for now, but if you get a chance to try it out in production then I'd be really interested to know if this fixes it! 🤞🏼
I'm seeing a new error now when my service starts up:
14:49:41.809 [main] ERROR com.twitter.app.LoadService - LoadService: failed to instantiate 'com.samstarling.prometheusfinagle.PrometheusStatsReceiver' for the requested service 'com.twitter.finagle.stats.StatsReceiver'
java.lang.NullPointerException: null
at com.twitter.util.ProxyTimer.schedulePeriodically(Timer.scala:148)
at com.twitter.util.Timer.schedule(Timer.scala:56)
at com.twitter.util.Timer.schedule$(Timer.scala:53)
at com.twitter.util.ProxyTimer.schedule(Timer.scala:139)
at com.twitter.util.Timer.schedule(Timer.scala:66)
at com.twitter.util.Timer.schedule$(Timer.scala:65)
at com.twitter.util.ProxyTimer.schedule(Timer.scala:139)
at com.samstarling.prometheusfinagle.PrometheusStatsReceiver.<init>(PrometheusStatsReceiver.scala:24)
at com.samstarling.prometheusfinagle.PrometheusStatsReceiver.<init>(PrometheusStatsReceiver.scala:13)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
Hey @seanbrant – sorry it's taken me so long to reply to this. Can I ask what version of Finagle you're using? Which version of this library were you upgrading from?
It looks like the synchronized for stat metrics is not in the right place: it should wrap the getOrElseUpdate call from the outside. Otherwise the side effect could still run twice.
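To illustrate the point above, here's a minimal sketch (not the library's actual code) contrasting the two lock placements; `registrations` is a hypothetical stand-in for the side effect of registering a collector:

```scala
import scala.collection.mutable

object LockPlacement {
  private val counters = mutable.Map.empty[String, Int]
  private var registrations = 0

  // Wrong shape: the lock only guards the side effect *inside* the
  // default. Two threads can both miss the map, both enter the
  // default, and the side effect runs twice for the same key.
  def lockInside(name: String): Int =
    counters.getOrElseUpdate(name, synchronized { registrations += 1; registrations })

  // Right shape: the lock wraps the whole lookup-or-create, so the
  // check and the side effect form one atomic step.
  def lockOutside(name: String): Int = synchronized {
    counters.getOrElseUpdate(name, { registrations += 1; registrations })
  }
}
```

Under contention, `lockInside` can increment `registrations` twice for the same name, while `lockOutside` cannot.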
I've attached a patch file with a test for stat which fails every now and then: stat-test.txt
@samstarling finagle 18.2.0
finagle-prometheus 0.0.7
Also wanted to add that we're seeing this too on the latest finagle 18.6.0 and finagle-prometheus 0.0.9:
[info] 14:55:18.532 [main] ERROR com.twitter.app.LoadService - LoadService: failed to instantiate 'com.samstarling.prometheusfinagle.PrometheusStatsReceiver' for the requested service 'com.twitter.finagle.stats.StatsReceiver'
[info] java.lang.NullPointerException: null
[info] at com.twitter.util.ProxyTimer.schedulePeriodically(Timer.scala:148)
[info] at com.twitter.util.Timer$class.schedule(Timer.scala:55)
[info] at com.twitter.util.ProxyTimer.schedule(Timer.scala:139)
[info] at com.twitter.util.Timer$class.schedule(Timer.scala:66)
[info] at com.twitter.util.ProxyTimer.schedule(Timer.scala:139)
[info] at com.samstarling.prometheusfinagle.PrometheusStatsReceiver.<init>(PrometheusStatsReceiver.scala:23)
[info] at com.samstarling.prometheusfinagle.PrometheusStatsReceiver.<init>(PrometheusStatsReceiver.scala:13)
[info] at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
[info] at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
[info] at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
[info] at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
[info] at java.lang.Class.newInstance(Class.java:442)
[info] at com.twitter.app.LoadService$$anonfun$loadImpls$2.apply(LoadService.scala:191)
[info] at com.twitter.app.LoadService$$anonfun$loadImpls$2.apply(LoadService.scala:180)
@coduinix Sorry for neglecting this issue. I've just raised a pull request (#28) that will fix part of the issue (the NPE), and I'll try and get this released ASAP.
Aside from that, do you think the synchronized for stat metrics is still in the wrong place? If so, I'll raise a separate issue for that.
Hey @seanbrant: I pushed a variety of fixes recently, but do you still think this race condition exists? I've been hammering the tests today and haven't come across any failures. If you still think it's a problem, let me know – otherwise I'll close this issue. Thanks!
I'm getting the error java.lang.IllegalArgumentException: Collector already registered that provides name: finagle_my_service_443_connect_latency_ms_count. This is my current theory:

- Thread A calls counters.getOrElseUpdate, which doesn't have the counter, so it grabs the lock
- Thread B calls counters.getOrElseUpdate, which doesn't have the counter either, and can't grab the lock, so it blocks
- Thread A registers the collector and releases the lock; Thread B then runs the update function and tries to register the same collector, which throws

I think the lock needs to be around https://github.com/samstarling/finagle-prometheus/blob/master/src/main/scala/com/samstarling/prometheusfinagle/PrometheusStatsReceiver.scala#L39-L41 to prevent this error.
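The fix described above can be sketched as follows. This is a minimal illustration, not the library's actual code: the registry here is a hypothetical stand-in that throws on duplicate names, mirroring the "Collector already registered" error:

```scala
import scala.collection.mutable

object RegisterOnce {
  private val registered = mutable.Set.empty[String]
  private val counters = mutable.Map.empty[String, String]

  // Stand-in for CollectorRegistry.register: throws if the same
  // name is registered twice.
  private def register(name: String): String = {
    if (!registered.add(name))
      throw new IllegalArgumentException(
        s"Collector already registered that provides name: $name")
    name
  }

  // With the lock around the whole lookup-or-create, a second thread
  // waits for the first to finish and then finds the counter already
  // in the map, so register() is never called twice for one name.
  def getOrCreate(name: String): String = synchronized {
    counters.getOrElseUpdate(name, register(name))
  }
}
```

With the lock outside, repeated calls for the same name return the cached value instead of re-registering.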
FWIW I was able to reproduce this issue in the PrometheusStatsReceiverRaceTest, however since it's a race condition it's not consistent.