samstarling / finagle-prometheus

Provides a bridge between Finagle and Prometheus metrics
https://samstarling.co.uk/projects/finagle-prometheus/
MIT License

java.lang.IllegalArgumentException: Incorrect number of labels. #19

Closed · buxingzhe closed this issue 6 years ago

buxingzhe commented 6 years ago

Today I saw the following log in my application – any idea what's causing this exception? Thanks. The Finagle version is v18.2.0.

WARNING: Exception propagated to the default monitor (upstream address: /34.211.140.117:34521, downstream address: n/a, label: Router).
java.lang.IllegalArgumentException: Incorrect number of labels.
        at io.prometheus.client.SimpleCollector.labels(SimpleCollector.java:64)
        at com.samstarling.prometheusfinagle.PrometheusStatsReceiver$$anon$1.incr(PrometheusStatsReceiver.scala:28)
        at com.twitter.finagle.stats.Counter$class.incr(Counter.scala:9)
        at com.samstarling.prometheusfinagle.PrometheusStatsReceiver$$anon$1.incr(PrometheusStatsReceiver.scala:23)
        at com.twitter.finagle.stats.MultiCategorizingExceptionStatsHandler$$anonfun$record$1.apply(ExceptionStatsHandler.scala:155)
        at com.twitter.finagle.stats.MultiCategorizingExceptionStatsHandler$$anonfun$record$1.apply(ExceptionStatsHandler.scala:154)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at com.twitter.finagle.stats.MultiCategorizingExceptionStatsHandler.record(ExceptionStatsHandler.scala:154)
        at com.twitter.finagle.service.StatsFilter$$anonfun$apply$1.apply(StatsFilter.scala:185)
        at com.twitter.finagle.service.StatsFilter$$anonfun$apply$1.apply(StatsFilter.scala:173)
        at com.twitter.util.Promise$Monitored.apply(Promise.scala:202)
        at com.twitter.util.Promise$Monitored.apply(Promise.scala:198)
        at com.twitter.util.Promise$WaitQueue$class.com$twitter$util$Promise$WaitQueue$$run(Promise.scala:90)
        at com.twitter.util.Promise$WaitQueue$$anon$5.run(Promise.scala:85)
        at com.twitter.concurrent.LocalScheduler$Activation.run(Scheduler.scala:198)
        at com.twitter.concurrent.LocalScheduler$Activation.submit(Scheduler.scala:157)
        at com.twitter.concurrent.LocalScheduler.submit(Scheduler.scala:274)
        at com.twitter.concurrent.Scheduler$.submit(Scheduler.scala:109)
        at com.twitter.util.Promise$WaitQueue$class.runInScheduler(Promise.scala:85)
        at com.twitter.util.Promise$Transformer.runInScheduler(Promise.scala:215)
        at com.twitter.util.Promise.updateIfEmpty(Promise.scala:745)
        at com.twitter.util.Future$$anon$2.apply$mcV$sp(Future.scala:1697)
        at com.twitter.util.Monitor$class.apply(Monitor.scala:46)
        at com.twitter.util.Monitor$$anon$1.apply(Monitor.scala:58)
samstarling commented 6 years ago

Hey @buxingzhe – sorry that you're having this problem. This exception occurs when the number of label names and the number of values for those labels don't match up. It's likely a bug in the library, but I haven't come across it before. I'll try to work out what's happening ASAP – sorry for the inconvenience.
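
To illustrate what I mean, here's a minimal sketch using the Prometheus Java client directly (not this library) that reproduces the exception when the number of label values passed doesn't match the declared label names:

import io.prometheus.client.{CollectorRegistry, Counter}

val registry = new CollectorRegistry()

// The collector is declared with exactly one label name...
val requests = Counter.build()
  .name("requests_total")
  .help("Total requests")
  .labelNames("method")
  .register(registry)

requests.labels("GET").inc() // one value for one label name: fine

// ...but passing a different number of label values throws
// java.lang.IllegalArgumentException: Incorrect number of labels.
requests.labels("GET", "200").inc()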

cnstoll commented 6 years ago

I think I actually saw this one too and forgot to report it. I believe it was happening when calling a counter with a name string that extractLabels can't find a label in. In my case, I think I was calling it with something like "com.foo.bar", where the periods don't mean anything to it – the extractLabels function is really expecting something in the format "event_name".

samstarling commented 6 years ago

@cnstoll Interesting – that's useful to know. @buxingzhe: what's the name of the counter that you're using? I'll try and write a test case for this and then fix it, thanks for your patience!

cnstoll commented 6 years ago

Dug back through git history. The name of the counter I originally had that failed was log_store.create, and replacing the . with a _ was the first step to fixing it. But I think the PrometheusStatsReceiver's version of the counter also expects two "labels", so I think I ultimately landed on:

statsReceiver.counter("log_store_create", "create").incr()

I think the second "create" was required for it to work.

I could see a lot of ways to handle this in the library – clearer error handling might be the simplest. But I was pretty new to Prometheus when we started using this, and I thought it would be interesting to provide some sort of convenience "label maker" class that builds a counter string in the right format for Prometheus, so the event name and the pairs of labels can be parsed back out of it. That seems to be what the PrometheusStatsReceiver and extractLabels are trying to do – but you kind of have to read through the implementation to see how it works.
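
Something along these lines is what I had in mind – purely a hypothetical sketch, with made-up names and a made-up separator format, nothing that actually exists in the library:

// Hypothetical sketch only: LabelledMetricName and its colon-separated
// format are invented here, not part of finagle-prometheus.
import com.twitter.finagle.stats.{DefaultStatsReceiver, StatsReceiver}

case class LabelledMetricName(event: String, labels: Map[String, String]) {
  // Flatten the event name and label key/value pairs into a single string
  // that a receiver could later split back apart.
  def render: String =
    (event +: labels.toSeq.flatMap { case (k, v) => Seq(k, v) }).mkString(":")
}

val statsReceiver: StatsReceiver = DefaultStatsReceiver
val name = LabelledMetricName("log_store", Map("type" -> "create"))
statsReceiver.counter(name.render).incr()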

samstarling commented 6 years ago

@cnstoll Thanks for taking the time to describe that, it's really useful. Are you using the PrometheusStatsReceiver directly? That's only supposed to be there to act as the bridge for existing Finagle metrics. For creating your own, there's the Telemetry class. I need to think a bit about how to better convey the difference between these two.

For example, you'd do something like this:

val counter = telemetry.counter(
    "log_store",
    "Help text",
    Seq("type"))

counter.labels("create").inc()

Does that perhaps make more sense?

cnstoll commented 6 years ago

Interesting. Yeah, I'm using the PrometheusStatsReceiver directly in my service – so basically, anywhere I was using a Finagle StatsReceiver's counter, I just started using a PrometheusStatsReceiver instead. But the strings representing metrics in my service weren't in a good format for Prometheus, hence the exception.

I don't think I really looked at Telemetry at all, but that definitely looks a lot closer to what I actually wanted in practice!

buxingzhe commented 6 years ago

@samstarling, thanks for your quick response. Since I don't add any metrics of my own, it must be one of Finagle's. I'll add logging to show the metric name for more detail. Have a good day!

samstarling commented 6 years ago

@cnstoll I'd suggest looking at Telemetry – I'm considering how I might make it clearer to people not to use the PrometheusStatsReceiver manually, because it's really just there to act as the bridge. I'm trying to work out if there's a case where anyone would reasonably want to subclass it, or whether I might just be able to make it private so it can't be imported and used. I'd need to check that works how I intend it, though.
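
For the sake of illustration, restricting visibility might look something like this – just a sketch with the constructor parameters elided, assuming the class stays in its current package:

package com.samstarling.prometheusfinagle

// Sketch only: marking the class package-private would stop it being
// imported and instantiated outside the library, while code inside the
// library could still construct it and hand it to Finagle as the bridge.
private[prometheusfinagle] class PrometheusStatsReceiver(/* existing constructor parameters */) {
  // existing implementation unchanged
}

Anything inside the library could still expose an instance typed as a plain StatsReceiver, so Finagle itself wouldn't notice the difference.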

samstarling commented 6 years ago

@buxingzhe Do you have any idea which metric is causing the problem here? I haven't been able to replicate this locally yet.

buxingzhe commented 6 years ago

Sorry – after redeploying my application with logging, the issue hasn't appeared again. I'll keep observing and will update this issue once it happens again.

samstarling commented 6 years ago

@buxingzhe Please do let me know if it happens again, and I'll investigate further. For now, I'll close this issue.