Time based metrics disappearing on second visit on metrics page

wojniemiec commented 8 years ago

Hello, I didn't know where should I ask this question, I hope it's ok to put it here. I use Finatra Framework, we want to gather some metrics using built in admin server. I saw some documentation about metrics on Twitter Server github pages (http://twitter.github.io/twitter-server/Features.html#metrics). This is the endpoint I use: http://localhost:9990/admin/metrics.json?pretty=true

Let's say we have created some endpoints in our app, when I visit those, I see that number of requests displayed on metrics admin page increases (route/endpoint/GET/requests: 2), also I see some nice numbers for time based metrics, for example route/endpoint/GET/time.avg. The problem is that after I refresh metrics page, time based metrics are cleared. Number of requests remain the same, but metrics like time.avg, time.count, time.max, response_size.avg have value of 0. It looks like admin page counts only those requests that were not measured after last /metrics page visit. How can I get metrics for last 5 or 10 minutes?

kevinoliver commented 8 years ago

Sounds like something is clearing out stats. Do you know what metrics library you are using? You can check this at /admin/registry by looking at the registry.stats section.

On Tue, Feb 2, 2016 at 2:14 AM, Wojciech Niemiec notifications@github.com wrote:

Hello, I didn't know where should I ask this question, I hope it's ok to put it here. I use Finatra Framework, we want to gather some metrics using built in admin server. I saw some documentation about metrics on Twitter Server github pages ( http://twitter.github.io/twitter-server/Features.html#metrics). This is the endpoint I use: http://localhost:9990/admin/metrics.json?pretty=true

Let's say we have created some endpoints in our app, when I visit those, I see that number of requests displayed on metrics admin page increases (route/endpoint/GET/requests: 2), also I see some nice numbers for time based metrics, for example route/endpoint/GET/time.avg. The problem is that after I refresh metrics page, time based metrics are cleared. Number of requests remain the same, but metrics like time.avg, time.count, time.max, response_size.avg have value of 0. It looks like admin page counts only those requests that were not measured after last /metrics page visit. How can I get metrics for last 5 or 10 minutes?

— Reply to this email directly or view it on GitHub https://github.com/twitter/twitter-server/issues/33.

Kevin Oliver | Twitter, Inc | follow me: @kevino http://twitter.com/kevino

wojniemiec commented 8 years ago

I can access /admin/registry.json endpoint only. There is a big json response. not sure which part should I check, so I send it as attachment.

registry-json.txt

kevinoliver commented 8 years ago

Hmm. Ok from the "loadservice" section I see that you are using finagle-stats — thats good!

My guess is that this is old behavior that has since been fixed. Do you know what version of Finagle (or Finatra) you are running?

On Thu, Feb 4, 2016 at 12:38 AM, Wojciech Niemiec notifications@github.com wrote:

I can access /admin/registry.json endpoint only. There is a big json response. not sure which part should I check, so I send it as attachment.

registry-json.txt https://github.com/twitter/twitter-server/files/117073/registry-json.txt

— Reply to this email directly or view it on GitHub https://github.com/twitter/twitter-server/issues/33#issuecomment-179708914 .

Kevin Oliver | Twitter, Inc | follow me: @kevino http://twitter.com/kevino

wojniemiec commented 8 years ago

Ok here they are:

FInatra version: 2.1.2 Finagle-core: 6.31.0 Finagle-stats: 6.31.0 Finagle-zipkin: 6.31.0 Finagle-http: 6.31.0

Scala 2.11.x

mosesn commented 8 years ago

@wojniemiec our metrics are windowed, is it possible that this is what you're running into? If you wait for too long after making a request, it may have disappeared from the window. I think the window-size is one minute, so if you wait for too long they might disappear.

wojniemiec commented 8 years ago

Well sometimes they disappear on second visit, maybe its because the first time I visit I get cached metrics from previous window. The thing I would love to have is to see from what point of time and to what point of time those metrics are, because now its a bit enigmatic. But anyway, thanks for your help!

mosesn commented 8 years ago

@wojniemiec that sounds like a reasonable idea, would like you to make a counter that increments every time the window moves and add it to MetricsStatsReceiver? We'd be happy to accept a PR :+1:

kevinoliver commented 8 years ago

Yeah, this is something that should be better documented.

Do you think explaining it here[1] would've helped?

[1] https://twitter.github.io/twitter-server/Features.html#metrics

On Sun, Feb 7, 2016 at 11:54 PM, Moses Nakamura notifications@github.com wrote:

@wojniemiec https://github.com/wojniemiec that sounds like a reasonable idea, would like you to make a counter that increments every time the window moves and add it to MetricsStatsReceiver? We'd be happy to accept a PR [image: :+1:]

— Reply to this email directly or view it on GitHub https://github.com/twitter/twitter-server/issues/33#issuecomment-181244720 .

Kevin Oliver | Twitter, Inc | follow me: @kevino http://twitter.com/kevino

wojniemiec commented 8 years ago

@mosesn I could give it a try when I get some time on weekend ;)

@kevinoliver Yes, that would be great to have it there, I checked this page several times looking for an answer but I couldn't find it. Still, it was pretty helpful to me.

mosesn commented 8 years ago

@wojniemiec let us know if you need any help with this feature!

mosesn commented 8 years ago

@wojniemiec I added a note to the TwitterServer docs–they should be open sourced Monday, and will be pushed to the docs website on the next release.

yokoshin commented 8 years ago

I faced the same problem. I guess this impl is the cause of this https://github.com/twitter/finagle/blob/develop/finagle-stats/src/main/scala/com/twitter/finagle/stats/MetricsBucketedHistogram.scala#L75

when admin/metrics.json is accessed, this line increases nextSnapAfter by 1minute if Time.now > nextSnapAfter. If you haven't access admin/metrics.json for 3days, nextSnapAfter have the datetime of ( last access datetime to /admin/metrics.json + 1.min). If you access the url after the non-access time, nextSnapAfter is increased by 1min. This still matches `clear+recompute condition: https://github.com/twitter/finagle/blob/develop/finagle-stats/src/main/scala/com/twitter/finagle/stats/MetricsBucketedHistogram.scala#L73

So it works as if whenever you access /admin/metrics.json, it's cleared.

mosesn commented 8 years ago

@yokoshin yep, that's it. We've documented it here: https://github.com/twitter/twitter-server/blob/develop/doc/src/sphinx/Features.rst#metrics but it hasn't made it to the regular docs yet because we generally push docs out when we do a release.

yokoshin commented 8 years ago

Thank you for your replay. I’ll explain my thought by my patch https://github.com/yokoshin/finagle/commit/cfd3fa72f1d1d0174e2c81f8618ffaebd1994a2d

I think the problem is, to make MetricsBucketHistogram work correctly, we should access "/admin/metrics.json" more than once every minute.

In the original code, after 5 min of non-access period to "/admin/metrics.json", you should access the url 5 times, because nextSnapAfter is increment by only 1min.

Before nextSnapAfter reaches Time.now, histograms are cleared and re-caluclated on every requests. That why @wojniemiec said "Time based metrics disappearing on second visit"

My patch will resolve this problem. how do you think?

mosesn commented 8 years ago

@yokoshin ohhhh I understand now. Good catch! Your approach looks good, my one concern is that for folks who have a statsd that checks minutely (like we do at twitter) this will push it forward incorrectly when it's a little late. Since this is mostly for folks who are significantly behind on touching it, what do you think about doing it instead for 2 * latchPeriod, instead of just latchPeriod?

wojniemiec commented 8 years ago

Looks good :+1:

yokoshin commented 8 years ago

@mosesn if we use 2*latchPeriod, If a user access "admin/metrics.json" in 3 times over a few seconds, and the time condition is ( nextSnapAfter + latchPeriod *2 ) > Time.now > nextSnapAfter - 1 ) the result will be

refresh + recaculate
refresh + recaculate
become const It's a little bit confusing for user.

If you don't mind changing the spec that histograms refresh beginning of every-minute, we can simply implement like this.

   //no if statement
    nextSnapAfter.set(Time.now + latchPeriod)

This seems natural for both users and monitoring system.

yokoshin commented 8 years ago

Or if I consider 1-second slack of re-calculate condition,

        // Time.now > nextSnapAfter.get without 1second, then use Time.now
        if (Time.now >= nextSnapAfter.get ) {
          nextSnapAfter.set(Time.now + latchPeriod)
        } else {
          nextSnapAfter.set(nextSnapAfter.get + latchPeriod)
        }

This always set nextSnapAfter 1min later from now. So you can avoid false-positive detection from monitor tools. Also users can easily understand how MetricsBucketHistgram works.

yokoshin commented 8 years ago

@mosesn Let's close this issue. As you said, Finagle is used in many companies. So, it's better to avoid unnecessary changes. I committed patch based on your advice. https://github.com/yokoshin/finagle/commit/a17a3dbca9c1769f74009408410202df033e4ba8 If you agree to this, I'll sent a pull request to finalge-repo.

yokoshin commented 8 years ago

@mosesn Coud I have your comment ?

mosesn commented 8 years ago

@yokoshin sorry, I've been thinking about it on and off. I definitely think we should change something, but I've been thinking more about the solution you proposed–it's appealing because it's so much simpler. I think in certain implementations it could be more complicated though–for example, if you have a single service which talks to each service at the beginning of every minute.

So I think it would make sense for you to put up your patch https://github.com/yokoshin/finagle/commit/a17a3dbca9c1769f74009408410202df033e4ba8 as a PR and we can discuss how we want to go forward. I'd be curious to hear other people's opinions too.

mosesn commented 8 years ago

I think there are a couple tricky things here. One of them is how windowing should work. We weight data evenly within a window, so it's easy to reason about, and we don't lose interesting data within that window. This has great properties, but problems can arise if we allow clients to control the snapshotting process. What happen if somebody hits the the admin endpoint who's not a server? Let's say it's in the middle of a minute, and it snapshots, and then the next time the server hits, it's only seeing half a minute of data.

Another approach would be to not snapshot, and instead take an exponential moving average of the histogram. The risk here is that we would discount events that happened at the beginning of the minute, relative to the end of the minute, although this has the benefit that it avoids the snapshotting problem, we would be able to take data over a slightly longer window into account, and it means we could look at more recent data. EMAs are a little more difficult to reason about than histogram snapshots though.

One possible compromise might be to do the EMA discounting on snapshot windows, so we get the best of both worlds. The main problem with this approach is that it's still more difficult to reason about than snapshotted histograms.

I think that the current approach + jump to current minute if you've fallen behind is a good compromise. Another possible one is to have a timer thread which keeps track of it for you, which wouldn't necessarily be bad, but might cause unnecessary work and we would have to move the metrics data between threads, which is a disadvantage.

olix0r commented 8 years ago

As described in twitter/finagle#492, I find this behavior to be very surprising, and probably wrong. (And just because it's widely used doesn't mean it's well-understood or correct ;)

The "catching up" issue is certainly a bug. If I don't request stats for 2 hours, and then request stats every second for 2 minutes, it makes no sense that latching wouldn't go into effect immediately.
I'm particularly keen on not having latching windows hardcoded at all. In linkerd, we would like to show secondly data for our admin page and to be able to show minutely (or 10-secondly, or ...) data to metrics scrapers. I'm very curious how EMA could work for this.
If latching is supported, I'm skeptical that it should be controlled at all by external requests. I'm extremely skeptical that there should be a hardcoded latching window. For what it's worth, ostrich supports client-defined windows -- requests may be issued with a ?period=N param that manages timer-based latching.

Perhaps we could implement EMA as another StatsReceiver (i.e. analagous to ImmediateMetricsStatsReceiver) so that this could be used as an opt-in. Still, I think any coded assumptions about window-size are problematic. Multiple clients want different views of the same data.

olix0r commented 8 years ago

I suppose one option would be to set up an /admin endpoint for each desired granularity... I don't believe that this is possible today since granularity is controlled by the StatsReceiver and not the exporter. (I think?)

mosesn commented 8 years ago

@olix0r granularity is two-tiered. The SR has one granularity (one minute) and the exporter has another (one second).

olix0r commented 8 years ago

@yokoshin do you have time to submit a PR from https://github.com/twitter/finagle/commit/a17a3dbca9c1769f74009408410202df033e4ba8. As @kevinoliver mentioned, it probably needs a unit test. If you don't have time to get to that, I can probably help get the branch ready. Thanks!

yokoshin commented 8 years ago

@olix0r Thank you. At first, I wag going to wait until the discussion was concluded. But that seems to include some new features. So it's better to make a PR for this bug separately.

twitter / twitter-server

Time based metrics disappearing on second visit on metrics page #33