Closed wojniemiec closed 8 years ago
Sounds like something is clearing out stats. Do you know what metrics library you are using? You can check this at /admin/registry by looking at the registry.stats section.
On Tue, Feb 2, 2016 at 2:14 AM, Wojciech Niemiec notifications@github.com wrote:
Hello, I didn't know where should I ask this question, I hope it's ok to put it here. I use Finatra Framework, we want to gather some metrics using built in admin server. I saw some documentation about metrics on Twitter Server github pages ( http://twitter.github.io/twitter-server/Features.html#metrics). This is the endpoint I use: http://localhost:9990/admin/metrics.json?pretty=true
Let's say we have created some endpoints in our app, when I visit those, I see that number of requests displayed on metrics admin page increases (route/endpoint/GET/requests: 2), also I see some nice numbers for time based metrics, for example route/endpoint/GET/time.avg. The problem is that after I refresh metrics page, time based metrics are cleared. Number of requests remain the same, but metrics like time.avg, time.count, time.max, response_size.avg have value of 0. It looks like admin page counts only those requests that were not measured after last /metrics page visit. How can I get metrics for last 5 or 10 minutes?
— Reply to this email directly or view it on GitHub https://github.com/twitter/twitter-server/issues/33.
Kevin Oliver | Twitter, Inc | follow me: @kevino http://twitter.com/kevino
I can access /admin/registry.json endpoint only. There is a big json response. not sure which part should I check, so I send it as attachment.
Hmm. Ok from the "loadservice" section I see that you are using finagle-stats — thats good!
My guess is that this is old behavior that has since been fixed. Do you know what version of Finagle (or Finatra) you are running?
On Thu, Feb 4, 2016 at 12:38 AM, Wojciech Niemiec notifications@github.com wrote:
I can access /admin/registry.json endpoint only. There is a big json response. not sure which part should I check, so I send it as attachment.
registry-json.txt https://github.com/twitter/twitter-server/files/117073/registry-json.txt
— Reply to this email directly or view it on GitHub https://github.com/twitter/twitter-server/issues/33#issuecomment-179708914 .
Kevin Oliver | Twitter, Inc | follow me: @kevino http://twitter.com/kevino
Ok here they are:
FInatra version: 2.1.2 Finagle-core: 6.31.0 Finagle-stats: 6.31.0 Finagle-zipkin: 6.31.0 Finagle-http: 6.31.0
Scala 2.11.x
@wojniemiec our metrics are windowed, is it possible that this is what you're running into? If you wait for too long after making a request, it may have disappeared from the window. I think the window-size is one minute, so if you wait for too long they might disappear.
Well sometimes they disappear on second visit, maybe its because the first time I visit I get cached metrics from previous window. The thing I would love to have is to see from what point of time and to what point of time those metrics are, because now its a bit enigmatic. But anyway, thanks for your help!
@wojniemiec that sounds like a reasonable idea, would like you to make a counter that increments every time the window moves and add it to MetricsStatsReceiver? We'd be happy to accept a PR :+1:
Yeah, this is something that should be better documented.
Do you think explaining it here[1] would've helped?
[1] https://twitter.github.io/twitter-server/Features.html#metrics
On Sun, Feb 7, 2016 at 11:54 PM, Moses Nakamura notifications@github.com wrote:
@wojniemiec https://github.com/wojniemiec that sounds like a reasonable idea, would like you to make a counter that increments every time the window moves and add it to MetricsStatsReceiver? We'd be happy to accept a PR [image: :+1:]
— Reply to this email directly or view it on GitHub https://github.com/twitter/twitter-server/issues/33#issuecomment-181244720 .
Kevin Oliver | Twitter, Inc | follow me: @kevino http://twitter.com/kevino
@mosesn I could give it a try when I get some time on weekend ;)
@kevinoliver Yes, that would be great to have it there, I checked this page several times looking for an answer but I couldn't find it. Still, it was pretty helpful to me.
@wojniemiec let us know if you need any help with this feature!
@wojniemiec I added a note to the TwitterServer docs–they should be open sourced Monday, and will be pushed to the docs website on the next release.
I faced the same problem. I guess this impl is the cause of this https://github.com/twitter/finagle/blob/develop/finagle-stats/src/main/scala/com/twitter/finagle/stats/MetricsBucketedHistogram.scala#L75
when admin/metrics.json is accessed, this line increases nextSnapAfter by 1minute if Time.now > nextSnapAfter.
If you haven't access admin/metrics.json for 3days, nextSnapAfter
have the datetime of ( last access datetime to /admin/metrics.json + 1.min). If you access the url after the non-access time, nextSnapAfter
is increased by 1min.
This still matches `clear+recompute
condition:
https://github.com/twitter/finagle/blob/develop/finagle-stats/src/main/scala/com/twitter/finagle/stats/MetricsBucketedHistogram.scala#L73
So it works as if whenever you access /admin/metrics.json, it's cleared.
@yokoshin yep, that's it. We've documented it here: https://github.com/twitter/twitter-server/blob/develop/doc/src/sphinx/Features.rst#metrics but it hasn't made it to the regular docs yet because we generally push docs out when we do a release.
Thank you for your replay. I’ll explain my thought by my patch https://github.com/yokoshin/finagle/commit/cfd3fa72f1d1d0174e2c81f8618ffaebd1994a2d
I think the problem is, to make MetricsBucketHistogram work correctly, we should access "/admin/metrics.json" more than once every minute.
In the original code, after 5 min of non-access period to "/admin/metrics.json", you should access the url 5 times, because nextSnapAfter
is increment by only 1min.
Before nextSnapAfter
reaches Time.now
, histograms are cleared and re-caluclated on every requests.
That why @wojniemiec said "Time based metrics disappearing on second visit"
My patch will resolve this problem. how do you think?
@yokoshin ohhhh I understand now. Good catch! Your approach looks good, my one concern is that for folks who have a statsd that checks minutely (like we do at twitter) this will push it forward incorrectly when it's a little late. Since this is mostly for folks who are significantly behind on touching it, what do you think about doing it instead for 2 * latchPeriod
, instead of just latchPeriod
?
Looks good :+1:
@mosesn if we use 2*latchPeriod
, If a user access "admin/metrics.json" in 3 times over a few seconds, and the time condition is
( nextSnapAfter + latchPeriod *2 ) > Time.now > nextSnapAfter - 1 )
the result will be
If you don't mind changing the spec that histograms refresh beginning of every-minute, we can simply implement like this.
//no if statement
nextSnapAfter.set(Time.now + latchPeriod)
This seems natural for both users and monitoring system.
Or if I consider 1-second slack of re-calculate condition,
// Time.now > nextSnapAfter.get without 1second, then use Time.now
if (Time.now >= nextSnapAfter.get ) {
nextSnapAfter.set(Time.now + latchPeriod)
} else {
nextSnapAfter.set(nextSnapAfter.get + latchPeriod)
}
This always set nextSnapAfter
1min later from now.
So you can avoid false-positive detection from monitor tools.
Also users can easily understand how MetricsBucketHistgram works.
@mosesn Let's close this issue. As you said, Finagle is used in many companies. So, it's better to avoid unnecessary changes. I committed patch based on your advice. https://github.com/yokoshin/finagle/commit/a17a3dbca9c1769f74009408410202df033e4ba8 If you agree to this, I'll sent a pull request to finalge-repo.
@mosesn Coud I have your comment ?
@yokoshin sorry, I've been thinking about it on and off. I definitely think we should change something, but I've been thinking more about the solution you proposed–it's appealing because it's so much simpler. I think in certain implementations it could be more complicated though–for example, if you have a single service which talks to each service at the beginning of every minute.
So I think it would make sense for you to put up your patch https://github.com/yokoshin/finagle/commit/a17a3dbca9c1769f74009408410202df033e4ba8 as a PR and we can discuss how we want to go forward. I'd be curious to hear other people's opinions too.
I think there are a couple tricky things here. One of them is how windowing should work. We weight data evenly within a window, so it's easy to reason about, and we don't lose interesting data within that window. This has great properties, but problems can arise if we allow clients to control the snapshotting process. What happen if somebody hits the the admin endpoint who's not a server? Let's say it's in the middle of a minute, and it snapshots, and then the next time the server hits, it's only seeing half a minute of data.
Another approach would be to not snapshot, and instead take an exponential moving average of the histogram. The risk here is that we would discount events that happened at the beginning of the minute, relative to the end of the minute, although this has the benefit that it avoids the snapshotting problem, we would be able to take data over a slightly longer window into account, and it means we could look at more recent data. EMAs are a little more difficult to reason about than histogram snapshots though.
One possible compromise might be to do the EMA discounting on snapshot windows, so we get the best of both worlds. The main problem with this approach is that it's still more difficult to reason about than snapshotted histograms.
I think that the current approach + jump to current minute if you've fallen behind is a good compromise. Another possible one is to have a timer thread which keeps track of it for you, which wouldn't necessarily be bad, but might cause unnecessary work and we would have to move the metrics data between threads, which is a disadvantage.
As described in twitter/finagle#492, I find this behavior to be very surprising, and probably wrong. (And just because it's widely used doesn't mean it's well-understood or correct ;)
?period=N
param that manages timer-based latching.Perhaps we could implement EMA as another StatsReceiver (i.e. analagous to ImmediateMetricsStatsReceiver) so that this could be used as an opt-in. Still, I think any coded assumptions about window-size are problematic. Multiple clients want different views of the same data.
I suppose one option would be to set up an /admin endpoint for each desired granularity... I don't believe that this is possible today since granularity is controlled by the StatsReceiver and not the exporter. (I think?)
@olix0r granularity is two-tiered. The SR has one granularity (one minute) and the exporter has another (one second).
@yokoshin do you have time to submit a PR from https://github.com/twitter/finagle/commit/a17a3dbca9c1769f74009408410202df033e4ba8. As @kevinoliver mentioned, it probably needs a unit test. If you don't have time to get to that, I can probably help get the branch ready. Thanks!
@olix0r Thank you. At first, I wag going to wait until the discussion was concluded. But that seems to include some new features. So it's better to make a PR for this bug separately.
Hello, I didn't know where should I ask this question, I hope it's ok to put it here. I use Finatra Framework, we want to gather some metrics using built in admin server. I saw some documentation about metrics on Twitter Server github pages (http://twitter.github.io/twitter-server/Features.html#metrics). This is the endpoint I use: http://localhost:9990/admin/metrics.json?pretty=true
Let's say we have created some endpoints in our app, when I visit those, I see that number of requests displayed on metrics admin page increases (route/endpoint/GET/requests: 2), also I see some nice numbers for time based metrics, for example route/endpoint/GET/time.avg. The problem is that after I refresh metrics page, time based metrics are cleared. Number of requests remain the same, but metrics like time.avg, time.count, time.max, response_size.avg have value of 0. It looks like admin page counts only those requests that were not measured after last /metrics page visit. How can I get metrics for last 5 or 10 minutes?