haskell-github-trust / ekg-core

Library for tracking system metrics
BSD 3-Clause "New" or "Revised" License

Add support for time-since-last-action heartbeat metrics #26

Open jessekempf opened 6 years ago

jessekempf commented 6 years ago

I've found there's a common pattern in services I write, where I want a heartbeat or watchdog timer for periodic jobs being run by an in-memory scheduler. I thought it'd be good to contribute it back as a change upstream.

23Skidoo commented 6 years ago

/cc @tibbe

tibbe commented 6 years ago

Could this be implemented outside the library using registerGroup?

jessekempf commented 6 years ago

It could be, but in that case why wouldn't Counter, Gauge, Label, and Distribution be implemented outside the library?

jessekempf commented 6 years ago

Also, when I took a look at all Hackage packages with "ekg" in the name, all of them reuse the primitives defined in ekg-core. And registerGroup seems to be for composites of primitive metrics, but a Heartbeat is atomic.

tibbe commented 6 years ago

The Value type captures semantic information about the metric being monitored:

 data Value = Counter {-# UNPACK #-} !Int64
            | Gauge {-# UNPACK #-} !Int64
            | Label {-# UNPACK #-} !T.Text
            | Distribution !Distribution.Stats

Counters are monotonically increasing, gauges can go both up and down, and labels/distributions are different types. A heartbeat is simply a type of counter, not a semantically different kind of thing. Same story for MetricSampler. The different constructors there are so we can construct the semantically right Value.

(Now, could all the registerFoo functions be written in terms of registerGroup? I haven't thought about it or tried it, maybe it can be done.)
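For readers following along, here is a rough sketch of what building the heartbeat outside the library could look like. It uses ekg-core's `registerGauge` rather than `registerGroup`, since only a single value is sampled. The names `Heartbeat`, `newHeartbeat`, `beat`, `secondsSinceLastBeat`, and `registerHeartbeat` are hypothetical and are not taken from ekg-core or from the pull request; whole-second resolution is also an assumption.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import Data.Int (Int64)
import qualified Data.Text as T
import Data.Time.Clock (UTCTime, diffUTCTime, getCurrentTime)
import System.Metrics (Store, registerGauge)

-- | Hypothetical heartbeat: remembers when the last action happened.
newtype Heartbeat = Heartbeat (IORef UTCTime)

-- | Create a heartbeat whose "last action" is the moment of creation.
newHeartbeat :: IO Heartbeat
newHeartbeat = Heartbeat <$> (newIORef =<< getCurrentTime)

-- | Record that the monitored action has just happened.
beat :: Heartbeat -> IO ()
beat (Heartbeat ref) = writeIORef ref =<< getCurrentTime

-- | Whole seconds elapsed since the last call to 'beat' (assumed resolution).
secondsSinceLastBeat :: Heartbeat -> IO Int64
secondsSinceLastBeat (Heartbeat ref) = do
  now <- getCurrentTime
  lastBeat <- readIORef ref
  pure (floor (now `diffUTCTime` lastBeat))

-- | Expose the heartbeat to a Store as an ordinary gauge.
registerHeartbeat :: T.Text -> Store -> IO Heartbeat
registerHeartbeat name store = do
  hb <- newHeartbeat
  registerGauge name (secondsSinceLastBeat hb) store
  pure hb
```

In this sketch a scheduler would call `beat hb` after each successful run, and whatever consumes the store would see a gauge that grows until the next beat.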

jessekempf commented 6 years ago

Right, if anything a heartbeat is a type of gauge, but where the value is a direct function of time rather than an indirect one. As an entity it has a different set of primitive operations on it because it is measuring time rather than quantity.

If the temporal semantics of a heartbeat don't matter, and instead it's a type of gauge, then by the same reasoning the monotonically-increasing semantics of a counter shouldn't matter, because it's implementable in terms of a gauge. Of course, the reason any quantity can be implemented in terms of a gauge is that a gauge is any integer-valued function f(t). Ignoring the fact that a gauge in this implementation is a signed 64-bit integer, one could argue that a label is a kind of gauge because strings are countably infinite and so there's a bijection from them onto the integers.

But it makes sense to use Haskell's type system to encode the different usage rules for the different kinds of things we want to monitor when building software in the real world. Gauges are quantity-valued, Counters are quantity-valued but can only be incremented (though the types admit adding a negative increase), and Heartbeats operate only on times.

I will totally cop to ignoring type safety in System.Metrics.Heartbeat.read, and to following that through by making the constructor Heartbeat :: Int64 -> Value instead of Heartbeat :: UTCTime -> Value or Heartbeat :: NominalDiffTime -> Value. As a result, the rendering to an integer value happens in the sampling function instead of in each of the individual reporters. One of my questions for you was going to be "am I doing the conversion to an integer too early?".
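To make the "too early?" question concrete, here is a minimal sketch of the alternative, in which the constructor carries a duration and each reporter picks its own integer resolution. The `Value` type below is a simplified local stand-in, not ekg-core's actual type, and `sampleHeartbeat` and `renderMillis` are hypothetical names; millisecond resolution is just one choice a reporter might make.

```haskell
import Data.Int (Int64)
import Data.Time.Clock (NominalDiffTime, UTCTime, diffUTCTime)

-- | Simplified stand-in for a metric value, with a hypothetical
-- duration-carrying Heartbeat constructor added.
data Value
  = Counter !Int64
  | Gauge !Int64
  | Heartbeat !NominalDiffTime
  deriving (Eq, Show)

-- | Sampling keeps the elapsed time as a duration; nothing is rounded yet.
sampleHeartbeat :: UTCTime -> UTCTime -> Value
sampleHeartbeat now lastBeat = Heartbeat (now `diffUTCTime` lastBeat)

-- | One possible reporter: render every value as an integer, choosing
-- whole milliseconds for heartbeats at the point of reporting.
renderMillis :: Value -> Int64
renderMillis (Counter n)    = n
renderMillis (Gauge n)      = n
renderMillis (Heartbeat dt) = round (dt * 1000)
```

The trade-off in this sketch is that every reporter has to handle one more constructor, in exchange for not fixing the resolution at sampling time.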

tibbe commented 6 years ago

> the monotonically-increasing semantics of a counter shouldn't matter, because it's implementable in terms of a gauge.

The distinction makes a difference to the consumer, which is why statsd also has this distinction: if you know you're monitoring a monotonically-increasing value, you know that if the value went down it must be because the thing you monitored was restarted (or similar). This in turn means that you can accurately graph the value over time (e.g. requests/s) in the face of, e.g., failing machines.
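A small sketch of the consumer-side logic described here, assuming samples arrive as (timestamp in seconds, counter value) pairs with strictly increasing timestamps, and assuming a restarted counter begins again from zero. The names `increase` and `rates` are illustrative only and do not come from ekg-core or statsd.

```haskell
import Data.Int (Int64)

-- | One observation of a counter: (time in seconds, counter value).
type Sample = (Double, Int64)

-- | Growth between two consecutive readings. A drop is taken to mean the
-- monitored process restarted from zero, so the new reading is the growth.
increase :: Int64 -> Int64 -> Int64
increase prev cur
  | cur >= prev = cur - prev
  | otherwise   = cur

-- | Per-second rate between each pair of consecutive samples.
-- Assumes timestamps are strictly increasing.
rates :: [Sample] -> [Double]
rates samples = zipWith rate samples (drop 1 samples)
  where
    rate (t0, v0) (t1, v1) = fromIntegral (increase v0 v1) / (t1 - t0)
```

For example, `rates [(0, 100), (10, 200), (20, 30), (30, 130)]` treats the drop from 200 to 30 as a restart rather than as a negative rate. The same inference is not available for a gauge, where a decrease is a legitimate reading.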

I still don't quite see why heartbeat can't be just a gauge, could you give a client code example showing how it will be used?