petabridge / phobos-issues

Public issues and bug tracker for Phobos®
https://phobos.petabridge.com/

Metrics about heartbeat times can be useful #79

Open object opened 1 month ago

object commented 1 month ago

Occasionally we experience stability problems with our internal network, causing Akka heartbeats to take a long time and sometimes leading to SBR shutting down the cluster, with a log message like this: "SBR detected instability and will down all nodes: reachability changed 1 times since 35285,0482 ms ago, latest change was 35285,0482 ms ago".

Wouldn't it be useful if Phobos exposed some of the metrics that showed gossip response times?

Aaronontheweb commented 1 month ago

I think this would be helpful. The only thing that's a tad tricky is measuring this from outside the cluster heartbeat actors, which are the ones that do the measuring. Probably the best way to handle this would be to modify some of the system actors to emit events when things happen, so that measurement can occur out of band. This is what we did for tracking the total number of live actors, starts, and stops.
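
To illustrate the "emit events, measure out of band" idea, here is a rough, library-agnostic sketch in Python. It is not Akka.NET or Phobos code: `EventStream`, `HeartbeatRoundTrip`, `HeartbeatSender`, and `HeartbeatMetrics` are all hypothetical names made up for this example. The assumption is that the internal component already knows when a heartbeat was sent and when the response arrived, so it only needs to publish that timing, and a separate subscriber turns it into metrics.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class HeartbeatRoundTrip:
    """Hypothetical event: one heartbeat request/response pair completed."""
    node: str      # address of the node that answered
    rtt_ms: float  # measured round-trip time in milliseconds


class EventStream:
    """Tiny in-process publish/subscribe bus keyed by event type (illustrative only)."""

    def __init__(self) -> None:
        self._subscribers: Dict[type, List[Callable]] = defaultdict(list)

    def subscribe(self, event_type: type, handler: Callable) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event: object) -> None:
        # Deliver the event to every handler registered for its exact type.
        for handler in self._subscribers.get(type(event), []):
            handler(event)


class HeartbeatSender:
    """Stand-in for the internal component that already has the timing data."""

    def __init__(self, events: EventStream) -> None:
        self._events = events

    def on_response(self, node: str, sent_at_ms: float, received_at_ms: float) -> None:
        # The sender only emits an event; it knows nothing about metrics.
        self._events.publish(HeartbeatRoundTrip(node, received_at_ms - sent_at_ms))


class HeartbeatMetrics:
    """Out-of-band subscriber that records round-trip times per node."""

    def __init__(self, events: EventStream) -> None:
        self.samples: Dict[str, List[float]] = defaultdict(list)
        events.subscribe(HeartbeatRoundTrip, self._record)

    def _record(self, event: HeartbeatRoundTrip) -> None:
        self.samples[event.node].append(event.rtt_ms)

    def max_rtt_ms(self, node: str) -> float:
        return max(self.samples[node])
```

The point of the split is that the heartbeat component stays unaware of any monitoring library, and the metrics side can be attached or left out without touching the code path that does the measuring.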