ralexstokes / relay-monitor

Add uptime to the relay monitoring schema #34

Open taylorjdawson opened 1 year ago

taylorjdawson commented 1 year ago

It would be nice to have a standard way to measure a relay's uptime. Currently we have the /eth/v1/builder/status endpoint, but it is inadequate: there is no way to tell whether a relay is creating artificial uptime by returning a static page.

Relay monitor has this endpoint: /monitor/v1/faults. It would be nice to either: a) rename /monitor/v1/faults to /monitor/v1/stats and include { faults: {...} } as part of the payload, or b) add a new /monitor/v1/stats endpoint that includes an uptime stat along with other relevant metrics.
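To make option (a) concrete, here is a rough sketch of what a combined payload could look like. The `uptime` field names and the `build_stats_payload` helper are hypothetical, not part of the relay monitor today; the `faults` value is passed through untouched, whatever shape /monitor/v1/faults currently returns.

```python
from typing import Any, Optional


def build_stats_payload(
    faults: dict[str, Any], polls_ok: int, polls_total: int
) -> dict[str, Any]:
    """Wrap the existing faults report in a broader stats document."""
    # Avoid dividing by zero when no polls have been recorded yet.
    ratio: Optional[float] = polls_ok / polls_total if polls_total else None
    return {
        # Unchanged body of the current /monitor/v1/faults response.
        "faults": faults,
        # Hypothetical uptime section; the field names are an assumption.
        "uptime": {
            "successful_polls": polls_ok,
            "total_polls": polls_total,
            "ratio": ratio,
        },
    }
```

Nesting the existing report under `faults` would keep its shape intact, so current consumers would only need to read one level deeper.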

ralexstokes commented 1 year ago

uptime would be neat to see

my only concern is getting a super precise signal but if we are ok w/ some lossy-ness then I think the relay monitor could support this

do you have anything particular in mind?

I would think to start w/ a simple poll of /eth/v1/builder/status although you raise a good point about static or cached assets making this endpoint a little less meaningful
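A minimal sketch of that simple poll, assuming uptime is just the fraction of successful status checks. The `poll_status` and `uptime_ratio` names and the 2 second timeout are made up for illustration, and a static or cached page would still count as "up" here.

```python
import urllib.error
import urllib.request


def poll_status(relay_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET /eth/v1/builder/status answers 200 within the timeout."""
    url = relay_url.rstrip("/") + "/eth/v1/builder/status"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Non-2xx responses, refused connections, and timeouts all count as down.
        return False


def uptime_ratio(samples: list[bool]) -> float:
    """Fraction of polls that succeeded; 0.0 when there are no samples."""
    if not samples:
        return 0.0
    return sum(samples) / len(samples)
```

The signal is lossy by construction: an outage shorter than the poll interval is invisible, so the interval bounds the precision.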

I don't really want to maintain a set of routines per relay to do some arbitrary liveness check though...

another thought I had was to encourage an ecosystem norm that relays expose metrics, although I'm not sure how to do this in a DoS-resistant way; another option is to encourage a norm that relays just expose a liveness check for this purpose, with reputation backing the claim that it is a reliable signal and not cached in some way

what do you think?

metachris commented 1 year ago

What's the goal you want to accomplish?

I'm not sure uptime is an important metric for relays. They can be down and it has no impact if they don't submit any bids. Therefore I'm not sure this is a relevant task that should be added to the relay monitor responsibilities.

For reference, there's also the discussion here about having the relay status endpoint return the latest slot to prove it's not just a static page. Alternatively, you could just call getHeader on every slot and see the actual latency and uptime based on that? (although not every relay would provide a bid for every slot)
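A rough sketch of the getHeader idea, assuming the builder API path /eth/v1/builder/header/{slot}/{parent_hash}/{pubkey}, where 200 carries a bid and 204 means no bid is available. The `Probe`, `header_path`, `classify`, and `uptime` names are hypothetical. Treating every other status as down is a simplification, since an error response can still come from a live relay.

```python
from enum import Enum
from typing import Optional


class Probe(Enum):
    BID = "bid"        # relay answered with a bid
    NO_BID = "no_bid"  # relay answered but had nothing to offer
    DOWN = "down"      # error status, or no answer at all


def header_path(slot: int, parent_hash: str, pubkey: str) -> str:
    """Build the getHeader request path for one slot."""
    return f"/eth/v1/builder/header/{slot}/{parent_hash}/{pubkey}"


def classify(status: Optional[int]) -> Probe:
    """Map an HTTP status (None = timeout or connection failure) to a result."""
    if status == 200:
        return Probe.BID
    if status == 204:
        return Probe.NO_BID
    return Probe.DOWN


def uptime(results: list[Probe]) -> float:
    """Count BID and NO_BID as up, since a relay need not bid on every slot."""
    if not results:
        return 0.0
    up = sum(r is not Probe.DOWN for r in results)
    return up / len(results)
```

Keeping NO_BID separate from DOWN addresses the caveat above: a relay that answers without a bid is still reachable, so it should not be penalized on uptime.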