Expose metrics about the connected node network on the `/metrics` endpoint

paritytech / substrate-telemetry

Polkadot Telemetry service

GNU General Public License v3.0

310 stars 205 forks source link

Expose metrics about the connected node network on the `/metrics` endpoint #381

Open jsdw opened 3 years ago

jsdw commented 3 years ago

Currently, the /metrics endpoint exposes prometheus-compatible metrics mainly focused on evaluating the runtime performance of the telemetry deployment.

@chevdor mentioned that it would also be a great place to expose network related metrics, such as which versions of substrate are currently in use (and by how many nodes). I could certainly see this being quite handy to track!

Other thoughts on specific network related metrics that would be valuable to expose are welcome!

lovelaced commented 3 years ago

@jsdw this could potentially have very high/arbitrary cardinality and, if the network name was a label, it could be used to DDoS any prometheus server a publicly exposed instance is connected to. Builds by people who are compiling their own commits and/or building their own networks will add lots of label complexity unfortunately. This would indeed be very interesting information but it would best be exposed as some sort of JSON so it could be scraped into a database or something rather than use prometheus for this. @gabreal any thoughts?

dvdplm commented 3 years ago

Maybe we can control the cardinality in the code to, say 500 chains max. @lovelaced Is that an acceptable number?

lovelaced commented 3 years ago

From https://www.robustperception.io/cardinality-is-key (a highly reliable source on prometheus best practices):

As a general rule of thumb I'd avoid having any metric whose cardinality on a /metrics could potentially go over 10 due to growth in label values. The way I think of it is that if it has already grown to be 10 today, it might be 15 in a year's time. It can be additionally okay to have no more than a literal handful within a given Prometheus that go to 100.

my worry is that the multiplier will be huge - each chain having who knows how many versions of their own binaries would be a cardinality explosion.

dvdplm commented 3 years ago

Ok, that's good info. FWIW we distinguish chains based on genesis hash, so a mere label change is not enough for a DDoS, they'd have to spin up actual new chains.

Perhaps we could have an allowlist then, that includes the chains we (Parity) cares about the most?

lovelaced commented 3 years ago

Yeah, that sounds better to me. Keep in mind cardinality is also multiplicative (see previous link), so every unique k=v combination is an increase of 1 - so it would still be DDoSable if someone spun up a ton of nodes with custom binary identifiers on Polkadot, for instance.

dvdplm commented 3 years ago

dq: Does Prometheus have any measures in place to stop ingesting data when it's too much, dunno, put a server on "chill" and simply stop collecting metrics from it when it's too much (or decrease collection frequency)?

lovelaced commented 3 years ago

If there is, I'm not aware of it- and that's a pretty hacky/bad way to deal with it anyway. We should be actively keeping tabs on the amount of series our prometheus server ingests (and we do). This number should be constant accounting for the number of servers (as in, there should be no labels which have arbitrary growth, really).

gabreal commented 3 years ago

for the metrics it is important to consider the dimensions that endpoint will provide. while exporting the version as a label in polkadot (label is only on one metric polkadot_build_info) is alright for a limited amount of servers we control, software like the matrix server expose unbounded metrics which can easily go beyond 2Mb. it would be alright imo to limit the label set by the software like with the above mentioned allowlist of networks and the build info reduced to tagged releases. it is interesting information. w3f made an exporter for substrate telemetry data.

dvdplm commented 3 years ago

w3f made an exporter for substrate telemetry data.

Interesting. It's essentially a rewrite of our substrate-analytics.