jsdw opened this issue 3 years ago
@jsdw this could potentially have very high/arbitrary cardinality and, if the network name were a label, it could be used to DDoS any Prometheus server a publicly exposed instance is connected to. Builds by people who are compiling their own commits and/or building their own networks would unfortunately add a lot of label complexity. This would indeed be very interesting information, but it would be better exposed as some sort of JSON that could be scraped into a database or similar, rather than using Prometheus for it. @gabreal any thoughts?
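For illustration only: a minimal sketch using the Rust `prometheus` crate (metric and label names are made up here, this is not the telemetry code) of why an attacker-controlled label value is dangerous: every distinct value becomes its own time series on `/metrics`.

```rust
// Minimal sketch: each distinct `chain` label value is a separate series,
// so whoever controls the value controls the size of the scrape output.
use prometheus::{Encoder, IntGaugeVec, Opts, Registry, TextEncoder};

fn main() {
    let registry = Registry::new();
    let nodes = IntGaugeVec::new(
        Opts::new("telemetry_connected_nodes", "Connected nodes per chain"),
        &["chain"], // one series per distinct chain value
    )
    .unwrap();
    registry.register(Box::new(nodes.clone())).unwrap();

    nodes.with_label_values(&["Polkadot"]).set(1200);
    nodes.with_label_values(&["Kusama"]).set(900);

    // An attacker-chosen name adds one more series per unique value:
    for i in 0..10_000 {
        let name = format!("spam-chain-{i}");
        nodes.with_label_values(&[name.as_str()]).set(1);
    }

    let mut buf = Vec::new();
    TextEncoder::new().encode(&registry.gather(), &mut buf).unwrap();
    // /metrics now carries ~10k series for this single metric.
    println!("{} bytes of /metrics output", buf.len());
}
```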
Maybe we could cap the cardinality in the code at, say, 500 chains max. @lovelaced is that an acceptable number?
From https://www.robustperception.io/cardinality-is-key (a highly reliable source on prometheus best practices):
> As a general rule of thumb I'd avoid having any metric whose cardinality on a `/metrics` could potentially go over 10 due to growth in label values. The way I think of it is that if it has already grown to be 10 today, it might be 15 in a year's time. It can be additionally okay to have no more than a literal handful within a given Prometheus that go to 100.
My worry is that the multiplier will be huge: each chain having who knows how many versions of their own binaries would be a cardinality explosion.
Ok, that's good info. FWIW we distinguish chains based on genesis hash, so a mere label change is not enough for a DDoS; they'd have to spin up actual new chains.
Perhaps we could have an allowlist then, that includes the chains we (Parity) care about the most?
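A rough sketch of what that allowlist could look like (the chain names and the "other" bucket here are just assumptions, not an agreed design): anything not on the list collapses into a single label value, so the series count for the metric stays bounded.

```rust
// Hypothetical allowlist: unknown chains are reported under a single "other"
// label value, keeping the metric at a handful of series.
const CHAIN_ALLOWLIST: &[&str] = &["Polkadot", "Kusama", "Westend", "Rococo"];

fn chain_label(chain_name: &str) -> &str {
    if CHAIN_ALLOWLIST.contains(&chain_name) {
        chain_name
    } else {
        "other"
    }
}
```

With that in place, `chain_label("my-local-testnet")` would simply show up as `chain="other"`.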
Yeah, that sounds better to me. Keep in mind cardinality is also multiplicative (see the previous link), so every unique k=v combination adds another series - it would still be DDoSable if someone spun up a ton of nodes with custom binary identifiers on Polkadot, for instance.
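(To put rough, illustrative numbers on the multiplication: 500 allowed chains × even 100 distinct version strings would already be 50,000 series for a single metric, and a free-form binary identifier means that second factor is attacker-controlled.)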
dq: Does Prometheus have any measures in place to stop ingesting data when it's too much - dunno, put a server on "chill" and simply stop collecting metrics from it (or decrease collection frequency)?
If there is, I'm not aware of it - and that's a pretty hacky/bad way to deal with it anyway. We should be actively keeping tabs on the number of series our Prometheus server ingests (and we do). That number should stay roughly constant relative to the number of servers (as in, there should be no labels with arbitrary growth, really).
For the metrics it is important to consider the dimensions that endpoint will provide. While exporting the version as a label in Polkadot (the label is only on one metric, `polkadot_build_info`) is alright for the limited number of servers we control, software like the Matrix server exposes unbounded metrics whose scrape output can easily go beyond 2 MB. It would be alright IMO for the software to limit the label set, e.g. with the above-mentioned allowlist of networks and the build info reduced to tagged releases. It is interesting information. w3f made an exporter for substrate telemetry data.
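A sketch of the "build info reduced to tagged releases" idea (the function, regex, and crate choice are mine, purely illustrative): keep only a release-tag-shaped prefix of the version string and collapse everything else, so custom builds can't mint arbitrary new label values. A real implementation might instead check against the list of actual release tags.

```rust
// Illustrative only: reduce a node's reported version string to its
// tagged-release prefix so the label set stays small. Uses the `regex` crate.
use regex::Regex;

fn version_label(raw: &str) -> String {
    // e.g. "0.9.42-abc1234-x86_64-linux-gnu" -> "0.9.42";
    // anything without a semver-looking prefix collapses into "untagged".
    let tag = Regex::new(r"^v?\d+\.\d+\.\d+").unwrap();
    match tag.find(raw) {
        Some(m) => m.as_str().to_string(),
        None => "untagged".to_string(),
    }
}
```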
> w3f made an exporter for substrate telemetry data.
Interesting. It's essentially a rewrite of our substrate-analytics.
Currently, the `/metrics` endpoint exposes prometheus-compatible metrics mainly focused on evaluating the runtime performance of the telemetry deployment. @chevdor mentioned that it would also be a great place to expose network-related metrics, such as which versions of substrate are currently in use (and by how many nodes). I could certainly see this being quite handy to track!

Other thoughts on specific network-related metrics that would be valuable to expose are welcome!
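For concreteness, here is one rough shape the requested metric could take, again as a hedged sketch with the Rust `prometheus` crate: the metric and function names are invented, and the label values are assumed to already be filtered along the lines discussed in the thread (chain allowlist, tagged-release versions).

```rust
// Rough sketch only: a per-chain/per-version connected-node count.
use prometheus::{IntGaugeVec, Opts, Registry};
use std::collections::HashMap;

fn update_version_metric(
    gauge: &IntGaugeVec,
    node_counts: &HashMap<(String, String), i64>, // (chain, version) -> node count
) {
    gauge.reset(); // drop series for versions that have disappeared
    for ((chain, version), count) in node_counts {
        gauge
            .with_label_values(&[chain.as_str(), version.as_str()])
            .set(*count);
    }
}

fn main() {
    let registry = Registry::new();
    let gauge = IntGaugeVec::new(
        Opts::new("telemetry_node_versions", "Connected nodes per chain and version"),
        &["chain", "version"],
    )
    .unwrap();
    registry.register(Box::new(gauge.clone())).unwrap();

    let mut counts = HashMap::new();
    counts.insert(("Polkadot".to_string(), "0.9.42".to_string()), 850);
    counts.insert(("Kusama".to_string(), "0.9.41".to_string()), 420);
    update_version_metric(&gauge, &counts);
    // Rendered on /metrics roughly as:
    //   telemetry_node_versions{chain="Polkadot",version="0.9.42"} 850
}
```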