paritytech / substrate-telemetry

Polkadot Telemetry service
GNU General Public License v3.0

Update telemetry buckets #535

Closed by mrcnski 8 months ago

mrcnski commented 1 year ago

ISSUE

Overview

telemetry.polkadot.io is not always very helpful:

[Screenshot, 2023-05-30: kernel version distribution on telemetry.polkadot.io]

76% of validators are on kernel version "Other", which doesn't say much. :P

Proposal

Clean up the string (i.e. remove the -148 part) before displaying, but allow users to view the full data somehow.
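For illustration, a rough sketch of what that cleanup could look like (the `clean_kernel_version` helper below is hypothetical, not existing code):

```rust
/// Hypothetical helper: keep only the upstream version of a kernel release
/// string, e.g. "5.10.0-148-generic" -> "5.10.0".
fn clean_kernel_version(raw: &str) -> &str {
    // Everything before the first '-' is the upstream kernel version;
    // the "-148" style suffix is the distro build/patch tag.
    raw.split('-').next().unwrap_or(raw)
}

fn main() {
    assert_eq!(clean_kernel_version("5.10.0-148-generic"), "5.10.0");
    assert_eq!(clean_kernel_version("6.1.55"), "6.1.55");
}
```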

Other buckets that could be improved:

  1. CPU: 60.63% | 1206 | Other
  2. Version: 22.42% | 447 | Other
  3. Memory: 39.23% | 781 | At least 64 GB (64 GB bucket is missing and I figure that's the most common one...)
jsdw commented 9 months ago

Thanks for the issue @mrcnski; yeah, I agree that that's less than ideal! Possibly the string-cleaning bit can be done purely in the UI. Re the CPU and memory etc., I'd have to remember what substrate provides; maybe we need to tweak something upstream, or maybe we can just add e.g. the 64 GB bucket here :)

lexnv commented 8 months ago

Proposed Solution

To provide more insightful information, I've created a PR that handles the parsing entirely in the backend:

One downside of this is that the full kernel version is no longer visible to the end user; however, the upside is that we group kernels more uniformly by their version. In other words, we'll display more kernel version numbers by loosening the grouping criteria.
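Roughly, the grouping idea looks like the sketch below (illustrative only; the helper is made up and is not the actual PR code):

```rust
use std::collections::HashMap;

/// Sketch: count nodes per upstream kernel version, so that e.g.
/// "5.10.0-148" and "5.10.0-152" both land in the "5.10.0" bucket.
fn kernel_version_buckets<'a>(raw_versions: &[&'a str]) -> HashMap<&'a str, usize> {
    let mut buckets: HashMap<&'a str, usize> = HashMap::new();
    for &raw in raw_versions {
        let key = raw.split('-').next().unwrap_or(raw);
        *buckets.entry(key).or_insert(0) += 1;
    }
    buckets
}

fn main() {
    let counts = kernel_version_buckets(&["5.10.0-148", "5.10.0-152", "6.1.55"]);
    assert_eq!(counts["5.10.0"], 2);
    assert_eq!(counts["6.1.55"], 1);
}
```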

Alternatives

We could send the full list of kernel versions and CPU models to the frontend and let it handle all of the parsing. However, we would want a bounded limit on the lists, say 100 kernel versions, to avoid overloading the frontend.

In this case, we might have ~20 entries with the same kernel version (5.10.0) but different patch versions or commits; with that grouping heuristic, we'd probably end up displaying around 5 unique kernel versions.
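If we went that route, the bound could be as simple as keeping the top N entries by count. A sketch (the `top_n` helper and the limit are purely illustrative):

```rust
/// Sketch: keep only the `limit` most common entries before sending them on.
fn top_n(mut entries: Vec<(String, usize)>, limit: usize) -> Vec<(String, usize)> {
    // Sort by count, highest first, then drop everything past the bound.
    entries.sort_by(|a, b| b.1.cmp(&a.1));
    entries.truncate(limit);
    entries
}

fn main() {
    let entries = vec![
        ("5.10.0".to_string(), 20),
        ("5.15.0".to_string(), 12),
        ("6.1.55".to_string(), 3),
    ];
    let bounded = top_n(entries, 2);
    assert_eq!(bounded.len(), 2);
    assert_eq!(bounded[0].0, "5.10.0");
}
```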

For the CPU vendor, we could update the substrate binary to also provide this string, which would probably lead to more vendors being discovered. However, I believe we can build on the proposed solution and expand it in the future if we decide it's important enough.

For memory, we could add more buckets within the [64, 128) GiB range if we decide that doesn't offer enough granularity. For more details, check the stream hardware survey.
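For example, finer-grained memory buckets might look roughly like this (a sketch only; the boundaries and labels are made up):

```rust
/// Sketch: map a node's reported memory (in GiB) to a display bucket,
/// with an explicit [64, 128) GiB bucket instead of a single open-ended
/// "At least 64 GB" category. Boundaries and labels are illustrative only.
fn memory_bucket(gib: u64) -> &'static str {
    match gib {
        0..=15 => "Less than 16 GiB",
        16..=31 => "[16, 32) GiB",
        32..=63 => "[32, 64) GiB",
        64..=127 => "[64, 128) GiB",
        _ => "At least 128 GiB",
    }
}

fn main() {
    assert_eq!(memory_bucket(64), "[64, 128) GiB");
    assert_eq!(memory_bucket(256), "At least 128 GiB");
}
```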

@mrcnski would love to hear your thoughts on this 🙏

mrcnski commented 8 months ago

Thanks @lexnv! I'm just wondering: what are the pros and cons of sending all the data to the frontend? It might be useful for the frontend to have all the stats. I don't see why 100 versions would overload it; that doesn't seem like a lot. How many nodes are sending the data?

That aside, the proposal sounds really good to me!

lexnv commented 8 months ago

The downside of sending all the data to the frontend is that we'd probably have a very large payload, potentially scaling with the number of nodes in the network (presuming the nodes are targeting the substrate telemetry endpoint). That could lead to quite a large number of entries. However, to make an informed decision we'd probably have to inspect the telemetry core entries, or deploy a new telemetry core in beta / rococo.

The upside would be that we'd have access to all the data directly in the frontend. The frontend would then perform the parsing and could optionally display extra information (possibly all of it) about kernel versions, CPU names, etc.

Thanks for taking a look at this!

mrcnski commented 8 months ago

That makes sense! Just to make sure I understand the architecture correctly: the nodes send their data to the telemetry endpoint (the backend), which then sends the aggregated payload to the frontend?

jsdw commented 8 months ago

That makes sense! Just to make sure I understand the architecture correctly: the nodes send their data to the telemetry endpoint (the backend), which then sends the aggregated payload to the frontend?

That's right, yup :)