Internal metrics collection

jsdelivr / globalping

A global network of probes to run network tests like ping, traceroute and DNS resolve

https://www.jsdelivr.com/globalping


Internal metrics collection #200

Open jimaek opened 2 years ago

jimaek commented 2 years ago

For tasks like https://github.com/jsdelivr/globalping/issues/37 and many others we will need to know more about the probes.

Here is a list of data we would probably need to collect to build the next features:

Accepted, in-progress, finished, and failed tests. Accepted and in-progress are probably just real-time values, while the rest are time series. Maybe both in total and per test type as well?
CPU load and CPU cores available
Uptime
Other?

We should probably store them for up to 7 days in a time series DB. Note that whatever DB we choose will need to scale, because in the future we will also support scheduled tests, such as pinging the same target every minute and charting performance over time per region.
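To make the 7-day window concrete, here is a minimal sketch of pruning expired points from an in-memory series. The `MetricPoint` shape and the names `RETENTION_MS` and `pruneExpired` are assumptions for illustration, not part of Globalping; a dedicated time series DB would typically enforce retention itself.

```typescript
// Hypothetical retention window: 7 days, expressed in milliseconds.
const RETENTION_MS = 7 * 24 * 60 * 60 * 1000;

// Hypothetical shape of one stored measurement.
interface MetricPoint {
  timestamp: number; // Unix time in milliseconds
  value: number;
}

// Returns only the points that are still inside the retention window.
function pruneExpired(
  points: MetricPoint[],
  now: number,
  retentionMs: number = RETENTION_MS,
): MetricPoint[] {
  return points.filter((point) => now - point.timestamp <= retentionMs);
}
```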

patrykcieszkowski commented 2 years ago

How often would we collect CPU/mem stats? And how often would we push them to the API?

Also, how exactly do we define uptime? How long has the probe been connected, or what do we do if the probe reconnects?

jimaek commented 2 years ago

> How often would we collect CPU/mem stats? And how often would we push them to the API?

Ideally every few seconds, e.g. every 10s. The more accurate the data, the more we can later do with test routing between probes. No local buffering: collect and push immediately.
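As a rough sketch of what a probe could compute on each 10s tick, the function below derives CPU load from two snapshots of per-core counters. The `CpuTimes` shape mirrors the `times` object returned by Node's `os.cpus()`; the `ProbeStatsMessage` payload and both function names are assumptions for illustration, not the actual Globalping protocol.

```typescript
// Per-core counters, same fields as the `times` object from Node's os.cpus().
interface CpuTimes {
  user: number;
  nice: number;
  sys: number;
  idle: number;
  irq: number;
}

// Hypothetical payload a probe might push to the API right after collection.
interface ProbeStatsMessage {
  cpuLoad: number; // busy fraction between 0 and 1, across all cores
  cpuCores: number;
  collectedAt: number; // Unix time in milliseconds
}

// Busy fraction across all cores between two snapshots.
function cpuLoadBetween(prev: CpuTimes[], next: CpuTimes[]): number {
  let busy = 0;
  let total = 0;
  prev.forEach((p, i) => {
    const n = next[i];
    if (!n) return; // core missing in the newer snapshot, skip it
    const idleDelta = n.idle - p.idle;
    const busyDelta =
      (n.user - p.user) + (n.nice - p.nice) + (n.sys - p.sys) + (n.irq - p.irq);
    busy += busyDelta;
    total += busyDelta + idleDelta;
  });
  return total > 0 ? busy / total : 0;
}

// Builds the message without buffering; the caller sends it immediately.
function buildStatsMessage(
  prev: CpuTimes[],
  next: CpuTimes[],
  now: number,
): ProbeStatsMessage {
  return {
    cpuLoad: cpuLoadBetween(prev, next),
    cpuCores: next.length,
    collectedAt: now,
  };
}
```

A probe could call `buildStatsMessage` from a timer, passing the previous and current `os.cpus().map((cpu) => cpu.times)` snapshots, and emit the result over its existing connection.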

> Also, how exactly do we define uptime? How long has the probe been connected, or what do we do if the probe reconnects?

Probably how long the probe has been connected, unless there is a better metric. We can technically measure this from both the probe and the API, so in this case it's probably better to collect this data at the API level?

If the probe disconnects and reconnects in the "ready" state after 5 seconds, the probe had a downtime of 5s.
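The downtime rule above can be sketched on the API side as a pass over connection events. The `ConnectionEvent` type and `totalDowntimeMs` are hypothetical names, and the sketch assumes events arrive sorted by time.

```typescript
// Hypothetical event recorded by the API when a probe changes state.
type ConnectionEvent = {
  type: 'ready' | 'disconnect';
  at: number; // Unix time in milliseconds
};

// Sums the gaps between each disconnect and the next "ready" event.
// Assumes `events` is sorted by `at` in ascending order.
function totalDowntimeMs(events: ConnectionEvent[], now: number): number {
  let downtime = 0;
  let downSince: number | null = null;

  for (const event of events) {
    if (event.type === 'disconnect' && downSince === null) {
      downSince = event.at; // downtime starts
    } else if (event.type === 'ready' && downSince !== null) {
      downtime += event.at - downSince; // downtime ends
      downSince = null;
    }
  }

  // A probe that is still disconnected accrues downtime up to `now`.
  if (downSince !== null) {
    downtime += now - downSince;
  }
  return downtime;
}
```

With events `ready` at 0, `disconnect` at 10000 and `ready` at 15000, the function returns 5000, matching the 5s example.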

patrykcieszkowski commented 2 years ago

@MartinKolarik any idea how to reasonably store this data? Ideally we would keep it all under a single record, but since we need older data to expire, we can't do that. I'm not convinced storing each measurement separately is a good idea either. Maybe we could group them by 24hr periods, per key?

gp:probe:stats:15-08-22
{
    "cpu": [ { "date": "123", ... }, ... ],
    "mem": [ { "date": "123", ... }, ... ]
}
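A minimal sketch of how such per-day keys could be generated, assuming the date suffix in the example is DD-MM-YY in UTC (the thread does not say which). The names `dailyStatsKey` and `keysForLastDays` are hypothetical.

```typescript
// Key prefix taken from the example above.
const STATS_KEY_PREFIX = 'gp:probe:stats';

// Left-pads a number to two digits, e.g. 8 -> "08".
function pad2(n: number): string {
  return (n < 10 ? '0' : '') + String(n);
}

// Builds the key for the 24h bucket containing `date`.
// Assumption: the suffix is DD-MM-YY, evaluated in UTC.
function dailyStatsKey(date: Date): string {
  const dd = pad2(date.getUTCDate());
  const mm = pad2(date.getUTCMonth() + 1); // getUTCMonth() is zero-based
  const yy = pad2(date.getUTCFullYear() % 100);
  return `${STATS_KEY_PREFIX}:${dd}-${mm}-${yy}`;
}

// Lists the keys covering the last `days` days, newest first.
function keysForLastDays(now: Date, days: number): string[] {
  const DAY_MS = 24 * 60 * 60 * 1000;
  const keys: string[] = [];
  for (let i = 0; i < days; i++) {
    keys.push(dailyStatsKey(new Date(now.getTime() - i * DAY_MS)));
  }
  return keys;
}
```

One upside of grouping by day is that expiry becomes a per-key concern: each bucket can be given its own expiry when it is created, instead of trimming individual measurements.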

I'm not sure how to record uptime/downtime.

jimaek commented 2 years ago

We could in theory begin with only real-time data, sent as part of the WebSocket pings, so that the API would always have accurate info on CPU load.

But in any case, if we use a time series DB, the exact format will depend on its rules.

MartinKolarik commented 2 years ago

Real-time-only can go into redis in whatever format... Historical data would depend on the selected storage, which likely won't be redis.

jimaek commented 2 years ago

So to me it seems for now we need these real-time values:

CPU load
Available CPU cores
In progress tests
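As an illustration of how these three values might be held in Redis, the sketch below converts a stats object to and from the flat string fields that a Redis hash stores. The `gp:probe:realtime:` key prefix, the field names, and all function names are assumptions; the thread leaves the real-time format open.

```typescript
// The three real-time values listed above.
interface ProbeRealtimeStats {
  cpuLoad: number;
  cpuCores: number;
  testsInProgress: number;
}

// Hypothetical key naming, one hash per probe.
function realtimeKey(probeId: string): string {
  return `gp:probe:realtime:${probeId}`;
}

// Redis hash values are strings, so numbers are serialized explicitly.
function toHashFields(stats: ProbeRealtimeStats): Record<string, string> {
  return {
    cpuLoad: String(stats.cpuLoad),
    cpuCores: String(stats.cpuCores),
    testsInProgress: String(stats.testsInProgress),
  };
}

// Parses the string fields read back from the hash into numbers.
function fromHashFields(fields: Record<string, string>): ProbeRealtimeStats {
  return {
    cpuLoad: Number(fields['cpuLoad']),
    cpuCores: Number(fields['cpuCores']),
    testsInProgress: Number(fields['testsInProgress']),
  };
}
```

Because each new push overwrites the previous values, no history accumulates, which fits the real-time-only scope.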

Then we need to select a timeseries DB out of the 100 that exist now and start storing:

Uptime per probe, probably pushed by the API
Accepted/successful/failed tests

Since the real-time part only needs Redis we can implement that part first. @MartinKolarik what do you think?

MartinKolarik commented 2 years ago

Yes I agree, implement the first part now using only redis as that's fairly straightforward. The other part I'd postpone until after #176.