shyamvalsan opened this issue 2 years ago
cc: @ilyam8 @ktsaou @ralphm @cakrit @amalkov @sashwathn
@ktsaou @stelfrag @ilyam8 following on from @ilyam8's comment https://github.com/netdata/netdata/issues/13639#issuecomment-1240433894
Every hour, and especially every 24 hours, will not work.
Every 24hr doesn't work in general. A few notes about the current Netdata:
- Charts appear only after the 2nd data collection.
- Metrics can't be collected at an arbitrary time: collections are aligned to the interval, so with a 1hr interval we have to start at 0x:00 (we can't start at 0x:33).
- As a result, with a 1hr interval the worst case is waiting up to 1hr for the first collection and then another hour for the 2nd one, before the chart appears (see the sketch below).
- I can see how 1 minute, or probably 5 minutes at most, can work, but not above that.
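To make that worst case concrete, here is a minimal Go sketch of the delay described above; it assumes collections are aligned to update_every and that a chart needs two samples before it appears (the names and the start time are purely illustrative):

```go
package main

import (
	"fmt"
	"time"
)

// chartVisibleAt estimates when a chart would first appear, assuming
// collections are aligned to update_every and the chart needs 2 samples.
func chartVisibleAt(start time.Time, updateEvery time.Duration) time.Time {
	firstSample := start.Truncate(updateEvery).Add(updateEvery) // next aligned slot
	return firstSample.Add(updateEvery)                         // 2nd sample -> chart appears
}

func main() {
	start := time.Date(2022, 9, 8, 10, 33, 0, 0, time.UTC) // agent starts at 10:33
	fmt.Println(chartVisibleAt(start, time.Hour))           // 12:00 UTC, almost 1.5h later
}
```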
How can we improve Netdata to remove this limitation? Users should be able to select a collection interval of their choosing - many metrics make more sense collected every 5min/1hr/daily/etc.
If the first 2 intervals being empty is the problem, can we not collect more frequently for the first couple of intervals and then settle into the longer collection interval?
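As a purely hypothetical sketch of that idea (not existing Netdata behavior), a scheduler could hand out a short warm-up interval until two samples exist and only then switch to the configured long interval:

```go
package schedule

import "time"

// nextInterval returns a short warm-up interval until the chart has its two
// required samples, then the configured long interval. Hypothetical sketch.
func nextInterval(samplesCollected int, warmup, configured time.Duration) time.Duration {
	if samplesCollected < 2 {
		return warmup // e.g. 1s or 5s, just to get the chart on screen quickly
	}
	return configured // e.g. 1h once the chart exists
}
```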
The data collection intervals Netdata currently uses are too aggressive in many cases. I'm just looking at the PostgreSQL collector as an example, but I believe the same will be true for other collectors as well. If we want to implement something like this generic SQL collector, this will be relevant too.
cc: @cakrit @amalkov @sashwathn @ralphm
The idea is good, and I agree that we need flexibility. The question is: when do we need to do it, and how long will it take? Lower granularity (longer intervals) means that we are outside of troubleshooting mode; it is more about monitoring and reporting...
We need to understand the effort first.
To clarify this a bit: the issue is not so much that all metrics have to share the same granularity (we can set specific collection frequencies on external collectors - for example, the Postfix queue collector defaults to a 3 second collection interval, and a handful of others default to 5 or 10 seconds); it's that the current chart handling means that collecting on long intervals leads to unexpected behavior from the user's perspective.
It's worth noting that a couple of specific collectors indirectly work around the issues we have with long collection cycles by having an external program log data at the desired collection frequency, and then having Netdata poll that log at its own frequency. The SMART collector does this, for example: it relies on smartd running and logging data at the desired actual collection frequency, with the Agent then polling the log file at a different frequency. I don't know that that's practical for all metrics that would need a long collection cycle, but it might be worth considering integrating similar functionality into the plugin API directly (IOW, allow a collector to just log the metrics at whatever frequency it wants, and then the Agent just polls the log at the global collection frequency).
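A minimal sketch of that log-and-poll pattern, assuming a hypothetical log file to which an external program appends one value per line at its own long interval, while the collector simply re-reads the most recent line on every agent cycle:

```go
package poller

import (
	"bufio"
	"os"
	"strconv"
)

// lastLoggedValue returns the most recent value appended to the log file by
// an external program running at its own (long) collection interval.
func lastLoggedValue(path string) (float64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	var last string
	sc := bufio.NewScanner(f)
	for sc.Scan() { // keep only the last non-empty line
		if line := sc.Text(); line != "" {
			last = line
		}
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	return strconv.ParseFloat(last, 64)
}
```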
Guys, it is a non-fixable problem in general, at least for counters (incremental values in Netdata). You need 2 samples to calculate the delta (or rate), so the minimum time would be update_every, whatever it is (e.g. 1hr).
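Put differently (illustrative only, not the actual Netdata internals), a counter rate is a delta over elapsed time, so the first point simply cannot exist before the second sample:

```go
package rate

import "time"

// perSecondRate needs both a previous and a current sample; with
// update_every = 1h the first chart point is therefore up to 2h away.
func perSecondRate(prev, cur uint64, elapsed time.Duration) float64 {
	if elapsed <= 0 || cur < prev {
		return 0 // counter reset or bad interval
	}
	return float64(cur-prev) / elapsed.Seconds()
}
```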
If I am not mistaken, the same goes for gauges (absolute values in Netdata): the chart shows the first value only after 2 data collections. That looks like a bug to me - we don't calculate a delta, so there is no need to wait for the 2nd sample.
I believe there is a workaround for that (can't say if it works now or not) - the store_first chart option (CTRL-F store_first).
There is another feature that contributes to the problem - all data collections have to be aligned to update_every. So, if update_every is 1hr, the first collection has to wait for the next aligned boundary (0x:00), as described above.
> IOW, allow a collector to just log the metrics at whatever frequency it wants, and then the Agent just polls the log at the global collection frequency
What we do now is:
Pros:
Cons:
Problem
There are many metrics which are important to the user and which Netdata should be collecting, but for which a granularity of 1s does not make sense. Certain metrics make sense at a 30s/60s/300s granularity, for example, while other metrics do indeed provide more value at 1s granularity.
Currently there is a limitation that all metrics need to be collected at the same granularity, which causes a problem in the above scenario.
Description
Netdata collectors should support collecting different metrics at different granularities. This will allow Netdata to collect a wider range of critical metrics without compromising on high-fidelity data collection for metrics where 1s collection matters, OR on performance for metrics where 60s or 300s collection is enough.
An example is bloat metrics for Postgres tables and indexes - these are too heavy/intensive to be collected every second, but they provide a lot of value and would be very useful collected at a lower granularity (e.g. every 5 minutes), while keeping other Postgres metrics at 1s granularity.
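A minimal sketch of what per-metric granularity could look like inside a single collector, with hypothetical names (this is not the real go.d PostgreSQL collector): cheap metrics are queried every cycle, expensive ones only when their own interval has elapsed.

```go
package collector

import "time"

// metricGroup bundles metrics that share a collection interval.
type metricGroup struct {
	every   time.Duration           // e.g. 300 * time.Second for bloat metrics
	lastRun time.Time
	collect func() map[string]int64 // the expensive query, e.g. table/index bloat
}

// maybeCollect runs the group's query only when its own interval has elapsed;
// otherwise the previously reported values are kept for this cycle.
func (g *metricGroup) maybeCollect(now time.Time, out map[string]int64) {
	if now.Sub(g.lastRun) < g.every {
		return // skip the heavy query this cycle
	}
	g.lastRun = now
	for k, v := range g.collect() {
		out[k] = v
	}
}
```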
Importance
must have
Value proposition
Proposed implementation
For metrics with a granularity larger than 1s, an option would be to report the 'last collected value' to the cloud until the next collection. This means that even though the metric is collected only every X seconds, the charts on the frontend do not need to change - they will just show the same value for the duration of the granularity period.
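A minimal sketch of that "repeat the last collected value" idea, again with hypothetical names: the chart keeps its 1s resolution while a slow metric simply re-emits the cached value until the next real collection.

```go
package collector

import "time"

// slowMetric caches its last value and only re-runs the expensive read
// when its own interval has elapsed.
type slowMetric struct {
	every     time.Duration
	lastValue int64
	nextReal  time.Time
	read      func() int64 // the expensive collection, e.g. a heavy SQL query
}

// value is called once per 1s agent cycle but triggers a real collection
// only every `every`; in between it just repeats the cached value.
func (m *slowMetric) value(now time.Time) int64 {
	if !now.Before(m.nextReal) {
		m.lastValue = m.read()
		m.nextReal = now.Add(m.every)
	}
	return m.lastValue // charts keep showing this until the next real sample
}
```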