shyamvalsan opened this issue 2 years ago
cc: @ilyam8 @ktsaou @ralphm @cakrit @amalkov @sashwathn
@ktsaou @stelfrag @ilyam8 following on from @ilyam8's comment https://github.com/netdata/netdata/issues/13639#issuecomment-1240433894
Every hour, and especially every 24 hours, will not work.
Every 24hr doesn't work in general. A few notes about the current Netdata:
- Charts appear only after the 2nd data collection.
- Metrics can't be collected at an arbitrary time: collections are aligned to the interval, so with a 1hr interval we have to start at 0x:00 (we can't start at 0x:33).
- As a result, with a 1hr interval the worst case is waiting up to 1hr for the first collection and then another hour for the 2nd one, before the chart appears (see the sketch below).
- I can see how 1 minute, or probably 5 minutes at most, can work, but not above that.
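To make that worst case concrete, here is a minimal Go sketch of the delay described above; it assumes collections are aligned to update_every and that a chart needs two samples before it appears (the names and the start time are purely illustrative):

```go
package main

import (
	"fmt"
	"time"
)

// chartVisibleAt estimates when a chart would first appear, assuming
// collections are aligned to update_every and the chart needs 2 samples.
func chartVisibleAt(start time.Time, updateEvery time.Duration) time.Time {
	firstSample := start.Truncate(updateEvery).Add(updateEvery) // next aligned slot
	return firstSample.Add(updateEvery)                         // 2nd sample -> chart appears
}

func main() {
	start := time.Date(2022, 9, 8, 10, 33, 0, 0, time.UTC) // agent starts at 10:33
	fmt.Println(chartVisibleAt(start, time.Hour))           // 12:00 UTC, almost 1.5h later
}
```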
How can we improve Netdata to remove this limitation? Users should be able to select a collection interval of their choosing - many metrics make more sense collected every 5min/1hr/daily/etc.
If the first 2 intervals being empty is the problem, can we not collect more frequently for the first couple of intervals and then settle into the longer collection interval?
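As a purely hypothetical sketch of that idea (not existing Netdata behavior), a scheduler could hand out a short warm-up interval until two samples exist and only then switch to the configured long interval:

```go
package schedule

import "time"

// nextInterval returns a short warm-up interval until the chart has its two
// required samples, then the configured long interval. Hypothetical sketch.
func nextInterval(samplesCollected int, warmup, configured time.Duration) time.Duration {
	if samplesCollected < 2 {
		return warmup // e.g. 1s or 5s, just to get the chart on screen quickly
	}
	return configured // e.g. 1h once the chart exists
}
```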
The data collection intervals Netdata currently uses are too aggressive in many cases. I'm just looking at the PostgreSQL collector as an example, but I believe the same will be true for other collectors as well. If we want to implement something like this generic SQL collector, this will be relevant too.
cc: @cakrit @amalkov @sashwathn @ralphm
The idea is good, and I agree that we need flexibility. The question is: when do we need to do it, and how long will it take? Lower granularity (longer intervals) means that we are outside of troubleshooting mode; it is more about monitoring and reporting...
We need to understand the effort first.
To clarify this a bit: the issue is not so much that all metrics have to share the same granularity (we can set specific collection frequencies on external collectors - for example, the Postfix queue collector defaults to a 3 second collection interval, and a handful of others default to 5 or 10 seconds); it's that the current chart handling means that collecting on long intervals leads to unexpected behavior from the user's perspective.
It's worth noting that a couple of specific collectors indirectly work around the issues we have with long collection cycles by having an external program log data at the desired collection frequency, and then having Netdata poll that log at its own frequency. The SMART collector does this, for example: it relies on smartd running and logging data at the desired actual collection frequency, with the Agent then polling the log file at a different frequency. I don't know that that's practical for all metrics that would need a long collection cycle, but it might be worth considering integrating similar functionality into the plugin API directly (IOW, allow a collector to just log the metrics at whatever frequency it wants, and then the Agent just polls the log at the global collection frequency).
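A minimal sketch of that log-and-poll pattern, assuming a hypothetical log file to which an external program appends one value per line at its own long interval, while the collector simply re-reads the most recent line on every agent cycle:

```go
package poller

import (
	"bufio"
	"os"
	"strconv"
)

// lastLoggedValue returns the most recent value appended to the log file by
// an external program running at its own (long) collection interval.
func lastLoggedValue(path string) (float64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	var last string
	sc := bufio.NewScanner(f)
	for sc.Scan() { // keep only the last non-empty line
		if line := sc.Text(); line != "" {
			last = line
		}
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	return strconv.ParseFloat(last, 64)
}
```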
Guys, it is a non-fixable problem in general, at least for counters (incremental values in Netdata). You need 2 samples to calculate the delta (or rate), so the minimum time would be update_every, whatever it is (e.g. 1hr).
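Put differently (illustrative only, not the actual Netdata internals), a counter rate is a delta over elapsed time, so the first point simply cannot exist before the second sample:

```go
package rate

import "time"

// perSecondRate needs both a previous and a current sample; with
// update_every = 1h the first chart point is therefore up to 2h away.
func perSecondRate(prev, cur uint64, elapsed time.Duration) float64 {
	if elapsed <= 0 || cur < prev {
		return 0 // counter reset or bad interval
	}
	return float64(cur-prev) / elapsed.Seconds()
}
```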
If I am not mistaken, the same goes for gauges (absolute values in Netdata): the chart shows the first value only after 2 data collections. That looks like a bug to me - we don't calculate a delta, so there is no need to wait for the 2nd sample.
I believe there is a workaround for that (can't say if it works now or not) - the store_first chart option (CTRL-F store_first).
There is another feature that contributes to the problem - all data collections have to be aligned to update_every. So, if update_every is 1hr, the first collection has to wait for the next aligned boundary (0x:00), as described above.
> IOW, allow a collector to just log the metrics at whatever frequency it wants, and then the Agent just polls the log at the global collection frequency
What we do now is:
Pros:
Cons:
Problem
There are many metrics which are important to the user and which Netdata should be collecting, but for which a granularity of 1s does not make sense. Certain metrics make sense at a 30s/60s/300s granularity, for example, while other metrics do indeed provide more value at 1s granularity.
Currently there is a limitation that all metrics need to be collected at the same granularity, which causes a problem in the above scenario.
Description
Netdata collectors should support collecting different metrics at different granularities. This will allow Netdata to collect a wider range of critical metrics without compromising on high-fidelity data collection for metrics where 1s collection matters, OR on performance for metrics where 60s or 300s collection is enough.
An example is bloat metrics for Postgres tables and indexes - these are too heavy/intensive to be collected every second, but they provide a lot of value and would be very useful collected at a lower granularity (e.g. every 5 minutes), while keeping other Postgres metrics at 1s granularity.
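A minimal sketch of what per-metric granularity could look like inside a single collector, with hypothetical names (this is not the real go.d PostgreSQL collector): cheap metrics are queried every cycle, expensive ones only when their own interval has elapsed.

```go
package collector

import "time"

// metricGroup bundles metrics that share a collection interval.
type metricGroup struct {
	every   time.Duration           // e.g. 300 * time.Second for bloat metrics
	lastRun time.Time
	collect func() map[string]int64 // the expensive query, e.g. table/index bloat
}

// maybeCollect runs the group's query only when its own interval has elapsed;
// otherwise the previously reported values are kept for this cycle.
func (g *metricGroup) maybeCollect(now time.Time, out map[string]int64) {
	if now.Sub(g.lastRun) < g.every {
		return // skip the heavy query this cycle
	}
	g.lastRun = now
	for k, v := range g.collect() {
		out[k] = v
	}
}
```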
Importance
must have
Value proposition
Proposed implementation
For metrics with a granularity larger than 1s, an option would be to report the 'last collected value' to the cloud until the next collection. This means that even though the metric is collected only every X seconds, the charts on the frontend do not need to change - they will just show the same value for the duration of the granularity period.
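A minimal sketch of that "repeat the last collected value" idea, again with hypothetical names: the chart keeps its 1s resolution while a slow metric simply re-emits the cached value until the next real collection.

```go
package collector

import "time"

// slowMetric caches its last value and only re-runs the expensive read
// when its own interval has elapsed.
type slowMetric struct {
	every     time.Duration
	lastValue int64
	nextReal  time.Time
	read      func() int64 // the expensive collection, e.g. a heavy SQL query
}

// value is called once per 1s agent cycle but triggers a real collection
// only every `every`; in between it just repeats the cached value.
func (m *slowMetric) value(now time.Time) int64 {
	if !now.Before(m.nextReal) {
		m.lastValue = m.read()
		m.nextReal = now.Add(m.every)
	}
	return m.lastValue // charts keep showing this until the next real sample
}
```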