grafana / grafana-zabbix

Zabbix plugin for Grafana dashboard

Fix missing points in data source. #1109

Closed: yurtesen closed this issue 3 years ago

yurtesen commented 3 years ago

Describe the bug
When Zabbix uses discard/throttle preprocessing, not all data points are available. Grafana naively connects the dots, and this produces a badly drawn graph. Not that Zabbix itself gets it right; Zabbix does it wrong too. But that is not the point 😄

Expected behavior
Grafana's data sources for MySQL/PostgreSQL etc. already have a solution in place: they fill missing points with a user-defined value. In the Zabbix discard case, the plugin should simply fill in the previous value. https://github.com/grafana/grafana/issues/12756
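
For illustration, here is a minimal sketch of how that looks in the Grafana PostgreSQL data source (the table and column names are hypothetical); the third argument of the $__timeGroup macro selects the fill value for empty intervals, and previous carries the last value forward:

```sql
-- Hypothetical table metrics(ts timestamptz, value float8).
-- The third $__timeGroup argument sets the fill mode for empty intervals:
-- a number, NULL, or previous (repeat the last seen value).
SELECT
  $__timeGroup(ts, '1m', previous) AS "time",
  avg(value) AS "temperature"
FROM metrics
WHERE $__timeFilter(ts)
GROUP BY 1
ORDER BY 1
```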

Screenshots
Above is Zabbix, and below is Grafana. [screenshot]

Software versions

| Grafana | Zabbix | Grafana-Zabbix Plugin |
|---------|--------|-----------------------|
| 7.3.4   | 5.0.5  | 4.0.2                 |

Can this be fixed in this plugin?

i-ky commented 3 years ago

Quoting the Zabbix documentation:

If a value is discarded, it is not saved in the database and Zabbix server has no knowledge that this value was received. No trigger expressions will be evaluated, as a result, no problems for related triggers will be created/resolved. Trigger functions will work only based on data that is actually saved in database. As trends are built based on data in the database, if there is no value saved for an hour then there will also be no trends data for that hour.

Or, as they say, "no martini, no party". Why do these "missing" values need to be plotted? And how would the plugin know that they ever existed? Compared to the MySQL/PostgreSQL plugins, there aren't even NULLs to replace in Zabbix.

For the record, there is a Zabbix issue on the topic.

P.S. I think the throttling feature is flawed from inception and should not be used. Zabbix should instead grow support for time series databases. Most of them provide lossless data compression out of the box: no extra configuration required on the user's side, and no magic tricks required on the side of the visualization tool.

jedd commented 3 years ago

Zabbix should instead grow support for time series databases.

FYI https://www.zabbix.com/documentation/current/manual/appendix/install/timescaledb

No promises on the "no extra configuration required on the user's side" though. : )

i-ky commented 3 years ago

It is still better to configure TimescaleDB compression once than to configure throttling for every item... :)

yurtesen commented 3 years ago

@i-ky throttling is a feature to deal with high-frequency monitoring and solves (tries to solve?) a different problem than compression. Without discard, if you are collecting data every 1 ms you still need to write 1000 values per second to the database before you can compress the data. In addition, this data needs to travel to your database. For example, if we are monitoring thousands of devices in a remote network, we can set up a proxy in that network which discards unnecessary value updates before sending them to the Zabbix server. Compression will not solve these problems, will it?

About the Zabbix issue ticket on the topic of graphs: I am aware of it, and yes, it is a tricky problem to resolve...

I am already using Zabbix with TimescaleDB on PostgreSQL. Wouldn't it be possible to fetch data directly from PostgreSQL, leveraging the gap-filling feature of the Grafana PostgreSQL data source?

i-ky commented 3 years ago

@yurtesen, in my understanding the only thing that was (still is?) holding back "high-frequency monitoring" was database size. Compression solves this problem. Compression can also solve the high network traffic issue.

I think that Zabbix throttling is completely unfit for purpose. It is like a video codec that only compresses the stream if it is a still image. If your values have the slightest variation, Zabbix won't be able to discard anything. Compression algorithms, however, will do a pretty good job even in this case. After all, we are streaming 4K video on demand these days.

Yes, it is a tricky issue to resolve. And I believe it should be resolved on the Zabbix side. Throttling makes the life of every third-party visualization and data analysis tool a bit trickier.

yurtesen commented 3 years ago

@i-ky It is an interesting example, as video compression often employs some sort of discard mechanism to avoid sending unchanged parts of an image. So you are saying that what Zabbix does is a form of compression.

I understand your point, and the issue might need to be fixed on the Zabbix side. But the fix is probably not undoing the discard feature. Zabbix already supports TimescaleDB and compression, and discarding values has many other benefits (like saving processing cycles on triggers and expression evaluations). So I doubt they will revert that feature.

But let's assume the Zabbix API should deal with figuring out the missing values related to the discard feature. Independently of that, shouldn't users be allowed to decide how the graph behaves when there are missing values in the time series?

For example, what about the direct DB connection of this plugin? Could it be adapted, via a user setting, to use https://docs.timescale.com/latest/api#locf independently of what the Zabbix API does?
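
To make the idea concrete, a direct-DB query could look roughly like this. This is only a sketch, assuming Zabbix's standard history table (itemid, clock as Unix time, value) and a made-up itemid and time range:

```sql
-- time_bucket_gapfill() emits a row for every 60-second bucket in the
-- requested range; locf() carries the last seen aggregate forward into
-- buckets that have no rows (e.g. because the values were discarded).
SELECT
  time_bucket_gapfill(60, clock) AS bucket,
  locf(avg(value)) AS value
FROM history
WHERE itemid = 12345                           -- hypothetical item
  AND clock BETWEEN 1600000000 AND 1600003600  -- gapfill needs an explicit range
GROUP BY bucket
ORDER BY bucket;
```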

i-ky commented 3 years ago

Well, throttling has its strengths. It works more or less adequately for textual data, and it absolutely shines with low-level discovery (LLD). LLD is a database-intensive process, and discarding unchanged information lets you increase the frequency of LLD rule checks without increasing the burden on the database. Thus you achieve low-latency LLD, which is very good for dynamic environments. (I prefer the term "low latency" to "high frequency"; it better describes the goal.)

But throttling is not well suited for numeric data, which needs to be aggregated in trends, interpolated, and plotted on graphs. I hope Zabbix will eventually acknowledge that and restrict throttling to text data. Or users will learn that numeric data should not be discarded. I don't think you can save a lot of cycles on trigger evaluation. I'm not even mentioning that trigger functions like avg() or count() are messed up by throttling just as badly as graphs are. I would definitely not want to account for discarded values in my trigger expressions; I would want avg() to be a real average and count() to be a real count.

To be honest, I don't understand why Grafana's Staircase display option doesn't solve the problem. It also seems to be available in Zabbix. Doesn't Staircase apply the same logic as locf?

yurtesen commented 3 years ago

@i-ky no doubt discard helps LLD save some CPU cycles, but LLD actually requires no database access: Zabbix has an in-memory configuration cache which it can check against. The discard feature has nothing to do with LLD whatsoever.

Just because a value is numeric does not automatically mean that it will keep changing. I have hosts with 10k+ monitored values, and more than half of those never change; it could be an application counter for code that is never executed. Compression won't help, because when I collect 10k items from 1k hosts, I would still need to write 10 million values per interval to the database. They can then be compressed in the database to save space, but who really cares about that space; disk space is cheap and plentiful. The real gains are in the processing power saved by discarding unchanged values.

About Staircase: it is not the correct representation, e.g. when I have a value like temperature. Staircase and locf do not apply the same logic; Staircase introduces extra points.

|                      | t1 | t2 | t3 | t4   | t5   |
|----------------------|----|----|----|------|------|
| raw data             | 1  |    |    | 3    | 4    |
| locf                 | 1  | 1  | 1  | 3    | 4    |
| staircase (I think?) | 1  | 1  | 1  | 1, 3 | 3, 4 |

Sure, for the missing-data part it works rather well and gives the same result, but it breaks the representation of the existing data by adding extra values.

yurtesen commented 3 years ago

Compared to the MySQL/PostgreSQL plugins, there aren't even NULLs to replace in Zabbix.

Just as a note: the gap-filling feature of the Grafana PostgreSQL/MySQL data sources does NOT work by replacing NULL values either. https://grafana.com/docs/grafana/latest/datasources/postgres/ The missing values are filled with 0, NULL, or the previous point, depending on the setting.

borjasueiro commented 3 years ago

Duplicate of #388.

yurtesen commented 3 years ago

@borjasueiro is it? It is a very old issue ticket, and the description is in Russian. If it is a duplicate, then in his last message @alexanderzobnin says a TSDB should solve it. Zabbix now supports TimescaleDB, so the plugin could possibly use the PostgreSQL data source to query the data directly using the TimescaleDB functionality.

i-ky commented 3 years ago

In https://github.com/alexanderzobnin/grafana-zabbix/issues/388#issuecomment-297429238 Alexander says that it is not easy to know where the gaps/missing values are, because the Zabbix API does not provide such information. In https://github.com/alexanderzobnin/grafana-zabbix/issues/388#issuecomment-535809221 he says that to solve the problem Zabbix should store null values. My speculation is that this would provide information about when checks were performed but returned no data.

i-ky commented 3 years ago

Staircase introduces extra points.

Sorry, I do not understand. Staircase is how Grafana connects data points. In your example it would draw a horizontal line at 1 from t1 to t4, then a vertical line from 1 to 3 at t4, then a horizontal line at 3 from t4 to t5, then a vertical line from 3 to 4 at t5 and so on.

yurtesen commented 3 years ago

@i-ky First of all, that other ticket is more like a distant cousin from the past, dealing with data that is actually missing. The discard-unchanged feature was added in Zabbix 4.2, which was released in 2019, two years after that ticket was created.

The original poster in #388 mentions Grafana's NULL value conversion option (from the screenshot in the ticket), so I understand why @alexanderzobnin thinks Zabbix should output NULL values where data is missing, for that specific case. However, things have changed quite a bit since 2017. Grafana now has an additional option for data sources to fill missing data, and that feature works regardless of whether NULL values exist. Actually, other time series databases can also return empty values when there is no data; this is why they have fill functions. But anyway...

About the solution to the problem, there are a few ways to go:

1. The plugin could fill in the missing points itself (e.g. with the previous value).
2. The direct DB connection could leverage TimescaleDB's gap-filling functions such as locf().

About the staircase: if you draw a staircase, you would have two data points at t4 and t5.

Here is an example; you can count the points for yourself. Staircase: 7 points. [screenshot] Same data, non-staircase: 4 points. [screenshot] The fun part: between 8:42 and 8:45 there were no data points missing at a 1-minute interval. Do you now see the difference?

i-ky commented 3 years ago

@yurtesen, sorry, I wasn't aware that Staircase draws points where there are no real data points.

Your first suggestion is reasonable, but you also need to find a way to distinguish cases where Zabbix is discarding values from cases where no values are being collected due to maintenance, network issues, etc. I honestly don't understand why such complexity needs to be implemented (and maintained) in a third-party visualization tool when Zabbix itself hasn't been able to implement this solution for six months now.

Your second solution will not work for everyone: there are users of MySQL, Oracle, and even PostgreSQL without TimescaleDB. I also have a feeling that locf is not the best solution. According to the examples in this blog post, a gap is filled with the latest aggregated value, which will give you wrong results:

| time bucket | actual values | values in Zabbix DB | true average | locf(avg(...)) |
|-------------|---------------|---------------------|--------------|----------------|
| t1          | 1, 3          | 1, 3                | 2            | 2              |
| t2          | 3, 3, ...     | discarded           | 3            | 2 (wrong)      |
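
In query form, the pitfall looks like this; a self-contained sketch with made-up literal values rather than a real Zabbix table:

```sql
-- Bucket t1 = [0, 60) holds the values 1 and 3; every value in bucket
-- t2 = [60, 120) was discarded by throttling, so t2 has no rows and
-- locf() repeats t1's aggregate instead of the discarded values' average.
SELECT
  time_bucket_gapfill(60, t, 0, 119) AS bucket,
  locf(avg(v)) AS avg_v
FROM (VALUES (10, 1.0), (50, 3.0)) AS samples(t, v)
GROUP BY bucket
ORDER BY bucket;
-- bucket | avg_v
--      0 |     2   <- true average of 1 and 3
--     60 |     2   <- carried forward; the true average would have been 3
```
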
yurtesen commented 3 years ago

@i-ky When data is compressed, it may sometimes be acceptable that some information is lost. Losing the reason why there was a gap is one of those cases, because in most cases the data will be missing due to discard anyway. For items where this is not an acceptable compromise, the user should not enable the discard functionality. The Grafana PostgreSQL/MySQL gap-fill algorithm also does not detect or care why there is a gap in the data. So the solution does not have to be complicated, and there is nothing to fix on the Zabbix side; it would be an additional feature if Zabbix did it in their API.

https://blog.timescale.com/blog/sql-functions-for-time-series-analysis/ Did you read it? I think this one line from the article explains it all:

Use locf if you assume your measurement changes only when you've received new data.

The article is about aggregating values using time buckets and handling missing data. It says nothing about locf() using aggregate values. I already posted a link to the locf() API manual a few posts back; please have a look at that.

yurtesen commented 3 years ago

Yes, the PostgreSQL solution is not universal, but there is no rule saying: if the solution is not universal, do not implement it.

There is a switch for TimescaleDB support in Grafana's PostgreSQL data source, so people who do not use TimescaleDB can't use its features. (The feature is implemented even though not everybody uses TimescaleDB.)

Sure enough, this feature does NOT exist for the MySQL data source, yet it is implemented for PostgreSQL. So features do get implemented even though not every database supports them. No problem there...

i-ky commented 3 years ago

When data is compressed, it may sometimes be acceptable that some information is lost.

No, lossless compression is a thing, otherwise file archivers wouldn't make much sense.

For items where this is not an acceptable compromise, the user should not enable the discard functionality.

In my view, "user should" and "user should not" is a design flaw. Good design must not allow the user to select an option that does not work.

Did you read it?

Yes, I did.

It says nothing about locf() using aggregate values. I already posted a link to the locf() API manual a few posts back.

I did read it too. It says: "The locf function (last observation carried forward) allows you to carry the last seen value in an aggregation group forward. It can only be used in an aggregation query with time_bucket_gapfill." Based on this, I assume it carries forward the aggregate value of the previous time bucket.

Yes, the PostgreSQL solution is not universal, but there is no rule saying: if the solution is not universal, do not implement it.

Indeed, there is no such rule. But there are always cost vs. benefit considerations, especially in projects maintained by small teams and individuals.

yurtesen commented 3 years ago

@i-ky I do not know whether you are joking with me now or not. Your response is totally unrelated.

alexanderzobnin commented 3 years ago

Actually, #1082 fixes this. Grafana now requires points to be aligned (for stacked graphs to work properly), so that issue contains changes which align points according to the item update interval and fill missing points with null values.

alexanderzobnin commented 3 years ago

Data alignment

Before:

[screenshot]

After:

[screenshot]

Filling gaps

Before:

[screenshot]

After:

[screenshot]

After (null as zero):

[screenshot]

yurtesen commented 3 years ago

@alexanderzobnin thanks for the response. I had a quick look at the PR you mentioned: https://github.com/alexanderzobnin/grafana-zabbix/pull/1099/commits/bcf5b8284a939fcd1ac48dc2f6ffe749ec7476cd#diff-6e6b81fb2305c02d4b43fdd8b33cc8fa912b31578833ed6dfc6a04b192e35505 It looks like you are using null for missing values. Can it be improved to use the last value as an option?