sni / lmd

Livestatus Multitool Daemon - Create livestatus federation from multiple sources
https://labs.consol.de/omd/packages/lmd/
GNU General Public License v3.0

Full update and delta update issue #127

Closed: jdumalaonITRS closed this issue 1 year ago

jdumalaonITRS commented 1 year ago

We have an issue and confirmation regarding recent updates to LMD.

Issue: Setting UpdateInterval and FullUpdateInterval = 1 in the LMD config causes some tables to not be updated. The issue occurs in https://github.com/sni/lmd/blob/9cfc2f589c8a31cd7f7238d5d196a8f2cb0b5fd6/lmd/peer.go#L388, where the call to data.UpdateDelta(lastUpdate, now) is never reached when the peer status is PeerStatusUp. It seems that, due to the subsecond updates and this configuration, only data.UpdateFull(Objects.UpdateTables) gets executed.

To our surprise this causes some of the columns to not be updated.

We traced it to https://github.com/sni/lmd/blob/9cfc2f589c8a31cd7f7238d5d196a8f2cb0b5fd6/lmd/datastoreset.go#L654, where some tables are skipped during a full update.

The workaround is to set the intervals to different values, e.g. UpdateInterval = 1, FullUpdateInterval = 2.
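The starvation described above, and why the workaround helps, can be modelled with a small simulation. This is only a sketch, not LMD's code: `chooseUpdate` and `simulate` are hypothetical names, times are whole seconds, and the model assumes that a due full update takes precedence over the delta update, which matches the observed behaviour but is not taken from the linked source.

```go
package main

import "fmt"

// updateKind names the update a tick runs in this simplified model.
type updateKind string

const (
	fullUpdate  updateKind = "full"
	deltaUpdate updateKind = "delta"
	noUpdate    updateKind = "none"
)

// chooseUpdate is a hypothetical stand-in for the scheduling decision.
// Assumption: a due full update takes precedence over the delta update.
// All times are in whole seconds.
func chooseUpdate(now, lastUpdate, lastFull, updateInterval, fullInterval int64) updateKind {
	if fullInterval > 0 && now >= lastFull+fullInterval {
		return fullUpdate
	}
	if now >= lastUpdate+updateInterval {
		return deltaUpdate
	}
	return noUpdate
}

// simulate ticks once per second and counts how often each update kind runs.
func simulate(updateInterval, fullInterval, seconds int64) map[updateKind]int {
	counts := map[updateKind]int{}
	var lastUpdate, lastFull int64
	for now := int64(1); now <= seconds; now++ {
		kind := chooseUpdate(now, lastUpdate, lastFull, updateInterval, fullInterval)
		switch kind {
		case fullUpdate:
			// A full update also counts as the most recent update.
			lastFull, lastUpdate = now, now
		case deltaUpdate:
			lastUpdate = now
		}
		counts[kind]++
	}
	return counts
}

func main() {
	fmt.Println("equal intervals:", simulate(1, 1, 10))
	fmt.Println("workaround:     ", simulate(1, 2, 10))
}
```

In this model, equal intervals make the full update due on every tick, so the delta branch is never reached; with UpdateInterval = 1 and FullUpdateInterval = 2 the two alternate.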

Confirmation: Is it expected that data.UpdateFull(Objects.UpdateTables) does not update all tables? LMD configuration documentation seems to imply that it does:

    # Refresh remote sites every x seconds.
    # Fast updates are ok, only changed hosts and services get fetched
    # and once every `FullUpdateInterval` everything gets updated.
    UpdateInterval = 7

    # Run a full update on all objects every x seconds. Set to zero to turn off
    # completely. This is usually not required and only needed if for uncommon
    # reasons some updates slip through the normal delta updates.
    FullUpdateInterval = 600

sni commented 1 year ago

tbh, I never tried to set such low values here. But I guess it would be good to have some kind of sanity check to prevent setting those values too low. The data update uses a 1-second offset to avoid issues while synchronizing the current second. It seems to miss all of them when the update interval is set to 1 second.

Besides that, the FullUpdateInterval is basically a catch-all and it should not be necessary to run it that often. If a lot of objects miss the normal update then this might be a sign that something else does not work as expected. I would suggest setting FullUpdateInterval to at least 30-60 seconds.
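A sanity check along these lines might look like the sketch below. It is purely illustrative: the `intervals` type, the `checkIntervals` function and the exact rules are assumptions, not part of LMD. The 30-second floor is taken from the suggestion above and is reported as a warning rather than an error.

```go
package main

import "fmt"

// intervals mirrors the two config values discussed here (hypothetical type).
type intervals struct {
	UpdateInterval     int64 // seconds between delta updates
	FullUpdateInterval int64 // seconds between full updates, 0 turns them off
}

// suggestedMinFullUpdateInterval is an assumption based on the
// "at least 30-60 seconds" suggestion in this thread.
const suggestedMinFullUpdateInterval = 30

// checkIntervals returns an error for settings that cannot work in this
// sketch and a warning for settings that are merely discouraged.
func checkIntervals(c intervals) (warning string, err error) {
	if c.UpdateInterval < 1 {
		return "", fmt.Errorf("UpdateInterval must be at least 1 second, got %d", c.UpdateInterval)
	}
	if c.FullUpdateInterval < 0 {
		return "", fmt.Errorf("FullUpdateInterval must not be negative, got %d", c.FullUpdateInterval)
	}
	if c.FullUpdateInterval == 0 {
		return "", nil // full updates are turned off
	}
	if c.FullUpdateInterval <= c.UpdateInterval {
		return "", fmt.Errorf("FullUpdateInterval (%d) must be greater than UpdateInterval (%d)",
			c.FullUpdateInterval, c.UpdateInterval)
	}
	if c.FullUpdateInterval < suggestedMinFullUpdateInterval {
		return fmt.Sprintf("FullUpdateInterval %d is below the suggested minimum of %d seconds",
			c.FullUpdateInterval, suggestedMinFullUpdateInterval), nil
	}
	return "", nil
}

func main() {
	for _, c := range []intervals{{1, 1}, {1, 2}, {7, 600}, {7, 0}} {
		warning, err := checkIntervals(c)
		fmt.Printf("%+v warning=%q err=%v\n", c, warning, err)
	}
}
```

Rejecting only equal or inverted intervals keeps test setups such as UpdateInterval = 1, FullUpdateInterval = 2 usable, while still flagging them as lower than suggested.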

jdumalaonITRS commented 1 year ago

I understand that the values are low and will probably be changed in the future. This was actually just for test purposes and I doubt that people will actually set the values that low.

As further background, we had UpdateInterval and FullUpdateInterval = 1 for a long time until the 2.0.9 update (subsecond timestamps), so it used to work. I guess the subsecond timestamp handling has tighter tolerances.

jdumalaonITRS commented 1 year ago

Just a follow-up on our confirmation: Is there any reason why full updates only handle tables with dynamic columns, while delta updates also include the ones without?

Full update https://github.com/sni/lmd/blob/9cfc2f589c8a31cd7f7238d5d196a8f2cb0b5fd6/lmd/datastoreset.go#L645

Delta update https://github.com/sni/lmd/blob/9cfc2f589c8a31cd7f7238d5d196a8f2cb0b5fd6/lmd/datastoreset.go#L161

In our check, the comments and downtimes tables (having no dynamic columns) are skipped over in full updates and handled in the delta update.

sni commented 1 year ago

Comments and downtimes are a bit special, because the number of items changes over time. With normal objects, the number of objects stays the same until the core reloads, and only dynamic columns change during normal operation.
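The distinction can be illustrated with a small sketch. It is not LMD's implementation: the `column` and `table` types, `splitForFullUpdate`, and the column lists are hypothetical and heavily simplified. Tables with at least one dynamic column get their values refreshed in place, while tables without any are skipped and would need row-level handling because rows come and go.

```go
package main

import "fmt"

// column is a hypothetical, simplified column description.
type column struct {
	Name    string
	Dynamic bool // true if the value can change while the core is running
}

// table is a hypothetical, simplified table description.
type table struct {
	Name    string
	Columns []column
}

// hasDynamicColumns reports whether any column of the table is dynamic.
func hasDynamicColumns(t table) bool {
	for _, c := range t.Columns {
		if c.Dynamic {
			return true
		}
	}
	return false
}

// splitForFullUpdate separates tables whose dynamic columns would be
// refreshed in place from tables that are skipped in this model.
func splitForFullUpdate(tables []table) (refreshed, skipped []string) {
	for _, t := range tables {
		if hasDynamicColumns(t) {
			refreshed = append(refreshed, t.Name)
		} else {
			skipped = append(skipped, t.Name)
		}
	}
	return refreshed, skipped
}

func main() {
	// Illustrative column sets only, not the real table definitions.
	tables := []table{
		{Name: "hosts", Columns: []column{{"name", false}, {"state", true}}},
		{Name: "services", Columns: []column{{"description", false}, {"state", true}}},
		{Name: "comments", Columns: []column{{"id", false}, {"comment", false}}},
		{Name: "downtimes", Columns: []column{{"id", false}, {"start_time", false}}},
	}
	refreshed, skipped := splitForFullUpdate(tables)
	fmt.Println("refreshed in place:", refreshed)
	fmt.Println("skipped:", skipped)
}
```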

jdumalaonITRS commented 1 year ago

Thanks. I understand.