* Fix moving total messages number to invalid position.

bsmrs commented 6 years ago

fix #28

dirtyren commented 6 years ago

@sni can you check this pull request? It fixed a really nasty bug for us at customers with large configurations. Thanks.

sni commented 6 years ago

Could you explain a bit what went wrong and how this change fixes it?

dirtyren commented 6 years ago

Hey @sni, no problem.

This problem only happened to us on huge configurations, 25k services and 80k services. We detected that a call to

tablelog->handleNewMessage(this, since, until, logclasses); // memory management

in LogFile.cc would leave the map typedef'd as _entries_t pointing to an invalid pointer. The check if (++_num_cached_messages <= _max_cached_messages) was moving the counter to a nonexistent number of cached messages and somehow affected the entries. I am going to be honest with you: although this solved the problem and the log queries are working flawlessly, I don't know why this was happening. Livestatus would segfault on this line

char *p = (char *)shiftPointer(data) in OffsetStringColumn.cc.

I hope this clarifies what we did. We spent two weeks on this problem, and we almost gave up and moved the log feature to a PHP web service reading the archives.

[]s.

sni commented 6 years ago

So, by replacing ++_num_cached_messages with _num_cached_messages+1, the counter is in fact never increased and always stays at zero. This might solve your issue, but basically it just sets "max_cached_messages" to an infinite number and disables cleaning the cache. Luckily the cache is flushed on log rotation, so that did work somehow. I will look into this. Until then, it should help to set max_cached_messages high enough to hold at least the logfiles for a single day.
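
To illustrate the point above, here is a minimal sketch of the two forms of the check. It is not the actual LogFile.cc code: the struct `CacheCounter`, its two methods, and the helper `countAccepted` are hypothetical names made up for this example, and only the two counter identifiers are taken from the thread.

```cpp
#include <cstddef>

// Hypothetical stand-in for the cache bookkeeping discussed in this thread.
// The member names mirror the quoted identifiers; everything else is assumed.
struct CacheCounter {
    std::size_t num_cached_messages = 0;
    std::size_t max_cached_messages = 3;

    // Original form: the pre-increment updates the counter on every call,
    // so the limit is eventually exceeded and cleanup would be triggered.
    bool acceptWithIncrement() {
        return ++num_cached_messages <= max_cached_messages;
    }

    // Patched form: the comparison uses a temporary value, the stored
    // counter never changes, so the check passes on every call.
    bool acceptWithoutIncrement() {
        return num_cached_messages + 1 <= max_cached_messages;
    }
};

// Counts how many of `calls` messages pass the check for either form.
inline std::size_t countAccepted(bool patched, std::size_t calls) {
    CacheCounter c;
    std::size_t accepted = 0;
    for (std::size_t i = 0; i < calls; ++i) {
        bool ok = patched ? c.acceptWithoutIncrement()
                          : c.acceptWithIncrement();
        if (ok) ++accepted;
    }
    return accepted;
}
```

With a limit of 3 and 10 calls, the original form accepts only the first 3 messages, while the patched form accepts all 10 because the stored counter stays at zero. That is why the patch behaves like an unlimited cache.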

dirtyren commented 6 years ago

I know what you mean. I also could not pinpoint why this solved the problem. If you want access to the environment where this problem occurs without the patch, drop me a private message. Thanks.

sni commented 1 year ago

Closing this one, I think it has been fixed with #103.

naemon / naemon-livestatus

* Fix moving total messages number to invalid position. #29