Closed bsmrs closed 1 year ago
@sni can you check this pull request? It fixed a really nasty bug for us at customers with large configurations. Thanks.
Could you explain a bit what went wrong and how this change fixes it?
Hey @sni, no problem.
This problem only happened to us on huge configurations, 25k services and 80k services. We detected that a call to
tablelog->handleNewMessage(this, since, until, logclasses); // memory management
in LogFile.cc would leave the map behind the typedef _entries_t pointing to an invalid pointer. The check if (++_num_cached_messages <= _max_cached_messages) was moving to a nonexistent number of cached messages and somehow affected the entries. I am going to be honest with you: although this solved the problem and the log queries are working flawlessly, I don't know why this was happening. Livestatus would segfault on this line
char *p = (char *)shiftPointer(data) on OffsetStringColumn.cc.
I hope this clarifies what we did. We spent two weeks on this problem, and we almost gave up and moved the log feature to a PHP web service reading the archives.
[]s.
So, by replacing ++_num_cached_messages with _num_cached_messages+1, the counter is never actually increased and always stays at zero. This might solve your issue, but basically it just sets "max_cached_messages" to an infinite number and disables cleaning the cache. Luckily the cache is flushed on log rotation, so that did work somehow.
I will look into this... till then, it should help to set max_cached_messages to hold at least the logfiles for a single day.
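To illustrate the point about the counter, here is a minimal, self-contained sketch of the difference between the two comparisons. It is not the Livestatus source: the struct, the function names, and the idea that a failed check corresponds to a cleanup run are assumptions made for the example; only the two expressions come from this thread.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the cache bookkeeping. The member names mirror
// the ones quoted in this thread, but this is not the real LogFile.cc code.
struct CacheCounter {
    std::size_t num_cached_messages = 0;
    std::size_t max_cached_messages = 0;

    // Original form: pre-increment stores the new count, then compares it.
    bool acceptWithPreIncrement() {
        return ++num_cached_messages <= max_cached_messages;
    }

    // Patched form: compares count + 1 but never stores the result,
    // so num_cached_messages stays at its initial value.
    bool acceptWithPlusOne() {
        return num_cached_messages + 1 <= max_cached_messages;
    }
};

// Feed n messages through one of the two checks and count how often the
// check fails (assumed here to be the moment cleanup would be triggered).
std::size_t rejectedAfter(std::size_t n, std::size_t limit, bool patched) {
    CacheCounter c;
    c.max_cached_messages = limit;
    std::size_t rejected = 0;
    for (std::size_t i = 0; i < n; ++i) {
        bool ok = patched ? c.acceptWithPlusOne() : c.acceptWithPreIncrement();
        if (!ok) ++rejected;
    }
    return rejected;
}
```

With a limit of 3 and 10 messages, rejectedAfter(10, 3, false) reports 7 failed checks, while rejectedAfter(10, 3, true) reports 0, because the patched comparison always sees a count of zero. That matches the observation that the patch effectively removes the limit as long as it is at least 1.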
I know what you mean; I also could not pinpoint why this solved the problem. If you want access to the environment where this problem occurs without the patch, drop me a private message. Thanks.
Closing this one, I think it has been fixed with #103
fix #28