Closed OhmSpectator closed 1 week ago
I'm running ztests for the PR now, and let's see if it has the same issues as before.
@eriknordmark, I looked for the first logs I see from the sc-supermicro-zc2 device, and at least I do not see the "strtoul: No error information" errors I saw previously, and which I wanted to fix with this PR. Nevertheless, I still see the messages related to #4002. So, this PR fixes some issues, but not the signal one...
@eriknordmark, I looked for the first logs I see from the sc-supermicro-zc2 device, and at least I do not see the "strtoul: No error information" errors I saw previously, and which I wanted to fix with this PR. Nevertheless, I still see the messages related to #4002. So, this PR fixes some issues, but not the signal one...
OK, but that means that we are still causing more SIGUSR2 than before the memory monitor (and we have some EVE+golang issue with handling signals around exec.Run) So it would make sense to up the threshold for the memory monitor for now so it doesn't trigger SIGUSR2 and that will give us two weeks to track down that is going on with signal handling.
@eriknordmark from the logs, I see that the trigger of the memory handler is a "memory pressure" event generated by the kernel. Currently, I use the "medium" pressure level as a trigger. I can set it to "critical". Maybe it's even a proper solution. See the cgroups doc: https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt#:~:text=The%20%22critical%22%20level%20means%20that%20the%20system%20is%20actively%20thrashing%2C%20it%20is
@eriknordmark, I changed the pressure level to "critical" by adding an extra commit to this PR. We can test whether it works as a workaround for now.
@OhmSpectator what about replacing all %s
with '%s'
in the log messages so that we see this issue easier in the future?
@OhmSpectator what about replacing all
%s
with'%s'
in the log messages so that we see this issue easier in the future?
How would that change the log output?
Makes sense to update the PR description based on the added commits.
done
@OhmSpectator what about replacing all
%s
with'%s'
in the log messages so that we see this issue easier in the future?How would that change the log output?
We saw Invalid value: 335339520
in the logs and were wondering why it is invalid - it is, because the string was "335339520\n", but I guess syslog()
removed the trailing newline.
This pull request includes several enhancements and fixes for the memory-monitor component to improve error handling, robustness, and maintainability of the code.
The main change adjusts the pressure level for the EVE cgroup from "medium" to "critical". This modification aims to reduce the frequency of triggers during system memory reclamation by focusing on more significant memory pressure situations. Another important update ensures that the errno value is preserved during the execution of the SIGHUP signal handler, maintaining the correctness of error handling by storing and restoring errno in non-terminating signal handlers.
Additionally, direct calls to the
strtoul
function have been replaced with a more robust wrapper,strtoudec
, in thecgroups.c
andconfig.c
modules, which provides better error handling and improves code robustness. Lastly, thestrtodec
andstrtoudec
functions have been modified to handle input strings ending with a newline character, ensuring that such strings are considered valid, a common case when reading from files or the console.These changes collectively enhance the memory-monitor component by improving its ability to handle memory pressure situations more effectively and providing more reliable string-to-number conversions.