lf-edge / eve

EVE is Edge Virtualization Engine
https://www.lfedge.org/projects/eve/
Apache License 2.0
468 stars 158 forks source link

Memory Monitor: Adjust pressure level, preserve errno, and improve string handling #4012

Closed OhmSpectator closed 1 week ago

OhmSpectator commented 1 week ago

This pull request includes several enhancements and fixes for the memory-monitor component to improve error handling, robustness, and maintainability of the code.

The main change adjusts the pressure level for the EVE cgroup from "medium" to "critical". This modification aims to reduce the frequency of triggers during system memory reclamation by focusing on more significant memory pressure situations. Another important update ensures that the errno value is preserved during the execution of the SIGHUP signal handler, maintaining the correctness of error handling by storing and restoring errno in non-terminating signal handlers.

Additionally, direct calls to the strtoul function have been replaced with a more robust wrapper, strtoudec, in the cgroups.c and config.c modules, which provides better error handling and improves code robustness. Lastly, the strtodec and strtoudec functions have been modified to handle input strings ending with a newline character, ensuring that such strings are considered valid, a common case when reading from files or the console.

These changes collectively enhance the memory-monitor component by improving its ability to handle memory pressure situations more effectively and providing more reliable string-to-number conversions.

OhmSpectator commented 1 week ago

I'm running ztests for the PR now, and let's see if it has the same issues as before.

OhmSpectator commented 1 week ago

@eriknordmark, I looked for the first logs I see from the sc-supermicro-zc2 device, and at least I do not see the "strtoul: No error information" errors I saw previously, and which I wanted to fix with this PR. Nevertheless, I still see the messages related to #4002. So, this PR fixes some issues, but not the signal one...

eriknordmark commented 1 week ago

@eriknordmark, I looked for the first logs I see from the sc-supermicro-zc2 device, and at least I do not see the "strtoul: No error information" errors I saw previously, and which I wanted to fix with this PR. Nevertheless, I still see the messages related to #4002. So, this PR fixes some issues, but not the signal one...

OK, but that means that we are still causing more SIGUSR2 than before the memory monitor (and we have some EVE+golang issue with handling signals around exec.Run) So it would make sense to up the threshold for the memory monitor for now so it doesn't trigger SIGUSR2 and that will give us two weeks to track down that is going on with signal handling.

OhmSpectator commented 1 week ago

@eriknordmark from the logs, I see that the trigger of the memory handler is a "memory pressure" event generated by the kernel. Currently, I use the "medium" pressure level as a trigger. I can set it to "critical". Maybe it's even a proper solution. See the cgroups doc: https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt#:~:text=The%20%22critical%22%20level%20means%20that%20the%20system%20is%20actively%20thrashing%2C%20it%20is

OhmSpectator commented 1 week ago

@eriknordmark, I changed the pressure level to "critical" by adding an extra commit to this PR. We can test whether it works as a workaround for now.

christoph-zededa commented 1 week ago

@OhmSpectator what about replacing all %s with '%s' in the log messages so that we see this issue easier in the future?

eriknordmark commented 1 week ago

@OhmSpectator what about replacing all %s with '%s' in the log messages so that we see this issue easier in the future?

How would that change the log output?

OhmSpectator commented 1 week ago

Makes sense to update the PR description based on the added commits.

done

christoph-zededa commented 1 week ago

@OhmSpectator what about replacing all %s with '%s' in the log messages so that we see this issue easier in the future?

How would that change the log output?

We saw Invalid value: 335339520 in the logs and were wondering why it is invalid - it is, because the string was "335339520\n", but I guess syslog() removed the trailing newline.