Thanks for the report! What's the earlyoom version?
Arch Package, version 1.7.
This looks like a bug in earlyoom. That huge "cluster" process exiting in 0.1 seconds seems implausible (and indeed, its memory was not freed).
What is this "cluster" process and how can I have it for testing?
So I found a case where earlyoom thinks the process has exited but it actually has not: the main thread exits while subthreads keep running. This may be what is happening here.
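A minimal sketch of how that state can arise (my own illustration using pthreads, not something taken from the earlyoom sources): the main thread calls pthread_exit() while a worker thread keeps running, so the group leader shows up as a zombie in /proc even though the process as a whole is still alive and still holds its memory.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Worker thread that keeps the process alive after main() is gone. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;)
        sleep(1);
    return NULL;
}

int main(void)
{
    pthread_t tid;

    printf("pid=%d, look at /proc/%d/stat in a second\n", getpid(), getpid());
    pthread_create(&tid, NULL, worker, NULL);

    /* Terminate only the main thread; the thread group keeps running,
     * but the group leader is now reported as 'Z' (zombie). */
    pthread_exit(NULL);
}
```

Compile with `gcc -pthread`, run it in the background, and check the process state while it is still running.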
Hey, sorry for the late answer, I forgot about this.
The behavior was very hard to reproduce with that setup, which is part of the reason I didn't link it immediately. It's also hard to recreate that binary and its workflow: it's a freshly built Rust binary reading ~50 GB of CSV.
Here is a link to the repo in case it's really required, but I kind of doubt it's worth the effort for you.
If you actually want to try and reproduce this behavior with that repo, let me know, and I'll look at the code again for gotchas and check whether it completely documents the workflow that led to the issue.
You determined this happens when the main thread exits early? I did not know that could happen, but it would make sense with that binary, and it would explain why it's hard to reproduce.
I fixed the "zombie main thread" case; it's now merged to master. Could you check whether you still see this problem with latest master (i.e. compile the latest source code yourself)?
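For reference, a rough sketch of the kind of check that catches this case (an illustration of the idea only, not a copy of the merged change): if the group leader's state is Z but /proc/&lt;pid&gt;/status still reports more than one thread, the process is in fact alive.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if <pid> looks like a dead main thread with live subthreads:
 * the leader's state is 'Z' but the thread group still has other members. */
static int zombie_leader_with_threads(int pid)
{
    char path[64], line[256], state = '?';
    long threads = 0;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/status", pid);
    f = fopen(path, "r");
    if (!f)
        return 0;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "State:", 6) == 0)
            sscanf(line + 6, " %c", &state);
        else if (strncmp(line, "Threads:", 8) == 0)
            sscanf(line + 8, "%ld", &threads);
    }
    fclose(f);
    return state == 'Z' && threads > 1;
}

int main(int argc, char *argv[])
{
    if (argc == 2)
        printf("%d\n", zombie_leader_with_threads(atoi(argv[1])));
    return 0;
}
```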
Sure, I gave it a try: earlyoom compiled fine, and I've been unable to reproduce the failure. However, it should be noted that the program seems to behave differently now (and, as noted, it was never reliably reproducible to begin with), and since I'm mobile right now I can't repeat the test often, as that would kill my battery. I'm also not sure whether a dependency of my program got updated under the hood, which might have changed its behavior.
I could also run your test cases, if that'd be helpful? Otherwise I consider this solved; you went way beyond what I'd have been able to do with my comments ;-P
Thanks!
I could also run your test cases, if that'd be helpful?
Always a good idea but I don't expect anything interesting.
I'll close this for now, please report back if you still see the problem!
This is probably connected to https://github.com/rfjakob/earlyoom/issues/284, and it is not reliably repeatable. But when running earlyoom to test the use case that made me look into tools like it in the first place, I found that it occasionally kills a few more processes than should be required.
Here is a log of it happening (at about 14:00) and not happening (at about 14:45):
Interestingly, they are connected: these processes belong to VS Code and rust-analyzer (the LSP server for Rust, which was running for the killed "cluster" process). And it wasn't fatal; VS Code recovered after a warning about the kill.
I'm not sure where their high badness score comes from, nor why the memory wasn't freed immediately after killing the root cause. The VmRSS noted in the logs is minuscule, and when observing the processes separately they don't take up enough memory to be problematic during the spike.
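In case it helps anyone poke at this, here is a small standalone sketch (my own hypothetical helper, not part of earlyoom) that prints the kernel's oom_score and the VmRSS line for a given pid, which is roughly the per-process information that shows up in the logs above:

```c
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    char path[64], line[256];
    long score;
    FILE *f;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    /* The kernel's view of how attractive the process is as an OOM victim. */
    snprintf(path, sizeof(path), "/proc/%s/oom_score", argv[1]);
    f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%ld", &score) == 1)
            printf("oom_score: %ld\n", score);
        fclose(f);
    }

    /* Resident set size as reported by the kernel, e.g. "VmRSS: 1234 kB". */
    snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);
    f = fopen(path, "r");
    if (f) {
        while (fgets(line, sizeof(line), f))
            if (strncmp(line, "VmRSS:", 6) == 0)
                fputs(line, stdout);
        fclose(f);
    }
    return 0;
}
```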
This was done on kernel 6.7.4-arch1-1, if it's helpful. Feel free to close this. If there are no questions from you, and no problematic situations happen to me in the future, I will not put more time into this on my own. I just wanted to report it for completeness.