Thanks for the report! What's the earlyoom version?
Arch Package, version 1.7.
This looks like a bug in earlyoom. That huge "cluster" process exiting in 0.1 seconds seems implausible (and indeed, its memory was not freed).
What is this "cluster" process and how can I have it for testing?
So I found a case where earlyoom thinks the process has exited but it actually has not: the main thread exits while subthreads keep running. This may be what is happening here.
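A minimal sketch of how that state can arise (my own illustration using pthreads, not something taken from the earlyoom sources): the main thread calls pthread_exit() while a worker thread keeps running, so the group leader shows up as a zombie in /proc even though the process as a whole is still alive and still holds its memory.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Worker thread that keeps the process alive after main() is gone. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;)
        sleep(1);
    return NULL;
}

int main(void)
{
    pthread_t tid;

    printf("pid=%d, look at /proc/%d/stat in a second\n", getpid(), getpid());
    pthread_create(&tid, NULL, worker, NULL);

    /* Terminate only the main thread; the thread group keeps running,
     * but the group leader is now reported as 'Z' (zombie). */
    pthread_exit(NULL);
}
```

Compile with `gcc -pthread`, run it in the background, and check the process state while it is still running.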
Hey, sorry for the late answer, I forgot about this.
The behavior was very hard to reproduce with that setup, which is part of the reason I didn't link it immediately. It's also hard to recreate that binary and its workflow: it's a freshly built Rust binary reading ~50 GB of CSV.
Here is a link to the repo in case it's really required, but I kind of doubt it's worth the effort for you.
If you actually want to try and reproduce this behavior with that repo, let me know, and I'll look at the code again for gotchas and check whether it completely documents the workflow that led to the issue.
You determined this happens when the main thread exits early? I did not know that could happen, but it would make sense with that binary, and it would explain why it's hard to reproduce.
I fixed the "zombie main thread" case; it's now merged to master. Could you check whether you still see this problem with latest master (i.e. compile the latest source code yourself)?
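For reference, a rough sketch of the kind of check that catches this case (an illustration of the idea only, not a copy of the merged change): if the group leader's state is Z but /proc/&lt;pid&gt;/status still reports more than one thread, the process is in fact alive.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if <pid> looks like a dead main thread with live subthreads:
 * the leader's state is 'Z' but the thread group still has other members. */
static int zombie_leader_with_threads(int pid)
{
    char path[64], line[256], state = '?';
    long threads = 0;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/status", pid);
    f = fopen(path, "r");
    if (!f)
        return 0;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "State:", 6) == 0)
            sscanf(line + 6, " %c", &state);
        else if (strncmp(line, "Threads:", 8) == 0)
            sscanf(line + 8, "%ld", &threads);
    }
    fclose(f);
    return state == 'Z' && threads > 1;
}

int main(int argc, char *argv[])
{
    if (argc == 2)
        printf("%d\n", zombie_leader_with_threads(atoi(argv[1])));
    return 0;
}
```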
Sure, I gave it a try: earlyoom compiled fine, and I've been unable to reproduce the failure. However, it should be noted that the program seems to behave differently now (and, as noted, it was never reliably reproducible to begin with), and since I'm mobile right now I can't repeat the test often, as that would kill my battery. I'm also not sure whether a dependency of my program got updated under the hood, which might have changed its behavior.
I could also run your test cases, if that'd be helpful? Otherwise I consider this solved; you went way beyond what I'd have been able to do with my comments ;-P
Thanks!
I could also run your test cases, if that'd be helpful?
Always a good idea but I don't expect anything interesting.
I'll close this for now, please report back if you still see the problem!
This is probably connected to https://github.com/rfjakob/earlyoom/issues/284, and it is not reliably repeatable. But when running earlyoom to test the use case that made me look into tools like it in the first place, I found that it occasionally kills a few more processes than should be required.
Here is a log of it happening (at about 14:00) and not happening (at about 14:45):
Interestingly, they are connected: these processes belong to VS Code and rust-analyzer (the LSP server for Rust, which was running for the killed "cluster" process). And it wasn't fatal; VS Code recovered after a warning about the kill.
I'm not sure where their high badness score comes from, nor why the memory wasn't freed immediately after killing the root cause. The VmRSS noted in the logs is minuscule, and when observing the processes separately they don't take up enough memory to be problematic during the spike.
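In case it helps anyone poke at this, here is a small standalone sketch (my own hypothetical helper, not part of earlyoom) that prints the kernel's oom_score and the VmRSS line for a given pid, which is roughly the per-process information that shows up in the logs above:

```c
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    char path[64], line[256];
    long score;
    FILE *f;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    /* The kernel's view of how attractive the process is as an OOM victim. */
    snprintf(path, sizeof(path), "/proc/%s/oom_score", argv[1]);
    f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%ld", &score) == 1)
            printf("oom_score: %ld\n", score);
        fclose(f);
    }

    /* Resident set size as reported by the kernel, e.g. "VmRSS: 1234 kB". */
    snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);
    f = fopen(path, "r");
    if (f) {
        while (fgets(line, sizeof(line), f))
            if (strncmp(line, "VmRSS:", 6) == 0)
                fputs(line, stdout);
        fclose(f);
    }
    return 0;
}
```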
This was done on kernel 6.7.4-arch1-1, if it's helpful. Feel free to close this. If there are no questions from you, and no problematic situations happen to me in the future, I will not put more time into this on my own. I just wanted to report it for completeness.