rfjakob / earlyoom

earlyoom - Early OOM Daemon for Linux
MIT License

earlyoom crashes instead of restart? #321

Open quirinmanz opened 2 months ago

quirinmanz commented 2 months ago

Hello,

First of all, thanks for the great tool. It has prevented a lot of crashes for us.

Recently, we ran into the following problem running earlyoom v1.7-41-g90f1a67:

Jul 18 08:23:17 server.name earlyoom[2608931]: earlyoom v1.7-41-g90f1a67
Jul 18 08:23:17 server.name earlyoom[2608931]: mem total: 773635 MiB, user mem total: 760464 MiB, swap total: 2047 MiB
Jul 18 08:23:17 server.name earlyoom[2608931]: sending SIGTERM when mem <= 10.00% and swap <= 10.00%,
Jul 18 08:23:17 server.name earlyoom[2608931]:         SIGKILL when mem <=  5.00% and swap <=  5.00%
Jul 18 08:23:17 server.name earlyoom[2608931]: mem avail: 55499 of 760640 MiB ( 7.30%), swap free:    0 of 2047 MiB ( 0.00%)
Jul 18 08:23:17 server.name earlyoom[2608931]: low memory! at or below SIGTERM limits: mem 10.00%, swap 10.00%
Jul 18 08:23:17 server.name earlyoom[2608931]: sending SIGTERM to process 2595764 uid 0 "cellxgene": badness 711, VmRSS 52562 MiB
Jul 18 08:23:17 server.name earlyoom[2608931]: process 2595764 cmdline "command here"
Jul 18 08:23:18 server.name systemd[1]: earlyoom.service: Main process exited, code=dumped, status=31/SYS
Jul 18 08:23:18 server.name systemd[1]: earlyoom.service: Failed with result 'core-dump'.
Jul 18 08:23:18 server.name systemd[1]: earlyoom.service: Consumed 1.152s CPU time.
Jul 18 08:23:18 server.name systemd[1]: earlyoom.service: Scheduled restart job, restart counter is at 6.
Jul 18 08:23:18 server.name systemd[1]: Stopped Early OOM Daemon.
Jul 18 08:23:18 server.name systemd[1]: earlyoom.service: Consumed 1.152s CPU time.
Jul 18 08:23:18 server.name systemd[1]: earlyoom.service: Start request repeated too quickly.
Jul 18 08:23:18 server.name systemd[1]: earlyoom.service: Failed with result 'core-dump'.
Jul 18 08:23:18 server.name systemd[1]: Failed to start Early OOM Daemon.

From our perspective, it looks like SIGTERM failed (maybe the process didn't respond?), and at some point the earlyoom restarts failed as well. Setting the SIGKILL threshold higher could work around this, but that is not really how it is intended to be used, right? Do you have any idea what could be going on here?

Best, Quirin
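
To make the threshold lines in the log above easier to read, here is a minimal Python sketch of the decision they describe: SIGTERM when both memory and swap are at or below 10%, SIGKILL when both are at or below 5%. The function name `choose_signal` and its parameters are hypothetical and only mirror the wording of the log; earlyoom itself is written in C and its actual logic may differ in detail.

```python
from typing import Optional, Tuple


def choose_signal(
    mem_avail_pct: float,
    swap_free_pct: float,
    term_limits: Tuple[float, float] = (10.0, 10.0),  # "SIGTERM when mem <= 10.00% and swap <= 10.00%"
    kill_limits: Tuple[float, float] = (5.0, 5.0),    # "SIGKILL when mem <= 5.00% and swap <= 5.00%"
) -> Optional[str]:
    """Return the signal the logged thresholds call for, or None if memory is fine.

    Hypothetical helper: it only restates the two threshold lines from the log.
    """
    # The stricter limit is checked first; both memory AND swap must be low.
    if mem_avail_pct <= kill_limits[0] and swap_free_pct <= kill_limits[1]:
        return "SIGKILL"
    if mem_avail_pct <= term_limits[0] and swap_free_pct <= term_limits[1]:
        return "SIGTERM"
    return None


# Values from the log: mem avail 7.30%, swap free 0.00%
print(choose_signal(7.30, 0.00))
```

With the logged values (7.30% memory available, 0.00% swap free) this selects SIGTERM, which matches the "sending SIGTERM to process 2595764" line. Memory was still above the 5% SIGKILL limit, so by this reading the daemon stopped right after a SIGTERM decision rather than in the SIGKILL path.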

rfjakob commented 2 months ago

Hi, that's a bug in the earlyoom.service file. It was fixed by https://github.com/rfjakob/earlyoom/commit/c171b72ba217e923551bdde7e7f00ec5a0488b54 and released in earlyoom v1.8.2.
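
For anyone checking whether their own install is affected, the following is a rough Python sketch, not an official tool. It parses a `git describe`-style version string such as the `v1.7-41-g90f1a67` from the report and compares its base tag with v1.8.2, and it also pulls the `code=`/`status=` fields out of the systemd "Main process exited" line. The helper names (`parse_describe`, `predates_fix`, `parse_exit`) are made up for illustration. As background, `status=31/SYS` means the process was terminated by signal 31, which is SIGSYS on most Linux architectures; that signal is commonly delivered when a seccomp filter (for example systemd's `SystemCallFilter=`) rejects a system call, which would be consistent with a service-file problem. The linked commit remains the authoritative description of the fix.

```python
import re

FIXED_IN = (1, 8, 2)  # release named in the reply above


def parse_describe(version):
    """Split e.g. 'v1.7-41-g90f1a67' into ((1, 7), 41, '90f1a67').

    Assumes `git describe` style: <tag>[-<commits since tag>-g<short sha>].
    """
    m = re.fullmatch(r"v?(\d+(?:\.\d+)*)(?:-(\d+)-g([0-9a-f]+))?", version.strip())
    if m is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    tag = tuple(int(part) for part in m.group(1).split("."))
    commits = int(m.group(2)) if m.group(2) else 0
    return tag, commits, m.group(3)


def _pad(parts, width):
    """Right-pad a version tuple with zeros so (1, 7) compares like (1, 7, 0)."""
    return parts + (0,) * (width - len(parts))


def predates_fix(version, fixed_in=FIXED_IN):
    """True if the base tag of `version` is older than `fixed_in`.

    Conservative: a development build just before the fixed release (say,
    a few commits after the previous tag) is reported as predating the fix
    even if it already contains the commit.
    """
    tag, _commits, _sha = parse_describe(version)
    width = max(len(tag), len(fixed_in))
    return _pad(tag, width) < _pad(tuple(fixed_in), width)


def parse_exit(line):
    """Extract (code, status number, status name) from a systemd exit line."""
    m = re.search(r"code=(\w+), status=(\d+)/(\w+)", line)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), m.group(3)


print(predates_fix("v1.7-41-g90f1a67"))
print(parse_exit("earlyoom.service: Main process exited, code=dumped, status=31/SYS"))
```

For the version in the report, the base tag is v1.7, so `predates_fix` returns `True`: that build is older than the release carrying the fix. Upgrading to v1.8.2 or later, or applying the service-file change from the linked commit, is the path the reply points to.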