mrmekon / circadian

Suspend-On-Idle Daemon for GNU/Linux Power Management
GNU General Public License v3.0
254 stars 16 forks source link

segfaults every 5 seconds #28

Open Bulmer opened 7 months ago

Bulmer commented 7 months ago

I am running Debian sid (Linux 6.7.9-amd64) and getting these errors every 5 seconds.

Apr 09 18:30:21 backups kernel: w[55548]: segfault at 4c ip 000055658bbff09e sp 00007ffe495bb270 error 4 in w[55658bbfe000+3000] likely on CPU 2 (core 2, socket 0)
Apr 09 18:30:16 backups kernel: Code: 0f be c3 48 8b 12 f6 44 42 01 40 74 7f 80 fb 20 74 7a 0f be fb 48 83 c5 01 e8 6e f1 ff ff 49 39 ee 0f 84 7d 00 00 00 45 89 e7 <0f> b6 5d 00 84 db 75 ba 45 85 ff 75 65 48 8b 35 76 3f 00 00 bf 2d
Apr 09 18:30:16 backups kernel: w[55495]: segfault at 4c ip 000055f96ea4509e sp 00007fff0e223ec0 error 4 in w[55f96ea44000+3000] likely on CPU 0 (core 0, socket 0)
Apr 09 18:30:11 backups kernel: Code: 0f be c3 48 8b 12 f6 44 42 01 40 74 7f 80 fb 20 74 7a 0f be fb 48 83 c5 01 e8 6e f1 ff ff 49 39 ee 0f 84 7d 00 00 00 45 89 e7 <0f> b6 5d 00 84 db 75 ba 45 85 ff 75 65 48 8b 35 76 3f 00 00 bf 2d
Apr 09 18:30:11 backups kernel: w[55450]: segfault at 4c ip 00005620f0d4a09e sp 00007fff067d9410 error 4 in w[5620f0d49000+3000] likely on CPU 1 (core 1, socket 0)
Apr 09 18:30:11 backups circadian[55443]: Configuration valid.  Idle detection starting.
Apr 09 18:30:11 backups circadian[55443]: Error setting auto-wake timer: Auto-wake not enabled.
bittorf commented 3 months ago

Same here:

bastian@ryzen:~$ sudo dmesg -c
[  420.599084] w[3941]: segfault at 4c ip 0000562d27e7409e sp 00007ffd7dccfd80 error 4 in w[562d27e73000+3000] likely on CPU 3 (core 3, socket 0)
[  420.599106] Code: 0f be c3 48 8b 12 f6 44 42 01 40 74 7f 80 fb 20 74 7a 0f be fb 48 83 c5 01 e8 6e f1 ff ff 49 39 ee 0f 84 7d 00 00 00 45 89 e7 <0f> b6 5d 00 84 db 75 ba 45 85 ff 75 65 48 8b 35 76 3f 00 00 bf 2d
[  420.623234] w[3946]: segfault at 4c ip 0000557627ab209e sp 00007ffd420bf770 error 4 in w[557627ab1000+3000] likely on CPU 6 (core 2, socket 0)
[  420.623255] Code: 0f be c3 48 8b 12 f6 44 42 01 40 74 7f 80 fb 20 74 7a 0f be fb 48 83 c5 01 e8 6e f1 ff ff 49 39 ee 0f 84 7d 00 00 00 45 89 e7 <0f> b6 5d 00 84 db 75 ba 45 85 ff 75 65 48 8b 35 76 3f 00 00 bf 2d
[  420.662887] w[3952]: segfault at 4c ip 000056436296509e sp 00007ffc5e4b9880 error 4 in w[564362964000+3000] likely on CPU 3 (core 3, socket 0)
[  420.662897] Code: 0f be c3 48 8b 12 f6 44 42 01 40 74 7f 80 fb 20 74 7a 0f be fb 48 83 c5 01 e8 6e f1 ff ff 49 39 ee 0f 84 7d 00 00 00 45 89 e7 <0f> b6 5d 00 84 db 75 ba 45 85 ff 75 65 48 8b 35 76 3f 00 00 bf 2d

bastian@ryzen:~$ sudo circadian -t
Reading config from file: /etc/circadian.conf
Circadian launching.
CircadianConfig { verbosity: 0, idle_time: 600, auto_wake: None, on_idle: Some("systemctl hybrid-sleep"), on_wake: Some("date >>wake_up_from_circadian_sleep.log.txt"), tty_input: true, x11_input: true, ssh_block: false, smb_block: false, nfs_block: false, audio_block: true, max_cpu_load: Some(0.9), process_block: ["^dd$", "^rsync$", "^apt-get$", "^dpkg$", "^cp$", "^mv$", "^restic$", "^mpv$"] }
Got --test: running idle test and exiting.
Idle Detection Summary:
w*              : cannot parse integer from empty string
xssstate*       : 0
xprintidle*     : 0
Wake block      : 600
TTY (combined)* : 4294967295
X11 (combined)* : 0
Idle (min)      : 0
Idle target     : 600
Until idle      : 600
IDLE?           : false
CPU load*       : false
SSH             : false
SMB             : false
NFS             : false
Audio*          : true
Processes*      : false
BLOCKED?        : true

bastian@ryzen:~$ circadian --version
circadian 0.8.4

bastian@ryzen:~$ uname -a
Linux ryzen 6.9.10-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.9.10-1 (2024-07-19) x86_64 GNU/Linux

bastian@ryzen:~$ cat /etc/*release*
PRETTY_NAME="Debian GNU/Linux trixie/sid"
NAME="Debian GNU/Linux"
VERSION_CODENAME=trixie
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
mrmekon commented 3 months ago

So both of you are getting segfaults in the w application that circadian calls. You can set tty_input = no in the circadian config to tell it not to use w, but that's just a workaround for the system being broken... is Debian really shipping a broken w? What happens if you just run w in a terminal as root?

sghael commented 3 months ago

w on Ubuntu 24.04 returns info just fine. Something else may be failing in the parsing. My outputs


Reading config from file: /etc/circadian.conf
Circadian launching.
CircadianConfig { verbosity: 0, idle_time: 900, auto_wake: None, on_idle: Some("systemctl suspend"), on_wake: Some("/home/sss/.scripts/cuda_reset.bash"), tty_input: true, x11_input: true, ssh_block: false, smb_block: false, nfs_block: false, audio_block: true, max_cpu_load: Some(0.5), process_block: ["^dd$", "^rsync$", "^apt-get$", "^dpkg$", "^cp$", "^mv$", "^poddl$", "^borg$"] }
Got --test: running idle test and exiting.
WARNING: xssstate failed for socket /tmp/.X11-unix/X0 with error: No such file or directory (os error 2)
WARNING: xssstate failed for socket /tmp/.X11-unix/X0 with error: No such file or directory (os error 2)
WARNING: xprintidle failed for socket /tmp/.X11-unix/X0 with error: No such file or directory (os error 2)
WARNING: xprintidle failed for socket /tmp/.X11-unix/X0 with error: No such file or directory (os error 2)
Idle Detection Summary:
w*              : cannot parse integer from empty string
xssstate*       : No displays found.
xprintidle*     : No displays found.
Wake block      : 900
TTY (combined)* : 4294967295
X11 (combined)* : 4294967295
Idle (min)      : 4294967295
Idle target     : 900
Until idle      : 0
IDLE?           : false
CPU load*       : false
SSH             : No such file or directory (os error 2)
SMB             : No such file or directory (os error 2)
NFS             : No such file or directory (os error 2)
Audio*          : false
Processes*      : false
BLOCKED?        : false

❯ sudo w
 16:49:48 up 1 day,  5:39,  3 users,  load average: 0.00, 0.02, 0.00
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
gdm      tty1     -                Wed11   24:43m 44.19s  0.02s /usr/libexec/gsd-printer
sss            10.10.10.80      16:39   29:38m  0.00s  0.01s sshd: sss [priv]
sss            10.10.10.80      16:48   29:38m  0.00s  0.01s sshd: sss [priv]````
mrmekon commented 3 months ago

@sghael Are you also getting segfaults in your dmesg log? Does coredumpctl list show w for any of the entries?

In the two dmesg logs above, it clearly shows that w is crashing ("w[55548]: segfault..."), which is what leads Circadian to fail to parse the output... there is no output.

If w is crashing when Circadian launches it and not when launched from the shell, it could be because of a difference in arguments or a difference in execution environment. Circadian launches it with a variety of different arguments: w -us, w -hus, w -us CIRCADIAN_FAKEUSER, w -usf CIRCADIAN_FAKEUSER, w -hus, or w -husf. All of these should run without crashing.

Can everyone who experiences this please post your OS version (cat /etc/issue), w version (w --version), and ideally dmesg log? If w is listed in your coredumpctl entries, it could also be helpful to post the result of coredumpctl info w.

sghael commented 2 months ago

───────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────
       │ File: /etc/issue
───────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1   │ Ubuntu 24.04.1 LTS \n \l
   2   │
───────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────
❯ w --version
w from procps-ng 4.0.4
❯ coredumpctl info w
           PID: 8362 (w)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: Fri 2024-08-30 11:22:21 EDT (54s ago)
  Command Line: w -us
    Executable: /usr/bin/w
 Control Group: /user.slice/user-1000.slice/session-18.scope
          Unit: session-18.scope
         Slice: user-1000.slice
       Session: 18
     Owner UID: 1000 (sghael)
       Boot ID: b7fa120af98d47ffbe2d4cc-----
    Machine ID: 6b40ab502b904fe2aa17aa-----
      Hostname: neural
       Storage: /var/lib/systemd/coredump/core.w.0.b7fa120af98d47ffbe2d4cc26055ce5f.8362.1725031341000000.zst (inacc>
       Message: Process 8362 (w) of user 0 dumped core.

                Module libzstd.so.1 from deb libzstd-1.5.5+dfsg2-2build1.1.amd64
                Module libsystemd.so.0 from deb systemd-255.4-1ubuntu8.4.amd64
                Stack trace of thread 8362:
                #0  0x000062c249d9ee73 n/a (w + 0x2e73)
                #1  0x000062c249d9f161 n/a (w + 0x3161)
                #2  0x000062c249da00e0 n/a (w + 0x40e0)
                #3  0x000062c249d9ea8f n/a (w + 0x2a8f)
                #4  0x000073e2ea22a1ca __libc_start_call_main (libc.so.6 + 0x2a1ca)
                #5  0x000073e2ea22a28b __libc_start_main_impl (libc.so.6 + 0x2a28b)
                #6  0x000062c249d9ed25 n/a (w + 0x2d25)
                ELF object binary architecture: AMD x86-64

           PID: 8380 (w)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: Fri 2024-08-30 11:22:21 EDT (54s ago)
  Command Line: w -hus
    Executable: /usr/bin/w
 Control Group: /user.slice/user-1000.slice/session-18.scope
          Unit: session-18.scope
         Slice: user-1000.slice
       Session: 18
     Owner UID: 1000 (sghael)
       Boot ID: b7fa120af98d47ffbe2d4cc260-----
    Machine ID: 6b40ab502b904fe2aa17aa5f1ac-----
      Hostname: neural
       Storage: /var/lib/systemd/coredump/core.w.0.b7fa120af98d47ffbe2d4cc26055ce5f.8380.1725031341000000.zst (inacc>
       Message: Process 8380 (w) of user 0 dumped core.

                Module libzstd.so.1 from deb libzstd-1.5.5+dfsg2-2build1.1.amd64
                Module libsystemd.so.0 from deb systemd-255.4-1ubuntu8.4.amd64
                Stack trace of thread 8380:
                #0  0x00005deb31925e73 n/a (w + 0x2e73)
                #1  0x00005deb31926161 n/a (w + 0x3161)
                #2  0x00005deb319270e0 n/a (w + 0x40e0)
                #3  0x00005deb31925a8f n/a (w + 0x2a8f)
                #4  0x000077132962a1ca __libc_start_call_main (libc.so.6 + 0x2a1ca)
                #5  0x000077132962a28b __libc_start_main_impl (libc.so.6 + 0x2a28b)
                #6  0x00005deb31925d25 n/a (w + 0x2d25)
                ELF object binary architecture: AMD x86-64

           PID: 8403 (w)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: Fri 2024-08-30 11:22:21 EDT (54s ago)
  Command Line: w -hus
    Executable: /usr/bin/w
 Control Group: /user.slice/user-1000.slice/session-18.scope
          Unit: session-18.scope
         Slice: user-1000.slice
       Session: 18
     Owner UID: 1000 (sghael)
       Boot ID: b7fa120af98d47ffbe2d4cc260-----
    Machine ID: 6b40ab502b904fe2aa17aa5f1ac-----
      Hostname: neural
       Storage: /var/lib/systemd/coredump/core.w.0.b7fa120af98d47ffbe2d4cc26055ce5f.8403.1725031341000000.zst (inacc>
       Message: Process 8403 (w) of user 0 dumped core.

                Module libzstd.so.1 from deb libzstd-1.5.5+dfsg2-2build1.1.amd64
                Module libsystemd.so.0 from deb systemd-255.4-1ubuntu8.4.amd64
                Stack trace of thread 8403:
                #0  0x0000598636812e73 n/a (w + 0x2e73)
                #1  0x0000598636813161 n/a (w + 0x3161)
                #2  0x00005986368140e0 n/a (w + 0x40e0)
                #3  0x0000598636812a8f n/a (w + 0x2a8f)
                #4  0x00007d2cda82a1ca __libc_start_call_main (libc.so.6 + 0x2a1ca)
                #5  0x00007d2cda82a28b __libc_start_main_impl (libc.so.6 + 0x2a28b)
                #6  0x0000598636812d25 n/a (w + 0x2d25)
                ELF object binary architecture: AMD x86-64```
sghael commented 2 months ago

this seems like a potential problem https://github.com/mrmekon/circadian/blob/5dd06665d38c70f3f100d253e98be3c82e453efc/src/main.rs#L518

on my Ubuntu 24.04:


───────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────
       │ File: /etc/issue
───────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1   │ Ubuntu 24.04.1 LTS \n \l
   2   │
───────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────
❯ w --version
w from procps-ng 4.0.4
❯ w -hus
zsh: segmentation fault (core dumped)  w -hus
sghael commented 2 months ago

interesting:

❯ w -h
sghael            10.10.10.199     09:49    1:37m  0.00s  0.02s sshd: sghael [priv]
gdm      tty1     -                10:01    7:23m  4.93s  0.01s /usr/libexec/gsd-printer

❯ w -u
 11:38:44 up  3:23,  2 users,  load average: 0.05, 0.05, 0.01
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
sghael            10.10.10.199     09:49    1:37m  0.00s  0.02s sshd: sghael [priv]
gdm      tty1     -                10:01    7:23m  4.93s  0.01s /usr/libexec/gsd-printer

❯ w -s
 11:38:47 up  3:23,  2 users,  load average: 0.04, 0.05, 0.01
USER     TTY      FROM              IDLE WHAT
zsh: segmentation fault (core dumped)  w -s
mrmekon commented 2 months ago

Oh wow, that's very helpful, thanks!

It looks like Ubuntu and Debian are shipping with a w that segfaults on the --short option?? That's extremely unexpected... that should just be hiding fields.

Also unusual because mine (Arch) reports the same version, but shows different default fields and doesn't segfault:

$ w --version
w from procps-ng 4.0.4
$ w -u
 17:44:25 up 68 days,  2:51,  2 users,  load average: 1.02, 1.07, 0.92
USER     TTY       LOGIN@   IDLE   JCPU   PCPU  WHAT
trevor   tty1      23Jun24  4days 16:53m   ?    xinit /home/trevor/.xinitrc -- /etc/X11/xinit/xserverrc :0 vt1 -keeptty -auth /tmp/serverauth.Bpopi5xqRs
trevor             23Jun24 24:52m  0.00s  1:06  /usr/lib/systemd/systemd --user --deserialize=9
$ w -s
 17:43:15 up 68 days,  2:49,  2 users,  load average: 1.64, 1.17, 0.94
USER     TTY         IDLE WHAT
trevor   tty1       4days xinit /home/trevor/.xinitrc -- /etc/X11/xinit/xserverrc :0 vt1 -keeptty -auth /tmp/serverauth.Bpopi5xqRs
trevor             24:51m /usr/lib/systemd/systemd --user --deserialize=9

The root problem is either with procps-ng or Debian or Ubuntu, depending on whether it is patched by the distros or not, but it would be nice to find a workaround on the Circadian side in the meantime.

If I get time (big if), I'll do some testing in an Ubuntu VM and see if I can reproduce and work around it.

mrmekon commented 2 months ago

Ok, it's upstream and it was already patched 11 months ago, but not released yet. It has something to do with the systemd environment, which is probably why only some of us hit it.

Patch in procps-ng: https://gitlab.com/procps-ng/procps/-/commit/79042e07fab9956135a21b1df7a69d1fbde7ef79 Issue where it was reported: https://gitlab.com/procps-ng/procps/-/issues/301

sghael commented 2 months ago

yep, that's it. Thanks for finding the upstream bug.

I'm looking at the circadian code and wonder if it solves the problem if 1) w_from_args removed all references to the -s flag 2) idle_w also didn't use the -s flag and then explicitly calculates the idle field offset, instead of assuming it is always the second to last.

mrmekon commented 2 months ago

Yes, dropping -s and counting fields is an option.

But now that I see what it's doing, I'm going to look into bypassing w entirely. It's a lot of sketchy string parsing for what appears to be just checking the atime of /dev/tty*, which can just be done directly. Most of its logic is about choosing which tty's to check... I wonder if there's anything wrong with just checking all of them.

sghael commented 2 months ago

@mrmekon I took a crack at removing w. I tested a bit on my Ubuntu box and the tty settings work now. I am less sure if the X11 checks are still working properly. It would be great to add some unit test here based on expected signals, but I've left that for another day.

PR: https://github.com/mrmekon/circadian/pull/32