High CPU until first event is written over IPC socket

sodface commented 1 year ago

While testing scripts with the IPC socket I noticed that one core on my 4 core desktop goes to 100% and stays there until the first event occurs and is read by the script, at which point the CPU usage drops back down and the script and event processing seems to progress as expected from then on.

This also occurs when just connecting directly with nc (I tested with both busybox nc and openbsd-netcat), single core CPU goes to 100% until the first event occurs.

The process at 100% is cagebreak -e so it seems like a cagebreak issue at first glance but maybe it's something on the client side?

I'm a novice at troubleshooting this type of thing, but strace is logging a lot of epoll_pwait calls during high CPU and then those taper off dramatically after the first event and CPU usage goes down.

~ $ doas -c -f strace -p 6983
strace: Process 6983 attached with 11 threads
^Cstrace: Process 6983 detached
strace: Process 6984 detached
strace: Process 6985 detached
strace: Process 6986 detached
strace: Process 6987 detached
strace: Process 6988 detached
strace: Process 6989 detached
strace: Process 6990 detached
strace: Process 6991 detached
strace: Process 6992 detached
strace: Process 6993 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 94.62    2.724267           1   2096153           epoll_pwait
  4.73    0.136285          20      6719        28 ioctl
  0.51    0.014639           5      2913       397 read
  0.09    0.002717           6       428           sendmsg
  0.01    0.000409           3       114           write
  0.01    0.000290           9        32           recvmsg
  0.01    0.000183          10        17           munmap
  0.01    0.000153           8        18           mmap
  0.01    0.000145           6        24           timerfd_settime
  0.00    0.000031           3         8           fcntl
  0.00    0.000014          14         1           accept
  0.00    0.000012           6         2           epoll_ctl
------ ----------- ----------- --------- --------- ----------------
100.00    2.879145           1   2106429       425 total

project-repo commented 1 year ago

Hi sodface

this certainly sounds worrying. However, we just tried to reproduce this bug on two independent machines and were unable to do so.

Could you please state any available details on your setup to enable reproducibility on our side (flags used for compiling cagebreak, OS, exact command which was run, environment (e.g. TTY or nested within a wayland session), etc.)?

cheers project-repo

sodface commented 1 year ago

Hi project-repo, I just did a fresh install on the same hardware. Some details:

Alpine Linux 3.17.3
Cagebreak 2.1.1 (See APKBUILD here)
As for compiler flags... I'm not 100% sure, I'm using the package built from Alpine's builder, the CI build log can viewed here if that helps at all.

I'm starting cagebreak from TTY. Contents of .profile:

#!/bin/sh

mkdir -m 0700 -p /tmp/user/$(id -u)
export GDK_BACKEND=wayland
export HISTSIZE=131172 # 128K lines
export TERMINAL=sakura
export XDG_RUNTIME_DIR=/tmp/user/$(id -u)
export MOZ_ENABLE_WAYLAND=1
export EGL_PLATFORM=wayland

if [ -z "$DISPLAY" ] && [ "$(tty)" = "/dev/tty1" ]; then
    exec dbus-run-session -- cagebreak -e
fi

The screen recording shows my second monitor with 1 split, htop above and then connecting to IPC with busybox nc below. cagebreak -e goes to 100% until I move my mouse from the bottom tile to the top causing an event, at which time CPU usage drops off.

https://user-images.githubusercontent.com/46695384/234123828-b8abbbc6-42fd-4c33-b319-0fbe8b0e8229.mp4

nekopsykose commented 1 year ago

a perf top for the process looks something like

   PerfTop:    3874 irqs/sec  kernel:49.6%  exact: 99.9% lost: 0/0 drop: 0/0 [4000Hz cycles],  (target_pid: 18082)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------                                                                                                                

     7.83%  [kernel]                     [k] __put_user_nocheck_4
     6.37%  libwayland-server.so.0.22.0  [.] wl_event_loop_dispatch
     5.81%  [kernel]                     [k] do_epoll_wait
     5.67%  [kernel]                     [k] __put_user_nocheck_8
     4.57%  libwayland-server.so.0.22.0  [.] 0x0000000000005858
     4.55%  libwayland-client.so.0.22.0  [.] 0x0000000000003fbf
     4.07%  [kernel]                     [k] __entry_text_start
     3.55%  ld-musl-x86_64.so.1          [.] __syscall_cp_c
     3.03%  [kernel]                     [k] syscall_enter_from_user_mode
     2.67%  ld-musl-x86_64.so.1          [.] __pthread_mutex_lock
     2.53%  ld-musl-x86_64.so.1          [.] __pthread_mutex_unlock
     2.50%  [kernel]                     [k] entry_SYSRETQ_unsafe_stack
     2.33%  [kernel]                     [k] __fget_files
     1.99%  [kernel]                     [k] _raw_write_lock_irq
     1.96%  libwayland-client.so.0.22.0  [.] wl_display_dispatch_queue_pending@plt
     1.92%  ld-musl-x86_64.so.1          [.] __syscall_cp_asm
     1.83%  [kernel]                     [k] do_syscall_64
     1.67%  [kernel]                     [k] unix_poll
     1.61%  [kernel]                     [k] entry_SYSCALL_64_after_hwframe
     1.55%  libwayland-client.so.0.22.0  [.] pthread_mutex_unlock@plt
     1.51%  libwayland-client.so.0.22.0  [.] wl_display_dispatch_queue_pending
     1.47%  ld-musl-x86_64.so.1          [.] __cp_end
     1.35%  [kernel]                     [k] syscall_exit_to_user_mode
     1.23%  [kernel]                     [k] __se_sys_epoll_pwait
     1.13%  [kernel]                     [k] _raw_write_unlock_irq
     1.12%  libwayland-server.so.0.22.0  [.] wl_event_loop_dispatch_idle
     1.10%  [kernel]                     [k] mutex_lock
     1.09%  libwayland-server.so.0.22.0  [.] wl_list_empty
     1.07%  ld-musl-x86_64.so.1          [.] epoll_pwait
     0.99%  [kernel]                     [k] ep_done_scan
     0.99%  [kernel]                     [k] sock_poll
     0.97%  libwayland-client.so.0.22.0  [.] wl_display_flush
     0.93%  [kernel]                     [k] syscall_return_via_sysret
     0.93%  libwayland-server.so.0.22.0  [.] wl_display_flush_clients
     0.86%  ld-musl-x86_64.so.1          [.] __syscall_ret
     0.86%  libwayland-client.so.0.22.0  [.] pthread_mutex_lock@plt
     0.82%  libwayland-client.so.0.22.0  [.] wl_list_empty@plt
     0.64%  libwayland-server.so.0.22.0  [.] wl_event_loop_dispatch_idle@plt
     0.60%  [kernel]                     [k] __fdget
     0.59%  libwayland-server.so.0.22.0  [.] wl_display_run
     0.59%  cagebreak                    [.] ipc_client_handle_writable
     0.56%  libwayland-server.so.0.22.0  [.] wl_list_init
     0.54%  [kernel]                     [k] exit_to_user_mode_prepare
     0.52%  libwlroots.so.11             [.] 0x0000000000053d4a
     0.41%  libwayland-server.so.0.22.0  [.] 0x0000000000004822
     0.40%  [kernel]                     [k] mutex_unlock
     0.31%  [kernel]                     [k] set_user_sigmask
     0.31%  libwlroots.so.11             [.] 0x0000000000053d43

i'd guess something calls ipc_client_handle_writable repeatedly until an event is triggered for some reason

nekopsykose commented 1 year ago

i can't get any readable callgraph, but more or less it starts in wlroots

project-repo commented 1 year ago

Hi everyone, Thanks a lot for the help with debugging and additional system information. We are now able to reproduce this and pushed a tentative fix for it yesterday (see development branch). Can you confirm that this fixes the issue? Cheers, project-repo

nekopsykose commented 1 year ago

fixes it for me in my testing

sodface commented 1 year ago

I'm not as fast as psykose but I'll test asap. Thanks!

sodface commented 1 year ago

Fixes it for me too. Thanks again!

project-repo commented 1 year ago

Fixed in 2.2.0

project-repo / cagebreak

High CPU until first event is written over IPC socket #46