swaywm / sway

i3-compatible Wayland compositor
https://swaywm.org
MIT License

Sway hangs and endlessly accumulates file descriptors in some weird circumstances related to Xwayland and the clipboard #7893

Open meithecatte opened 9 months ago

meithecatte commented 9 months ago

I do not know how to reproduce this bug. However, I may have managed to capture enough information when it happened to make it possible to fix it nevertheless.

--> Maja (~quassel@45.142.146.28) has joined #sway
hello, my sway session has hung on me and i am trying to debug why
a few minutes ago, it had 207686 file descriptors open. a moment later, it had 415565. I then sent a SIGSTOP to the process
it's doing the following sequence of syscalls ad infinitum:

pipe2([471095, 471096], 0)              = 0
fcntl(471095, F_SETFD, FD_CLOEXEC)      = 0
fcntl(471095, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
fcntl(471096, F_SETFD, FD_CLOEXEC)      = 0
fcntl(471096, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
poll([{fd=68, events=POLLIN|POLLOUT}], 1, -1) = 1 ([{fd=68, revents=POLLOUT}])
writev(68, [{iov_base="\1\0\t\0Fe\"\0\317\3\0\0\0\0\0\0\n\0\n\0\0\0\1\0@\0\0\0\0\10\0\0"..., iov_len=36}], 1) = 36
poll([{fd=68, events=POLLIN|POLLOUT}], 1, -1) = 1 ([{fd=68, revents=POLLIN|POLLOUT}])
recvmsg(68, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\20\0\273\255\317\3\0\0Fe\"\0\0\0\0\0\n\0\n\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=4096}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 32
writev(68, [{iov_base="\30\0\6\0Fe\"\0\1\0\0\0\366\0\0\0\f\1\0\0\0\0\0\0", iov_len=24}], 1) = 24
fcntl(471096, F_SETFL, O_WRONLY|O_NONBLOCK) = 0
fcntl(471095, F_DUPFD_CLOEXEC, 0)       = 471097
epoll_ctl(3, EPOLL_CTL_ADD, 471097, {events=EPOLLIN, data={u32=2913174560, u64=94827201135648}}) = 0
poll([{fd=68, events=POLLIN|POLLOUT}], 1, -1) = 1 ([{fd=68, revents=POLLIN|POLLOUT}])
recvmsg(68, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\36\0\274\255\0\0\0\0\2\0 \0Fe\"\0\1\0\0\0\366\0\0\0\f\1\0\0\0\0\0\0", iov_len=4096}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 32
writev(68, [{iov_base="\16\0\2\0Fe\"\0\2\0\4\0Fe\"\0\0\10\0\0\0\0`\0", iov_len=24}], 1) = 24
poll([{fd=68, events=POLLIN}], 1, -1)   = 1 ([{fd=68, revents=POLLIN}])
recvmsg(68, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1\30\275\255\0\0\0\0\317\3\0\0\0\0\0\0\n\0\n\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=4096}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 32

(addendum: immediately afterwards a call to pipe2 follows, and so on)
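To make the descriptor growth concrete, here is a minimal Python sketch of the pattern visible in the trace: each iteration creates a pipe, duplicates the read end, registers the duplicate with epoll, and no close() appears anywhere in the traced sequence. This is not the wlroots code; `start_transfer_leaky` and `finish_transfer` are hypothetical names, and the cleanup function only illustrates what would have to happen for the count to stay flat.

```python
import fcntl
import os
import select


def count_open_fds() -> int:
    """Count this process's open descriptors via /proc (Linux only)."""
    return len(os.listdir("/proc/self/fd"))


def start_transfer_leaky(epoll: select.epoll) -> tuple:
    """Hypothetical stand-in for one iteration of the traced loop."""
    # pipe2(): the trace sets FD_CLOEXEC and O_NONBLOCK afterwards with
    # fcntl(); passing the flags directly gives the same end state.
    read_end, write_end = os.pipe2(os.O_CLOEXEC | os.O_NONBLOCK)
    # fcntl(read_end, F_DUPFD_CLOEXEC, 0): a second descriptor for the
    # event loop to watch.
    watched = fcntl.fcntl(read_end, fcntl.F_DUPFD_CLOEXEC, 0)
    # epoll_ctl(EPOLL_CTL_ADD, watched, EPOLLIN)
    epoll.register(watched, select.EPOLLIN)
    # Nothing is closed here, so three descriptors stay open per call.
    return read_end, write_end, watched


def finish_transfer(epoll: select.epoll, fds: tuple) -> None:
    """Assumed cleanup: unregister the watched fd and close all three."""
    _read_end, _write_end, watched = fds
    epoll.unregister(watched)
    for fd in fds:
        os.close(fd)


if __name__ == "__main__":
    ep = select.epoll()
    before = count_open_fds()
    held = [start_transfer_leaky(ep) for _ in range(100)]
    print("leaked descriptors:", count_open_fds() - before)
    for fds in held:
        finish_transfer(ep, fds)
    ep.close()
```

This matches the numbering in the trace, where one iteration produces fds 471095, 471096 and 471097 in a row.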

Long story short, I established that fd 68 is Xwayland, and captured a backtrace from one of the calls to fcntl in the strace'd sequence (a breakpoint for pipe2 wasn't hitting):

try to figure out who the client on the other end of fd 68 is
I suspect that client is repeating a request forever that is using up an fd
that or attach gdb and get a backtrace for what is calling pipe2
the client is Xwayland
because of course
lovely
okay, I *really* don't like what I'm seeing https://paste.debian.net/1302857/ [not reproducing that paste inline because it's not too relevant]
i'm not sure how to get the symbols to show up here but i know that /usr/lib/dri/radeonsi_dri.so in your backtrace is computer for "this is not a place of honor"
it's probably calling the syscall via some other wrapper, lovely
or did you actually find a caller of pipe2?
well it's the backtrace from a breakpoint on pipe2 being hit
wait no
sorry, I did a dumb
okay yeah, the breakpoint on pipe2 isn't being hit
but also, gdb is not showing symbols. when it usually does. probably because i had to run it as root
maybe fcntl might be hit
not sure *why*, because /proc/sys/kernel/yama/ptrace_scope is already 0
ok, that does it
Thread 1 "sway" hit Breakpoint 2, 0x00007f3c471a52f0 in fcntl64 ()
   from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007f3c471a52f0 in fcntl64 () at /usr/lib/libc.so.6
#1  0x00007f3c4741eb52 in  () at /usr/lib/libwayland-server.so.0 [later resolved as wl_os_dupfd_cloexec.constprop.0]
#2  0x00007f3c474207b9 in wl_event_loop_add_fd () at /usr/lib/libwayland-server.so.0
#3  0x00007f3c473a6c01 in  () at /usr/lib/libwlroots.so.11 [later resolved as xwm_selection_transfer_start_outgoing.lto_priv.0]
#4  0x00007f3c473b1651 in  () at /usr/lib/libwlroots.so.11 [later resolved as xwm_map_shell_surface]
#5  0x00007f3c47422b8f in wl_event_loop_dispatch ()
    at /usr/lib/libwayland-server.so.0
#6  0x00007f3c474232d7 in wl_display_run () at /usr/lib/libwayland-server.so.0
#7  0x0000563e98eb2af5 in  ()
#8  0x00007f3c470cdcd0 in  () at /usr/lib/libc.so.6
#9  0x00007f3c470cdd8a in __libc_start_main () at /usr/lib/libc.so.6
#10 0x0000563e98eb2fa5 in  ()

We deliberated on why the symbols aren't there:

doesn't look terribly useful to me :<
if you can get debug symbols for the wlroots SO that would help point out what needs an FD
usually gdb offers "do you want to download symbols from the internet for this session"
but apparently it doesn't do that when you run it as root
probably helping you be secure :p
sway generally doesn't need to run in a less-debuggable context but I think some launchers might inherit that anti-ptrace bit
s/launcher/display manager/
but doctor. my launcher is "type exec sway in tty1"
well, you've confused me. That's how I do it and I can attach strace/gdb.
i know! it's weird!

also, i tried sigstoping just the xwayland process but the session is still hung, sway stops in poll(68)
I think sway survives killing the Xwayland process
but that's destructive to X programs, of course
ok, i tried the sigstop again, and now it got a different timing and the session itself is alive again
now i get to use the good keyboard and monitor instead of ssh'ing in from my laptop :3
> destructive to X programs
of course
but also: i want to debug this now
sadly I have no idea how xwayland works so won't be much help
okay, welp, i flew too close to the sun and it got it out of the weird state and can't trigger it again
fwiw the specific xwayland client was gimp's save file dialog
and it closed all the file descriptors, too!

Nevertheless, I managed to decode the backtrace I captured earlier:

okay, i can still resolve the backtrace manually :p
mostly for my reference, here's the memory layout of my sway process https://paste.debian.net/1302863/
you can turn that plus the gdb bt into offsets into your .so and then gdb a sway process (try a nested one) to get symbols
just re-apply the offset at the new ASLR
I've had to do that before when I only had debugsyms on one system and a crash on another
oh, i was about to nm | sort | eyeball the thing
apparently one of the calls in that backtrace is xwm_selection_transfer_start_outgoing.lto_priv.0. that looks... fun
it gets called by xwm_map_shell_surface
the immediate caller of fcntl64 is wl_os_dupfd_cloexec.constprop.0
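The manual resolution described above can be sketched in Python as follows. The memory-layout paste is not reproduced here, so the `/proc/<pid>/maps` excerpts below are made up for illustration; only the frame #3 address comes from the backtrace. The sketch also assumes the library's first segment is linked at address 0, which is typical for shared objects, so that "address minus lowest mapping start" lines up with what `nm` prints.

```python
from typing import Dict, Optional, Tuple

# HYPOTHETICAL maps excerpts; the real layout is in the paste linked above.
OLD_MAPS = """\
7f3c47350000-7f3c47362000 r--p 00000000 fe:01 1234 /usr/lib/libwlroots.so.11
7f3c47362000-7f3c473d0000 r-xp 00012000 fe:01 1234 /usr/lib/libwlroots.so.11
7f3c4741a000-7f3c47426000 r-xp 00006000 fe:01 5678 /usr/lib/libwayland-server.so.0
"""
NEW_MAPS = """\
7f9a12000000-7f9a12012000 r--p 00000000 fe:01 1234 /usr/lib/libwlroots.so.11
7f9a12012000-7f9a12080000 r-xp 00012000 fe:01 1234 /usr/lib/libwlroots.so.11
"""


def load_bases(maps_text: str) -> Dict[str, Tuple[int, int]]:
    """Map each file path to (lowest start, highest end) of its mappings."""
    spans: Dict[str, Tuple[int, int]] = {}
    for line in maps_text.splitlines():
        fields = line.split(maxsplit=5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        start, end = (int(x, 16) for x in fields[0].split("-"))
        path = fields[5]
        lo, hi = spans.get(path, (start, end))
        spans[path] = (min(lo, start), max(hi, end))
    return spans


def module_offset(maps_text: str, addr: int) -> Optional[Tuple[str, int]]:
    """Turn a runtime address into (library path, offset from load base)."""
    for path, (lo, hi) in load_bases(maps_text).items():
        if lo <= addr < hi:
            return path, addr - lo
    return None


def rebase(maps_text: str, path: str, offset: int) -> int:
    """Re-apply an offset at another process's (differently randomized) base."""
    return load_bases(maps_text)[path][0] + offset


if __name__ == "__main__":
    frame3 = 0x00007F3C473A6C01  # frame #3 from the backtrace above
    path, off = module_offset(OLD_MAPS, frame3)
    print(path, hex(off), hex(rebase(NEW_MAPS, path, off)))
```

The resulting offset can be compared against `nm` output for the same build of the library, or rebased into a second process that does have symbols, as suggested in the chat.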

ah, clipboard stuff?
the hang happened after I fat-fingered some keybind, so, it might be clipboard-related
that might be enough for someone who knows wlroots/xwayland to figure out what happened. I suspect a loop of selection-set and paste events or something, resolved by you selecting something while xwayland was paused
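One way to check the selection hunch against the strace above is to decode the X11 requests that sway writes to fd 68. The sketch below hand-transcribes two of the `writev` payloads and splits them using the core protocol's request header (one opcode byte, one detail byte, then a 16-bit length in 4-byte units). It assumes little-endian byte order, and the opcode table covers only the requests that appear in this trace. If the transcription is right, the 24-byte write in each iteration is a ConvertSelection for atom 1, which is the predefined PRIMARY atom; the target and property atoms are interned at runtime and cannot be named from the trace alone.

```python
import struct
from typing import Dict, Iterator, Tuple

# Payloads copied by hand from the strace output (octal escapes as hex).
WRITE_24_A = bytes.fromhex(
    "18 00 06 00 46 65 22 00 01 00 00 00 f6 00 00 00 0c 01 00 00 00 00 00 00"
)
WRITE_24_B = bytes.fromhex(
    "0e 00 02 00 46 65 22 00 02 00 04 00 46 65 22 00 00 08 00 00 00 00 60 00"
)

# Core-protocol major opcodes seen in this trace only.
OPCODE_NAMES = {
    1: "CreateWindow",
    2: "ChangeWindowAttributes",
    14: "GetGeometry",
    24: "ConvertSelection",
}


def split_requests(buf: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (opcode, raw request) for each request packed into buf."""
    pos = 0
    while pos + 4 <= len(buf):
        opcode, _detail, words = struct.unpack_from("<BBH", buf, pos)
        size = words * 4
        if size == 0 or pos + size > len(buf):
            break  # BIG-REQUESTS or truncated data: out of scope here
        yield opcode, buf[pos:pos + size]
        pos += size


def decode_convert_selection(req: bytes) -> Dict[str, int]:
    """Unpack the fixed 24-byte ConvertSelection request."""
    _op, _pad, _len, requestor, selection, target, prop, time = struct.unpack(
        "<BBHIIIII", req
    )
    return {
        "requestor": requestor,
        "selection": selection,
        "target": target,
        "property": prop,
        "time": time,
    }


if __name__ == "__main__":
    for payload in (WRITE_24_A, WRITE_24_B):
        for opcode, req in split_requests(payload):
            print(OPCODE_NAMES.get(opcode, opcode), len(req))
    print(decode_convert_selection(WRITE_24_A))
```

A fresh ConvertSelection per iteration would be consistent with the "loop of selection events" guess, though it does not by itself show what keeps re-triggering the transfer.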

joanbm commented 3 months ago

This is probably the same issue as #6974 and/or #7139.