swaywm / sway

i3-compatible Wayland compositor
https://swaywm.org
MIT License

sway leaves drm/kms console in unrecoverable state #4540

Open ghost opened 4 years ago

ghost commented 4 years ago

If the sway process running as an unprivileged user dies unexpectedly, the system DRM/KMS console is left in an unrecoverable state. The console comes back up, but the cursor doesn't blink and pressing keys on the keyboard does nothing. The only ways to recover are to (a) power cycle or (b) ssh in remotely and start Xorg, which knows how to fix things.

This is hugely awkward for laptops when you're someplace without a second machine and network connection from which it is safe to log in to the stuck laptop. It's also a pain for people hacking on sway.

If there is some documentation somewhere that explains how to "reset the DRM state" please point me to it and I'll write a recovery utility. I searched and could not find anything explaining how to do this. Evidently Xorg knows how to do it, but searching for the line or two of code that accomplishes this in their very very large codebase seems like a really difficult way of going about fixing this.

To reproduce:

  1. Start sway

  2. There will be two sway processes, one of them running as root. Send a "kill -9" to the other one.

Note that "kill -9" is simply the most reliable way to cause the problem. Many different flavors of crashage manifest this way.

Sway version: master (0ad5e355bd8c5035f9219aa068418c38a6bbd4b8)

Debug log:

... normal stuff
... kill -9 sent
2019-09-04 06:17:53 - [common/ipc-client.c:88] Unable to receive IPC response

emersion commented 4 years ago

Yes, it would be nice to somehow make the child helper exit cleanly if the parent dies.

ghost commented 4 years ago

Zombie processes I can live with... the frozen console is a bigger problem.

ghost commented 4 years ago

Related inquiry: https://stackoverflow.com/questions/38978526/need-to-invoke-ioctltty0-fd-kdsetmode-kd-text-upon-abnormal-termination

ghost commented 4 years ago

The following C program will fix the console after a sway crash. The remaining issue is whether and how to ensure that this code gets invoked when sway dies ungracefully. In the meantime I've bound it to a non-keyboard input switch on my laptop.

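#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vt.h>
#include <linux/kd.h>

int main(void) {
  /* /dev/tty0 is the currently active virtual console. */
  int fd = open("/dev/tty0", O_RDWR);
  if (fd < 0) { perror("open /dev/tty0"); exit(EXIT_FAILURE); }
  ioctl(fd, KDSKBMODE, K_XLATE);  /* keyboard back to translated mode */
  ioctl(fd, KDSETMODE, KD_TEXT);  /* console back to text mode */
  struct vt_mode mode = { 0 };
  mode.mode = VT_AUTO;            /* kernel handles VT switching again */
  ioctl(fd, VT_SETMODE, &mode);
  close(fd);
  return 0;
}
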
emersion commented 4 years ago

What happens to the privileged helper process when the compositor dies?

ghost commented 4 years ago

I'm still sorting that out. If you "kill -9" the unprivileged child the console gets stuck and the parent does not notice, so clearly we should improve that situation -- have the parent notice the child is dead, fix the console, then exit.

The question of the parent process dying is more complicated. Is there a list somewhere of the duties the parent process is responsible for after spawning the child? Or is the parent's job mainly to do privileged things before spawning the child?

If the privileged parent is still doing complicated things that could cause it to crash in a development scenario then it might make sense to have an off-by-default "developer mode" launch where there are three processes instead of two; the new process is the parent of the privileged process and does nothing but watch it for unexpected exit.

emersion commented 4 years ago

The child dying is unlikely -- the child is by design very simple.

> Is there a list somewhere of the duties the parent process is responsible for after spawning the child? Or is the parent's job mainly to do privileged things before spawning the child?

https://github.com/swaywm/wlroots/blob/master/backend/session/direct-ipc.c#L128

If the parent dies, the child should too since it's reading the socketpair.
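
The socket-based detection described here can be sketched as follows. This is a minimal, hypothetical demo and not wlroots code: the function name `helper_sees_parent_death`, the status pipe, and the 'R'/'E' marker bytes are assumptions made purely for illustration. It shows a helper blocked in read() on its end of a socketpair getting EOF (a return value of 0) once the process holding the other end is killed with SIGKILL, provided the helper has closed its own inherited copy of that other end.

```c
#include <signal.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo. An observer forks a "compositor", which creates a
 * socketpair and forks a "helper". The observer then SIGKILLs the
 * compositor. Returns 1 if the helper observed EOF on its socket,
 * 0 if it did not, -1 on setup failure.
 */
int helper_sees_parent_death(void) {
    int report[2];                     /* helper -> observer status pipe */
    if (pipe(report) < 0) return -1;

    pid_t compositor = fork();
    if (compositor < 0) return -1;

    if (compositor == 0) {
        close(report[0]);
        int sv[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) _exit(1);

        pid_t helper = fork();
        if (helper < 0) _exit(1);
        if (helper == 0) {
            /* The helper must drop the compositor's end, or EOF never arrives. */
            close(sv[0]);
            char c = 'R';              /* 'R': helper is ready */
            if (write(report[1], &c, 1) != 1) _exit(1);

            char buf[64];
            ssize_t n = read(sv[1], buf, sizeof buf);  /* blocks until EOF */
            c = (n == 0) ? 'E' : '?';  /* 'E': saw EOF, the peer is gone */
            /* A real helper could restore the console at this point. */
            if (write(report[1], &c, 1) != 1) _exit(1);
            _exit(0);
        }

        close(sv[1]);
        close(report[1]);
        for (;;) pause();              /* compositor idles until killed */
    }

    /* Observer: wait for the helper, kill the compositor, read the verdict. */
    close(report[1]);
    char c = 0;
    int ready = (read(report[0], &c, 1) == 1 && c == 'R');
    kill(compositor, SIGKILL);
    waitpid(compositor, NULL, 0);
    int saw_eof = ready && read(report[0], &c, 1) == 1 && c == 'E';
    close(report[0]);
    return saw_eof ? 1 : 0;
}
```

The detail that matters is the close(sv[0]) in the helper: if any surviving process still holds the compositor's end open, the read never returns 0.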

> three processes instead of two

I think we can work out a solution without introducing a third process.

ghost commented 4 years ago

Oh sorry, I had that backwards: the privileged process is the child, not the parent. I should have said:

"If you "kill -9" the unprivileged parent the console gets stuck"

> https://github.com/swaywm/wlroots/blob/master/backend/session/direct-ipc.c#L128

Thanks (not a lot of comments there though!). So, if I understand this correctly, the only thing the long-lived-root-process does is call drmSetMaster/drmDropMaster when asked to do so by the unprivileged child? I guess this is to allow for VT switching? Very weird that this is based on being root rather than being able to write to /dev/dri/cardX.

Is there a reason why the privileged process must be the child rather than the parent? If it were the parent it could respond to SIGCHLD by sending SIGTERM to itself, and respond to SIGTERM by cleaning up the console. That should be pretty reliable.
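
The "privileged parent watches the compositor child" idea could look roughly like the sketch below. It is written under stated assumptions and is not a proposed patch: `supervise`, `cleanup_fn`, and the demo functions are hypothetical names, and it uses a blocking waitpid() rather than a SIGCHLD handler to keep the example short. A real helper would also have to keep servicing requests from the compositor while it waits.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical cleanup hook; a real one might issue the console ioctls
 * from the recovery program posted earlier in this thread. */
typedef void (*cleanup_fn)(void);

/*
 * Fork child_main as a child process and wait for it. If the child was
 * killed by a signal or exited with a non-zero status, run cleanup.
 * Returns 1 if cleanup ran, 0 if the child exited cleanly, -1 on error.
 */
int supervise(int (*child_main)(void), cleanup_fn cleanup) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) _exit(child_main());

    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) return -1;   /* retry only if interrupted */
    }
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0) return 0;

    cleanup();                           /* child crashed or failed */
    return 1;
}

/* Demo child that simulates an ungraceful crash, as "kill -9" would. */
int crash_with_sigkill(void) {
    kill(getpid(), SIGKILL);
    pause();
    return 1;                            /* not reached */
}

/* Demo child that exits normally. */
int exit_cleanly(void) { return 0; }

/* Demo cleanup that only counts how often it was called. */
int cleanup_calls = 0;
void count_cleanup(void) { cleanup_calls++; }
```

Calling supervise(crash_with_sigkill, count_cleanup) runs the cleanup hook once, while supervise(exit_cleanly, count_cleanup) leaves it untouched.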

emersion commented 4 years ago

> So, if I understand this correctly, the only thing the long-lived-root-process does is call drmSetMaster/drmDropMaster when asked to do so by the unprivileged child?

No, it's the other way around. The unprivileged parent is the compositor (and is likely to crash). The privileged child does the DRM stuff.

> Is there a reason why the privileged process must be the child rather than the parent?

We don't want to do privileged operations in the parent (the compositor).

ghost commented 4 years ago

> > Is there a reason why the privileged process must be the child rather than the parent?

> We don't want to do privileged operations in the parent (the compositor).

Yes, of course you don't want to do privileged operations in the compositor.

My question is why can't the unprivileged compositor be the child process, forked from the privileged non-compositor parent process. Basically change "if (pid < 0)" to "if (pid >= 0)" in direct_ipc_init().

The main point here is that we have very reliable mechanisms to find out that a child process died -- SIGCHLD works even if you're blocked reading from a pipe whose other end got wedged. So we'd like to use that mechanism to have the simple small privileged process (which is unlikely to crash due to bugs) find out that the other large complex compositor process died, and at least put the KMS/DRM console+keyboard into a usable state before bailing out.

I don't think POSIX provides any reliable mechanism for being notified that your parent process died, only that a child process died.

emersion commented 4 years ago

> My question is why can't the unprivileged compositor be the child process, forked from the privileged non-compositor parent process. Basically change "if (pid < 0)" to "if (pid >= 0)" in direct_ipc_init().

Nah, it would be unexpected for this to happen behind the scenes. I also don't think it's necessary. The child already knows if the parent died, because the socket will be closed.

> SIGCHLD

This relies on global state.

12101111 commented 4 years ago

I have encountered this problem. Sway crashed after I exited it using Ctrl-Shift-E, and the screen froze. I had to run startx over ssh to get a working X session. Log in dmesg:

[  525.828818]  sway[2200]: segfault at 41 ip 00007fdd70869420 sp 00007ffd3dbdd058 error 4 in libwayland-server.so.0.1.0[7fdd70863000+7000]
[  525.828831] Code: 08 c3 0f 1f 84 00 00 00 00 00 48 8b 47 08 48 89 3e 48 89 46 08 48 89 77 08 48 8b 46 08 48 89 30 c3 66 0f 1f 84 00 00 00 00 00 <48> 8b 17 48 8b 47 08 48 89 42 08 48 89 10 48 c7 47 08 00 00 00 00

I'm using x86_64-gentoo-linux-musl on a laptop with Intel graphics. Sway is launched using consolekit2.

ascent12 commented 4 years ago

@12101111 consolekit2 was never something we supported.