swaywm / sway

i3-compatible Wayland compositor
https://swaywm.org
MIT License

sway crash on resume #5198

Open jsravn opened 4 years ago

jsravn commented 4 years ago

I have swayidle configured to suspend the system after a while:

exec swayidle -w \
  timeout 600 'swaymsg "output * dpms off"' \
    resume 'swaymsg "output * dpms on"' \
  timeout 900 '$lock' \
  timeout 1800 'systemctl suspend' \
  before-sleep '$lock' \
  timeout 30 'if pgrep swaylock; then swaymsg "output * dpms off"; fi' \
    resume 'swaymsg "output * dpms on"'

When it resumes, it sometimes crashes. From the backtrace it seems to be crashing when doing output * dpms on.

Note this is a dual-graphics machine, and I'm using the Intel graphics (I'm also setting WLR_DRM_DEVICES=/dev/dri/card0 just in case, based on some prior issues).

#0  0x00007fc21f2df154 in wlr_egl_make_current (egl=0x10, surface=0x0, buffer_age=buffer_age@entry=0x0) at ../render/egl.c:378
#1  0x00007fc21f2e8ec4 in make_drm_surface_current (surf=surf@entry=0x55cb668d8878, buffer_damage=buffer_damage@entry=0x0) at ../backend/drm/renderer.c:142
#2  0x00007fc21f2e8f7b in get_drm_surface_front (surf=0x55cb668d8878) at ../backend/drm/renderer.c:163
        renderer = 
#3  0x00007fc21f2e4b71 in drm_connector_pageflip_renderer (conn=0x55cb6709b100, mode=0x55cb673ad4d0) at ../backend/drm/drm.c:601
        drm = 0x55cb668d7ac0
        crtc = 0x55cb668d7c90
        plane = 0x55cb668d8870
        bo = 
        fb_id = 
#4  0x00007fc21f2e4e77 in drm_connector_start_renderer (conn=0x55cb6709b100) at ../backend/drm/drm.c:615
        mode = 
#5  0x00007fc21f2e6621 in enable_drm_connector (output=output@entry=0x55cb6709b100, enable=) at ../backend/drm/drm.c:727
        conn = 0x55cb6709b100
        drm = 0x55cb668d7ac0
        ok = 
#6  0x00007fc21f2e6bc0 in drm_connector_commit (output=0x55cb6709b100) at ../backend/drm/drm.c:481
        drm = 
#7  drm_connector_commit (output=0x55cb6709b100) at ../backend/drm/drm.c:455
        drm = 
#8  0x00007fc21f318088 in wlr_output_commit (output=output@entry=0x55cb6709b100) at ../types/wlr_output.c:515
        enabled = 
        now = {tv_sec = 65620, tv_nsec = 604106635}
        event = {output = 0x55cb6709b100, when = 0x7ffdcbd7abe0}
        scale_updated = 
        geometry_updated = 
#9  0x000055cb65195f1b in apply_output_config (oc=oc@entry=0x55cb679074e0, output=output@entry=0x55cb67a33f50) at ../sway/sway/config/output.c:402
        wlr_output = 0x55cb6709b100
        was_enabled = true
        scale = 
        output_box = 
#10 0x000055cb65196a0b in apply_output_config_to_outputs (oc=0x55cb6706d110) at ../sway/sway/config/output.c:595
        current = 0x55cb679074e0
        name = 0x55cb6709b130 "DP-1"
        wildcard = true
        id = "Goldstar Company Ltd 32GK850G #ASPp7EV3e47d\000\000\000\000\000\324\300\033e\313U\000\000\000\000\000\000\000\000\000\000@X\206g\313U\000\000\243(\027e\313U\000\000PX\206g\313U\000\000@X\206g\313U\000\000\060\252\006g\313U\000\000\002\000\000\000\000\000\000\000@\273\035e\313U\000\000\020\001\000\000\000\000\000"
        sway_output = 0x55cb67a33f50
        tmp = 0x55cb670481f0
        seat = 
#11 0x000055cb651a129d in cmd_output (argc=0, argv=0x55cb67865850) at ../sway/sway/commands/output.c:108
        error = 
        output = 
        background = false
#12 0x000055cb65173538 in execute_command (_exec=_exec@entry=0x55cb673149c0 "output * dpms on", seat=0x55cb67048990, seat@entry=0x0, con=con@entry=0x0)
    at ../sway/sway/commands.c:286
        node = 
        res = 
        argc = 4
        argv = 0x55cb67865830
        handler = 0x55cb651db460 
        cmd = 
        matched_delim = 0 '\000'
        containers = 0x0
        __PRETTY_FUNCTION__ = "execute_command"
        exec = 0x55cb6709ef30 "output * dpms on"
        head = 0x0
        res_list = 0x55cb67460290
#13 0x000055cb6517c4ac in ipc_client_handle_command (payload_type=, payload_length=, client=0x55cb678660b0)
    at ../sway/sway/ipc-server.c:647
        line = 
        res_list = 
        json = 
        length = 
        buf = 0x55cb673149c0 "output * dpms on"
        __PRETTY_FUNCTION__ = "ipc_client_handle_command"
        __PRETTY_FUNCTION__ = "ipc_client_handle_command"
#14 ipc_client_handle_command (client=0x55cb678660b0, payload_length=16, payload_type=) at ../sway/sway/ipc-server.c:609
        __PRETTY_FUNCTION__ = "ipc_client_handle_command"
#15 0x000055cb6517d1b9 in ipc_client_handle_readable (mask=, data=0x55cb678660b0, client_fd=159) at ../sway/sway/ipc-server.c:269
        pending_length = 
        pending_type = 
        read_available = 30
        buf = "i3-ipc\020\000\000\000\000\000\000"
        buf32 = 0x7ffdcbd7afd0
        received = 
        client = 0x55cb678660b0
#16 ipc_client_handle_readable (client_fd=159, mask=, data=0x55cb678660b0) at ../sway/sway/ipc-server.c:205
        client = 0x55cb678660b0
#17 0x00007fc21f371faa in wl_event_loop_dispatch () at /usr/lib/libwayland-server.so.0
#18 0x00007fc21f3704e7 in wl_display_run () at /usr/lib/libwayland-server.so.0
#19 0x000055cb6517259d in main (argc=3, argv=0x7ffdcbd7b398) at ../sway/sway/main.c:409
        verbose = 0
        debug = 1
        validate = 0
        allow_unsupported_gpu = 1
        long_options = 
            {{name = 0x55cb651be4bb "help", has_arg = 0, flag = 0x0, val = 104}, {name = 0x55cb651c1c29 "config", has_arg = 1, flag = 0x0, val = 99}, {name = 0x55cb651be4c0 "validate", has_arg = 0, flag = 0x0, val = 67}, {name = 0x55cb651be4c9 "debug", has_arg = 0, flag = 0x0, val = 100}, {name = 0x55cb651be41f "version", has_arg = 0, flag = 0x0, val = 118}, {name = 0x55cb651bd59d "verbose", has_arg = 0, flag = 0x0, val = 86}, {name = 0x55cb651be4cf "get-socketpath", has_arg = 0, flag = 0x0, val = 112}, {name = 0x55cb651be4de "unsupported-gpu", has_arg = 0, flag = 0x0, val = 117}, {name = 0x55cb651be4ee "my-next-gpu-wont-be-nvidia", has_arg = 0, flag = 0x0, val = 117}, {name = 0x0, has_arg = 0, flag = 0x0, val = 0}}
        config_path = 0x0
        usage = 0x55cb651be860 "Usage: sway [options] [command]\n\n  -h, --help", ' ' , "Show help message and quit.\n  -c, --config   Specify a config file.\n  -C, --validate         Check the validity of the config file, th"...
        c = 

Emantor commented 4 years ago

This may be a regression in the Mesa iris driver. Can you try with export MESA_LOADER_DRIVER_OVERRIDE=i965 to use the older i965 driver? Edit: Also, which kernel version are you using?

emersion commented 4 years ago

If your GPU isn't very old, you can try with i965.

emersion commented 4 years ago

Hmm, it seems like wlr_egl_make_current is called with an invalid wlr_egl pointer...

jsravn commented 4 years ago

@Emantor Linux thebest 5.6.2-arch1-2 #1 SMP PREEMPT Sun, 05 Apr 2020 05:13:14 +0000 x86_64 GNU/Linux. I'll try with the i915 driver.

jsravn commented 4 years ago

Unfortunately that did not help. I was able to get it to crash after two tries. Verified I'm using i915:

glxinfo | grep Intel
    Vendor: Intel Open Source Technology Center (0x8086)

Emantor commented 4 years ago

Can you try with AddressSanitizer (meson build/ -Db_sanitize=address)? That way we may be able to detect a race and tell why wlr_egl is an invalid pointer.

jsravn commented 4 years ago

Think I got it... does this help?

Apr 09 17:31:00 thebest sway[27255]: 00:07:25.613 [backend/drm/drm.c:692] Starting renderer on output 'DP-1'
Apr 09 17:31:00 thebest sway[27255]: AddressSanitizer:DEADLYSIGNAL
Apr 09 17:31:00 thebest sway[27255]: =================================================================
Apr 09 17:31:00 thebest sway[27255]: ==27255==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000028 (pc 0x7f7d7113d3dd bp 0x7ffec01052a0 sp 0x7ffec0105200 T0)
Apr 09 17:31:00 thebest sway[27255]: ==27255==The signal is caused by a READ memory access.
Apr 09 17:31:00 thebest sway[27255]: ==27255==Hint: address points to the zero page.
Apr 09 17:31:00 thebest sway[27255]:     #0 0x7f7d7113d3dc in wlr_egl_make_current ../render/egl.c:379
Apr 09 17:31:00 thebest sway[27255]:     #1 0x7f7d71157432 in get_drm_surface_front ../backend/drm/renderer.c:165
Apr 09 17:31:00 thebest sway[27255]:     #2 0x7f7d7114c031 in drm_connector_pageflip_renderer ../backend/drm/drm.c:681
Apr 09 17:31:00 thebest sway[27255]:     #3 0x7f7d7114c520 in drm_connector_start_renderer ../backend/drm/drm.c:695
Apr 09 17:31:00 thebest sway[27255]:     #4 0x7f7d71150132 in enable_drm_connector ../backend/drm/drm.c:807
Apr 09 17:31:00 thebest sway[27255]:     #5 0x7f7d71151019 in drm_connector_commit ../backend/drm/drm.c:554
Apr 09 17:31:00 thebest sway[27255]:     #6 0x7f7d711c9292 in wlr_output_commit ../types/wlr_output.c:552
Apr 09 17:31:00 thebest sway[27255]:     #7 0x55aa07b4f14d in apply_output_config ../sway/sway/config/output.c:419
Apr 09 17:31:00 thebest sway[27255]:     #8 0x55aa07b4ff32 in apply_output_config_to_outputs ../sway/sway/config/output.c:624
Apr 09 17:31:00 thebest sway[27255]:     #9 0x55aa07b67f8c in cmd_output ../sway/sway/commands/output.c:108
Apr 09 17:31:00 thebest sway[27255]:     #10 0x55aa07af39e5 in execute_command ../sway/sway/commands.c:286
Apr 09 17:31:00 thebest sway[27255]:     #11 0x55aa07b07c3c in ipc_client_handle_command ../sway/sway/ipc-server.c:647
Apr 09 17:31:00 thebest sway[27255]:     #12 0x55aa07b07c3c in ipc_client_handle_command ../sway/sway/ipc-server.c:609
Apr 09 17:31:00 thebest sway[27255]:     #13 0x55aa07b098a3 in ipc_client_handle_readable ../sway/sway/ipc-server.c:269
Apr 09 17:31:00 thebest sway[27255]:     #14 0x7f7d71298fa9 in wl_event_loop_dispatch (/usr/lib/libwayland-server.so.0+0xafa9)
Apr 09 17:31:00 thebest sway[27255]:     #15 0x7f7d712974e6 in wl_display_run (/usr/lib/libwayland-server.so.0+0x94e6)
Apr 09 17:31:00 thebest sway[27255]:     #16 0x55aa07aef35d in main ../sway/sway/main.c:409
Apr 09 17:31:00 thebest sway[27255]:     #17 0x7f7d70ebb022 in __libc_start_main (/usr/lib/libc.so.6+0x27022)
Apr 09 17:31:00 thebest sway[27255]:     #18 0x55aa07af197d in _start (/usr/bin/sway+0x4197d)
Apr 09 17:31:00 thebest sway[27255]: AddressSanitizer can not provide additional info.
Apr 09 17:31:00 thebest sway[27255]: SUMMARY: AddressSanitizer: SEGV ../render/egl.c:379 in wlr_egl_make_current
Apr 09 17:31:00 thebest sway[27255]: ==27255==ABORTING

Emantor commented 4 years ago

Unfortunately not. This is the same backtrace as above, so we didn't gain any new data. If this had been a use-after-free, the sanitizer would have reported where the object was allocated and freed.

jsravn commented 4 years ago

I didn't think there was much in there. Let me know if there is any further debugging info I can try to gather.

zlandau commented 4 years ago

I'm seeing the same crashes with an entirely different video card (Radeon HD 6500/6600), on sway version 1.4.

I'm not sure if I have more debug info that could be helpful, my stack trace is less detailed (it's not a debug build).

zlandau commented 4 years ago

I got this crash again, and poked around a bit. I don't have a whole lot to add, but it seems that the wlr_drm_renderer pointer is 0. egl is 0x10, presumably because that is just the offset of the egl field from address 0. So it seems like the renderer isn't a random pointer, but a NULL pointer.

I'm not familiar with this code, so I'm not sure in what situations the renderer would be 0. I can try to take a look, but maybe that rings a bell for someone else.

Qubasa commented 4 years ago

I have the same issue with a Ryzen 5 and integrated graphics

zlandau commented 4 years ago

Interesting, so I've been running with a debug build lately, and got a slightly different stack trace. I can't say for sure it's caused by the same issue, but it's related to the surface code:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f896eb0e857 in __GI_abort () at abort.c:79
#2  0x00007f896eb0e727 in __assert_fail_base (fmt=0x7f896ec780a8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x7f896e4bcf1e "false", file=0x7f896e4bd910 "types/xdg_shell/wlr_xdg_surface.c", line=360, function=<optimized out>) at assert.c:92
#3  0x00007f896eb1d426 in __GI___assert_fail (assertion=assertion@entry=0x7f896e4bcf1e "false", file=file@entry=0x7f896e4bd910 "types/xdg_shell/wlr_xdg_surface.c", line=line@entry=360, function=function@entry=0x7f896e4bdad0 <__PRETTY_FUNCTION__.7908> "handle_xdg_surface_commit") at assert.c:101
#4  0x00007f896e48f70c in handle_xdg_surface_commit (wlr_surface=<optimized out>) at ../wlroots-0.10.1/types/xdg_shell/wlr_xdg_surface.c:360
#5  0x00007f896e4a61aa in surface_commit_pending (surface=surface@entry=0x557b9ff23830) at ../wlroots-0.10.1/types/wlr_surface.c:370
#6  0x00007f896e4a64b8 in surface_commit (client=<optimized out>, resource=<optimized out>) at ../wlroots-0.10.1/types/wlr_surface.c:442
#7  0x00007f896df43a8d in  () at /usr/lib/libffi.so.7
#8  0x00007f896df4301b in  () at /usr/lib/libffi.so.7
#9  0x00007f896e4f7252 in wl_closure_invoke (closure=0x557b9fb40e50, flags=2, target=<optimized out>, opcode=6, data=<optimized out>) at ../wayland-1.18.0/src/connection.c:1018
#10 0x00007f896e4f3f83 in wl_client_connection_data (fd=<optimized out>, mask=<optimized out>, data=0x557b9f8ab980) at ../wayland-1.18.0/src/wayland-server.c:432
#11 0x00007f896e4f4d6c in wl_event_source_fd_dispatch (source=<optimized out>, ep=<optimized out>) at ../wayland-1.18.0/src/event-loop.c:112
#12 0x00007f896e4f5c7e in wl_event_loop_dispatch (loop=0x557b9eca1b30, timeout=timeout@entry=-1) at ../wayland-1.18.0/src/event-loop.c:1027
#13 0x00007f896e4f4178 in wl_display_run (display=0x557b9eca81f0) at ../wayland-1.18.0/src/wayland-server.c:1351
#14 0x0000557b9d5daa96 in server_run (server=0x557b9d604e40 <server>) at ../sway-1.4/sway/server.c:209
#15 0x0000557b9d5e5bec in main (argc=1, argv=0x7fff4206a898) at ../sway-1.4/sway/main.c:403

The assert was from line 360:

355         surface->geometry.height = surface->next_geometry.height;
356     }
357 
358     switch (surface->role) {
359     case WLR_XDG_SURFACE_ROLE_NONE:
360         assert(false);
361     case WLR_XDG_SURFACE_ROLE_TOPLEVEL:
362         handle_xdg_surface_toplevel_committed(surface);
363         break;
364     case WLR_XDG_SURFACE_ROLE_POPUP:

zlandau commented 4 years ago

Sorry to keep piling on here, I'm not sure if it helps. But a couple other details:

One is that I noticed that it's not just wlr_drm_renderer that is set to 0: all of the fields in wlr_drm_surface are 0. Given that finish_drm_surface and init_drm_surface (on failure) call memset(surf, 0, sizeof(*surf)), this makes me suspect that either something didn't check the return code of init_drm_surface, or that finish_drm_surface was called and something tried to use the surface afterwards.

The other thing is that I realized I never tried running the latest swaywm and wlroots versions in git, which was a bit silly. I'm doing that now, and so far I haven't had the same issue. But it looks like the code was restructured a fair amount, so I'm not sure if the issue was fixed, or if it's just not being triggered the same way.

If I can still manage to trigger it, I'll add some extra logging around the init and finish code.

DawidLoubser commented 4 years ago

@zlandau That's great advice, I'm also going to switch to the git versions and see if my situation improves. I'm on a Thinkpad P50 (Intel + Nouveau hybrid graphics) and I have the exact same problem. I don't have any details to add beyond what @zlandau has already posted.

Thanks everybody for contributing towards this lovely project!

travankor commented 4 years ago

Run the -git version of swayidle too, since I think it's involved in the problem.

jsravn commented 4 years ago

I tried the git version and still had the crashes. So unless it's been fixed recently, it will still happen. I suspect it is related to having multiple GPUs: when I disabled my internal GPU completely and swapped to a dedicated AMD card, the crashes went away.