Open jsravn opened 4 years ago
This may be a regression in the mesa iris driver. Can you try with export MESA_LOADER_DRIVER_OVERRIDE=i965 to use the old i965 driver?
Edit: Also which kernel version are you using?
If your GPU isn't very old, you can try with i965.
Hmm, it seems like wlr_egl_make_current is called with an invalid wlr_egl pointer...
@Emantor Linux thebest 5.6.2-arch1-2 #1 SMP PREEMPT Sun, 05 Apr 2020 05:13:14 +0000 x86_64 GNU/Linux. I'll try with the i915 driver.
Unfortunately that did not help. I was able to get it to crash after two tries. Verified I'm using i915:
glxinfo | grep Intel
Vendor: Intel Open Source Technology Center (0x8086)
Can you try with address sanitizer (meson build/ -Db_sanitize=address)? That way we may be able to detect a race and tell why wlr_egl is an invalid pointer.
Think I got it... does this help?
Apr 09 17:31:00 thebest sway[27255]: 00:07:25.613 [backend/drm/drm.c:692] Starting renderer on output 'DP-1'
Apr 09 17:31:00 thebest sway[27255]: AddressSanitizer:DEADLYSIGNAL
Apr 09 17:31:00 thebest sway[27255]: =================================================================
Apr 09 17:31:00 thebest sway[27255]: ==27255==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000028 (pc 0x7f7d7113d3dd bp 0x7ffec01052a0 sp 0x7ffec0105200 T0)
Apr 09 17:31:00 thebest sway[27255]: ==27255==The signal is caused by a READ memory access.
Apr 09 17:31:00 thebest sway[27255]: ==27255==Hint: address points to the zero page.
Apr 09 17:31:00 thebest sway[27255]: #0 0x7f7d7113d3dc in wlr_egl_make_current ../render/egl.c:379
Apr 09 17:31:00 thebest sway[27255]: #1 0x7f7d71157432 in get_drm_surface_front ../backend/drm/renderer.c:165
Apr 09 17:31:00 thebest sway[27255]: #2 0x7f7d7114c031 in drm_connector_pageflip_renderer ../backend/drm/drm.c:681
Apr 09 17:31:00 thebest sway[27255]: #3 0x7f7d7114c520 in drm_connector_start_renderer ../backend/drm/drm.c:695
Apr 09 17:31:00 thebest sway[27255]: #4 0x7f7d71150132 in enable_drm_connector ../backend/drm/drm.c:807
Apr 09 17:31:00 thebest sway[27255]: #5 0x7f7d71151019 in drm_connector_commit ../backend/drm/drm.c:554
Apr 09 17:31:00 thebest sway[27255]: #6 0x7f7d711c9292 in wlr_output_commit ../types/wlr_output.c:552
Apr 09 17:31:00 thebest sway[27255]: #7 0x55aa07b4f14d in apply_output_config ../sway/sway/config/output.c:419
Apr 09 17:31:00 thebest sway[27255]: #8 0x55aa07b4ff32 in apply_output_config_to_outputs ../sway/sway/config/output.c:624
Apr 09 17:31:00 thebest sway[27255]: #9 0x55aa07b67f8c in cmd_output ../sway/sway/commands/output.c:108
Apr 09 17:31:00 thebest sway[27255]: #10 0x55aa07af39e5 in execute_command ../sway/sway/commands.c:286
Apr 09 17:31:00 thebest sway[27255]: #11 0x55aa07b07c3c in ipc_client_handle_command ../sway/sway/ipc-server.c:647
Apr 09 17:31:00 thebest sway[27255]: #12 0x55aa07b07c3c in ipc_client_handle_command ../sway/sway/ipc-server.c:609
Apr 09 17:31:00 thebest sway[27255]: #13 0x55aa07b098a3 in ipc_client_handle_readable ../sway/sway/ipc-server.c:269
Apr 09 17:31:00 thebest sway[27255]: #14 0x7f7d71298fa9 in wl_event_loop_dispatch (/usr/lib/libwayland-server.so.0+0xafa9)
Apr 09 17:31:00 thebest sway[27255]: #15 0x7f7d712974e6 in wl_display_run (/usr/lib/libwayland-server.so.0+0x94e6)
Apr 09 17:31:00 thebest sway[27255]: #16 0x55aa07aef35d in main ../sway/sway/main.c:409
Apr 09 17:31:00 thebest sway[27255]: #17 0x7f7d70ebb022 in __libc_start_main (/usr/lib/libc.so.6+0x27022)
Apr 09 17:31:00 thebest sway[27255]: #18 0x55aa07af197d in _start (/usr/bin/sway+0x4197d)
Apr 09 17:31:00 thebest sway[27255]: AddressSanitizer can not provide additional info.
Apr 09 17:31:00 thebest sway[27255]: SUMMARY: AddressSanitizer: SEGV ../render/egl.c:379 in wlr_egl_make_current
Apr 09 17:31:00 thebest sway[27255]: ==27255==ABORTING
Unfortunately not. This is the same backtrace as above, so we didn't gain any new data. If this had been a use-after-free, the sanitizer would have reported where the object was allocated and freed.
I didn't think there was much in there. Let me know if there is any further debugging info I can try to gather.
I'm seeing the same crashes, with an entirely different video card (Radeon HD 6500/6600) (sway version 1.4)
I'm not sure if I have more debug info that could be helpful, my stack trace is less detailed (it's not a debug build).
I got this crash again and poked around a bit. I don't have a whole lot to add, but it seems that wlr_drm_renderer is 0. egl is 0x10, which is presumably just its field offset from a NULL base. So it seems like the renderer isn't a random pointer, but a NULL pointer.
I'm not familiar with this code, so I'm not sure in what situations the renderer would be 0. I can try to take a look, but maybe that rings a bell for someone else.
I have the same issue with a Ryzen 5 and integrated graphics
Interesting, so I've been running with a debug build lately, and got a slightly different stack trace. I can't say for sure it's caused by the same issue, but it's related to the surface code:
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007f896eb0e857 in __GI_abort () at abort.c:79
#2 0x00007f896eb0e727 in __assert_fail_base (fmt=0x7f896ec780a8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x7f896e4bcf1e "false", file=0x7f896e4bd910 "types/xdg_shell/wlr_xdg_surface.c", line=360, function=<optimized out>) at assert.c:92
#3 0x00007f896eb1d426 in __GI___assert_fail (assertion=assertion@entry=0x7f896e4bcf1e "false", file=file@entry=0x7f896e4bd910 "types/xdg_shell/wlr_xdg_surface.c", line=line@entry=360, function=function@entry=0x7f896e4bdad0 <__PRETTY_FUNCTION__.7908> "handle_xdg_surface_commit") at assert.c:101
#4 0x00007f896e48f70c in handle_xdg_surface_commit (wlr_surface=<optimized out>) at ../wlroots-0.10.1/types/xdg_shell/wlr_xdg_surface.c:360
#5 0x00007f896e4a61aa in surface_commit_pending (surface=surface@entry=0x557b9ff23830) at ../wlroots-0.10.1/types/wlr_surface.c:370
#6 0x00007f896e4a64b8 in surface_commit (client=<optimized out>, resource=<optimized out>) at ../wlroots-0.10.1/types/wlr_surface.c:442
#7 0x00007f896df43a8d in () at /usr/lib/libffi.so.7
#8 0x00007f896df4301b in () at /usr/lib/libffi.so.7
#9 0x00007f896e4f7252 in wl_closure_invoke (closure=0x557b9fb40e50, flags=2, target=<optimized out>, opcode=6, data=<optimized out>) at ../wayland-1.18.0/src/connection.c:1018
#10 0x00007f896e4f3f83 in wl_client_connection_data (fd=<optimized out>, mask=<optimized out>, data=0x557b9f8ab980) at ../wayland-1.18.0/src/wayland-server.c:432
#11 0x00007f896e4f4d6c in wl_event_source_fd_dispatch (source=<optimized out>, ep=<optimized out>) at ../wayland-1.18.0/src/event-loop.c:112
#12 0x00007f896e4f5c7e in wl_event_loop_dispatch (loop=0x557b9eca1b30, timeout=timeout@entry=-1) at ../wayland-1.18.0/src/event-loop.c:1027
#13 0x00007f896e4f4178 in wl_display_run (display=0x557b9eca81f0) at ../wayland-1.18.0/src/wayland-server.c:1351
#14 0x0000557b9d5daa96 in server_run (server=0x557b9d604e40 <server>) at ../sway-1.4/sway/server.c:209
#15 0x0000557b9d5e5bec in main (argc=1, argv=0x7fff4206a898) at ../sway-1.4/sway/main.c:403
The assert was from line 360:
355 surface->geometry.height = surface->next_geometry.height;
356 }
357
358 switch (surface->role) {
359 case WLR_XDG_SURFACE_ROLE_NONE:
360 assert(false);
361 case WLR_XDG_SURFACE_ROLE_TOPLEVEL:
362 handle_xdg_surface_toplevel_committed(surface);
363 break;
364 case WLR_XDG_SURFACE_ROLE_POPUP:
Sorry to keep piling on here, I'm not sure if it helps. But a couple other details:
One is that I noticed that it's not just wlr_drm_renderer that is set to 0; all of the fields in wlr_drm_surface are 0. Given that finish_drm_surface and init_drm_surface (on failure) call memset(surf, 0, sizeof(*surf)), this makes me suspect that either something didn't check the return code of init_drm_surface, or that finish_drm_surface was called and something tried to use the surface afterwards.
The other thing is that I realized I never tried running the latest swaywm and wlroots versions in git, which was a bit silly. I'm doing that now, and so far I haven't had the same issue. But it looks like the code was restructured a fair amount, so I'm not sure if the issue was fixed, or if it's just not being triggered the same way.
If I can still manage to trigger it, I'll add some extra logging around the init and finish code.
@zlandau That's great advice, I'm also going to switch to the git versions and see if my situation improves. I'm on a Thinkpad P50 (Intel + Nouveau hybrid graphics) and I have the exact same problem. I don't have any details to add beyond what @zlandau has already posted.
Thanks everybody for contributing towards this lovely project!
Run swayidle with -git too since I think it's involved with the problem.
I tried the git version and still had the crashes. So unless it's been fixed recently it will still happen. I suspect it is related to having multiple GPUs - when I disabled my internal gpu completely and swapped to a dedicated AMD card, the crashes went away.
sway version 1.4-3078f232 (Apr 6 2020, branch 'master')
I have swayidle configured to suspend the system after a while:
When it resumes, it sometimes crashes. From the backtrace it seems to be crashing when doing output * dpms on. Note this is a dual graphics machine, and I'm using Intel graphics (also setting WLR_DRM_DEVICES=/dev/dri/card0 just in case, based on some prior issues).