stumpwm / mahogany

A stumpwm like Wayland compositor
GNU General Public License v2.0

Mahogany fails to run on FreeBSD with the GLES2 backend due to division by zero #45

Closed colfrog closed 8 months ago

colfrog commented 8 months ago

The WM runs fine under X11 when the pixman renderer is forced, but with GLES2 it fails in run-server, specifically in heart's hrt_server_start. It initializes its state correctly, uses amdgpu, and reaches "Initialized heart state". The last wlroots log line is: 00:00:00.138 [render/gles2/renderer.c:149] Created GL FBO for buffer 1024x768.

Then I get the division by zero error in hrt-server-start. The first stack trace:

Unhandled DIVISION-BY-ZERO in thread #<SB-THREAD:THREAD "main thread" RUNNING
                                        {1002CC8003}>:
  arithmetic error DIVISION-BY-ZERO signalled

Backtrace for: #<SB-THREAD:THREAD "main thread" RUNNING {1002CC8003}>
0: ("bogus stack frame")
1: ("foreign function: radeon_drm_winsys_create")
2: ("foreign function: radeon_drm_winsys_create")
3: ("foreign function: __driDriverGetExtensions_zink")
4: ("foreign function: __driDriverGetExtensions_zink")
5: ("foreign function: radeon_drm_winsys_create")
6: ("foreign function: __driDriverGetExtensions_zink")
7: ("foreign function: __driDriverGetExtensions_zink")
8: ("foreign function: __driDriverGetExtensions_zink")
9: ("foreign function: __driDriverGetExtensions_zink")
10: ("foreign function: __driDriverGetExtensions_zink")
11: ("foreign function: wlr_gles2_renderer_get_current_fbo")
12: ("foreign function: wlr_renderer_destroy")
13: ("foreign function: wlr_output_init_render")
14: ("foreign function: wlr_output_commit")
15: ("foreign function: hrt_output_init")
16: ("foreign function: wl_signal_emit_mutable")
17: ("foreign function: wlr_output_send_frame")
18: ("foreign function: wlr_output_update_needs_frame")
19: ("foreign function: wl_event_loop_dispatch")
20: ("foreign function: wl_display_run")
21: ("foreign function: hrt_server_start")
22: (MAHOGANY/CORE:HRT-SERVER-START :INVALID-VALUE-FOR-UNESCAPED-REGISTER-STORAGE)
23: (MAHOGANY::RUN-SERVER)
24: ((LAMBDA NIL :IN UIOP/IMAGE:RESTORE-IMAGE))
25: (UIOP/IMAGE:CALL-WITH-FATAL-CONDITION-HANDLER #<FUNCTION (LAMBDA NIL :IN UIOP/IMAGE:RESTORE-IMAGE) {1002CAA46B}>)
26: ((FLET SB-UNIX::BODY :IN SB-IMPL::START-LISP))
27: ((FLET "WITHOUT-INTERRUPTS-BODY-3" :IN SB-IMPL::START-LISP))
28: (SB-IMPL::%START-LISP)

Then it fails again in hrt-server-finish. The second stack trace:

Unhandled DIVISION-BY-ZERO in thread #<SB-THREAD:THREAD "main thread" RUNNING
                                        {1002CC8003}>:
  arithmetic error DIVISION-BY-ZERO signalled

Backtrace for: #<SB-THREAD:THREAD "main thread" RUNNING {1002CC8003}>
0: ("bogus stack frame")
1: ("foreign function: radeon_drm_winsys_create")
2: ("foreign function: radeon_drm_winsys_create")
3: ("foreign function: __driDriverGetExtensions_zink")
4: ("foreign function: __driDriverGetExtensions_zink")
5: ("foreign function: radeon_drm_winsys_create")
6: ("foreign function: __driDriverGetExtensions_zink")
7: ("foreign function: __driDriverGetExtensions_zink")
8: ("foreign function: __driDriverGetExtensions_zink")
9: ("foreign function: __driDriverGetExtensions_zink")
10: ("foreign function: __driDriverGetExtensions_zink")
11: ("foreign function: wlr_gles2_renderer_get_current_fbo")
12: ("foreign function: wlr_renderer_destroy")
13: ("foreign function: wlr_output_init_render")
14: ("foreign function: wlr_output_destroy")
15: ("foreign function: wlr_x11_backend_create")
16: ("foreign function: wlr_backend_destroy")
17: ("foreign function: wlr_multi_for_each_backend")
18: ("foreign function: wlr_multi_backend_create")
19: ("foreign function: wl_display_destroy")
20: ("foreign function: hrt_server_finish")
21: (MAHOGANY/CORE:HRT-SERVER-FINISH #.(SB-SYS:INT-SAP #X2B149F207E38))
22: ((FLET "CLEANUP-FUN-47" :IN MAHOGANY::RUN-SERVER)) [cleanup]
23: (MAHOGANY::RUN-SERVER)
24: ((LAMBDA NIL :IN UIOP/IMAGE:RESTORE-IMAGE))
25: (UIOP/IMAGE:CALL-WITH-FATAL-CONDITION-HANDLER #<FUNCTION (LAMBDA NIL :IN UIOP/IMAGE:RESTORE-IMAGE) {1002CAA46B}>)
26: ((FLET SB-UNIX::BODY :IN SB-IMPL::START-LISP))
27: ((FLET "WITHOUT-INTERRUPTS-BODY-3" :IN SB-IMPL::START-LISP))
28: (SB-IMPL::%START-LISP)

What's weird to me is that it calls radeon_drm_winsys_create, while I'm on an amdgpu card. It should be calling amdgpu_winsys_create instead. But EGL was initialized for AMDGPU.

I think that this is an issue with Mahogany because sway runs just fine on X11 with the GLES2 renderer.

This seems to be an issue with the hrt_server in heart and the way it initializes the renderer and DRM.

sdilts commented 8 months ago

Does it work if you turn off SBCL's floating point exception handling? Add #+sbcl (sb-int:set-floating-point-modes :traps nil) as the first form of the run-server function in main.lisp:

(defun run-server ()
  ;; Turn off SBCL's floating point traps before calling into foreign code.
  #+sbcl
  (sb-int:set-floating-point-modes :traps nil)
  ;; ... rest of the code ...
  )

I've actually run into this problem before: https://github.com/swaywm/wlroots/issues/1170. SBCL enables stricter floating point error handling than programs normally run with, which triggers exceptions in code that runs fine in other programs. We will need to double-check our code to make sure it isn't doing anything to cause this, but my guess is that it is an "issue" with the radeon driver or wlroots.

You can check for certain by running sway, tinywl, or another wlroots-based compositor under gdb with floating point exception handling turned on. If the exception occurs, it's probably not something we are doing.

We will probably run into this again, so just turning off FPE traps is probably the way to go.

colfrog commented 8 months ago

Disabling traps on floating point exceptions works! I can run mahogany with the GLES2 renderer on X11 without issues.

However, running Sway under LLDB with pass disabled for SIGFPE still works, so there is likely an issue with the hrt_server, but I can't begin to understand what it is.

Running mahogany with LLDB (and GDB) produces segmentation faults (Invalid permissions for mapped object). What is your recommended way to debug heart?

sdilts commented 8 months ago

I've honestly not had to debug an issue like this; last time, I just turned the signal trap off and left it as is.

I just remembered that there is an example file at https://github.com/stumpwm/mahogany/blob/master/heart/example/main.c that does the bare minimum initialization (executable should be at build/heart/example/). You could try using that to narrow down the problem, but I bet it won't trigger the exception and you'll have to manually walk through stack frames. If it does work, I'd just attribute the issue to how CL is more strict about floating point behavior than C.

For regular Lisp code, I run mahogany interactively and use the debugger built into SBCL. Examining the stack frames might give you more information, but I don't think SBCL can read C debug information. If you wanted to try something completely different, Clasp is supposed to play really nicely with gdb, but I haven't tried it myself and don't know if it runs under BSD.

colfrog commented 8 months ago

The heart example runs without issues even when SIGFPE is set to nopass. I think #46 is the right approach to solving this.

This means that Lisp is more sensitive to floating point exceptions even when C programs don't emit a signal.

sdilts commented 8 months ago

Closed via #46.