swaywm / sway

i3-compatible Wayland compositor
https://swaywm.org
MIT License

WLR_DRM_DEVICES - issues with nouveau + intel #5642

Open d4g opened 4 years ago

d4g commented 4 years ago

Please fill out the following:

emersion commented 4 years ago

I get the error 21:08:15 - [main.c:521] Missing a required Wayland interface

This error is coming from a Wayland client, not from Sway. Sway trips after:

00:00:01.124 [DEBUG] [backend/drm/drm.c:685] Initializing renderer on connector 'DP-1'

Are you sure there isn't any segfault happening? Can you check coredumpctl -r?

d4g commented 4 years ago

You are right, there is a core dump. I generated a new one.

emersion commented 4 years ago

We can't do a lot with a coredump, since we don't have the exact same executables and libraries as your system. Can you do coredumpctl gdb and then bt full?

d4g commented 4 years ago

(gdb) bt full
#0  0x00007f08a0021374 in ?? ()
No symbol table info available.
#1  0xffffffffffffffff in ?? ()
No symbol table info available.
#2  0xffffffffffffffff in ?? ()
No symbol table info available.
#3  0xffffffffffffffff in ?? ()
No symbol table info available.
#4  0xffffffffffffffff in ?? ()
No symbol table info available.
#5  0xffffffffffffffff in ?? ()
No symbol table info available.
#6  0xffffffffffffffff in ?? ()
No symbol table info available.
#7  0xffffffffffffffff in ?? ()
No symbol table info available.
#8  0xffffffffffffffff in ?? ()
No symbol table info available.
#9  0x3a360b6100000000 in ?? ()
No symbol table info available.
#10 0x0000000000000000 in ?? ()
No symbol table info available.

Program terminated with signal SIGBUS, Bus error.

martinetd commented 4 years ago

We'll need you to rebuild with debug info to get anything useful out of it; this isn't much better than the core dump :)

d4g commented 4 years ago

This will take some time. I will look into it.

emersion commented 4 years ago

Here are some instructions to compile from source: https://github.com/swaywm/sway/wiki/Development-Setup#compiling-as-a-subproject

d4g commented 4 years ago

So I built sway myself using nix. I set mesonBuildType = "debugoptimized" and the -Ddebug=true flag. The resulting binary is actually now correct: sway: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /nix/store/2pi6zgkwnr3zdslvlv16nixpzvbyjx1n-glibc-2.31/lib/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, with debug_info, not stripped

not stripped and with debug_info. However, I cannot get gdb to actually correctly display the symbols.

I also get the following message in gdb:

Reading symbols from /nix/store/1lmhq04w17ccn0g8hw15w3il6ybm7ydw-sway-unwrapped-1.5/bin/sway...

warning: core file may not match specified executable file.

I have no clue what goes wrong.
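
The `file` description quoted above can be checked mechanically. The following is only a minimal sketch: the helper name `debug_info_status` is made up for illustration, and it inspects nothing but the text that `file` prints for one binary, so it says nothing about the shared libraries that binary loads.

```python
def debug_info_status(file_output: str) -> dict:
    """Report what a `file` description says about symbols in an ELF binary."""
    # `file` prints "name: field, field, ..."; drop the name prefix first.
    description = file_output.split(":", 1)[-1]
    fields = [field.strip() for field in description.split(",")]
    return {
        # Exact field matches, so "not stripped" is not mistaken for "stripped".
        "has_debug_info": "with debug_info" in fields,
        "stripped": "stripped" in fields,
    }


# Description copied from the comment above.
sample = (
    "sway: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), "
    "dynamically linked, interpreter "
    "/nix/store/2pi6zgkwnr3zdslvlv16nixpzvbyjx1n-glibc-2.31/lib/ld-linux-x86-64.so.2, "
    "for GNU/Linux 2.6.32, with debug_info, not stripped"
)
print(debug_info_status(sample))
```

Running the same check on the description of each shared library would show whether those are stripped as well.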

d4g commented 4 years ago

The warning is because I specified a symlink instead of the binary in the launchscript. Using the binary itself results in the same issue without the warning.

d4g commented 4 years ago

I think it’s because the shared libs are stripped. I'll try to go down the rabbit hole.

martinetd commented 4 years ago

Actually hold it. I didn't take the time earlier but looking at the stack now it's full of garbage, it's not a missing symbols problem.

Can you rebuild with -Db_sanitize=address as well? Hopefully that will suffice. If not, try running with valgrind.
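
To illustrate what "full of garbage" means here, the sketch below flags backtrace frames whose address lies above the x86-64 user-space range, which no real caller in a user process can have. It assumes 4-level paging (47-bit user addresses), and the names `USER_SPACE_MAX` and `implausible_frames` are invented for this example.

```python
import re

# Upper bound of user-space virtual addresses on x86-64, assuming 4-level paging.
USER_SPACE_MAX = 0x0000_7FFF_FFFF_FFFF

# Matches gdb frame lines such as "#1  0xffffffffffffffff in ?? ()".
FRAME_RE = re.compile(r"^#(\d+)\s+(0x[0-9a-fA-F]+) in ")


def implausible_frames(backtrace: str) -> list[int]:
    """Return frame numbers whose address cannot be user-space code."""
    bad = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if not match:
            continue
        number = int(match.group(1))
        address = int(match.group(2), 16)
        if address > USER_SPACE_MAX:
            bad.append(number)
    return bad


# Frames copied from the backtrace earlier in this thread.
sample = """\
#0  0x00007f08a0021374 in ?? ()
#1  0xffffffffffffffff in ?? ()
#9  0x3a360b6100000000 in ?? ()
#10 0x0000000000000000 in ?? ()
"""
print(implausible_frames(sample))
```

Frame #0 is a plausible library address, while #1 and #9 are not addresses a user process could return to, which points at a corrupted stack more than at missing symbols.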

d4g commented 4 years ago

I did as you suggested @martinetd . Now, it still dumps but coredumpctl says that there is no core dump file created. Was this expected? Under the column “corefile” it says none.

Emantor commented 4 years ago

The ASAN information is contained in the debug log.

d4g commented 4 years ago

Do you mean the regular sway debug log?

I created a new one with the new build: https://gist.github.com/d4g/1ebda4ce87d9e827cc00cff28f0cd5c2

I don't see much difference.

martinetd commented 4 years ago

It's normal that core dumps get disabled when ASan is turned on (dumps get really huge, so they disable it), but you should get something like what has been pasted on #5325, for example (==2335==ERROR: AddressSanitizer: etc etc). If you don't get anything, then it just wasn't a crash ASan caught; that can happen if the corruption is inside a lib or something that hasn't been instrumented.

In this case I would recommend going back to the previous rebuild (without ASan, with debug info) and just running sway through valgrind. It will be slower, but since you can reproduce almost immediately, it should work well enough for this case.

Thanks for going through all these hoops
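
A quick way to tell whether a debug log contains an ASan report at all is to search for that `==PID==ERROR: AddressSanitizer:` header. The sketch below does that. The function name `find_asan_reports` is hypothetical, and the ASan line in the sample is invented to show the shape of a report header; it is not output from this issue.

```python
import re

# ASan report headers look like "==2335==ERROR: AddressSanitizer: <bug-type> ...".
ASAN_HEADER_RE = re.compile(r"^==(\d+)==ERROR: AddressSanitizer: (\S+)")


def find_asan_reports(log_text: str) -> list[tuple[int, str]]:
    """Return (pid, bug type) for every ASan report header in a log."""
    reports = []
    for line in log_text.splitlines():
        match = ASAN_HEADER_RE.match(line)
        if match:
            reports.append((int(match.group(1)), match.group(2)))
    return reports


# First line is from this thread; the second is a made-up illustration.
sample = """\
00:00:01.124 [DEBUG] [backend/drm/drm.c:685] Initializing renderer on connector 'DP-1'
==2335==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x602000000014
"""
print(find_asan_reports(sample))
```

An empty result means ASan did not report anything, which matches the case described above where the corruption happens in code that was not instrumented.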

d4g commented 4 years ago

I don't see any useful output using the vgcore from valgrind with gdb:

gdb /nix/store/1lmhq04w17ccn0g8hw15w3il6ybm7ydw-sway-unwrapped-1.5/bin/sway vgcore.2386 
[...]
Reading symbols from /nix/store/1lmhq04w17ccn0g8hw15w3il6ybm7ydw-sway-unwrapped-1.5/bin/sway...
[...]
Core was generated by `'.
Program terminated with signal SIGBUS, Bus error.
#0  0x000000000efc2390 in ?? ()
[Current thread is 1 (Thread 0x1382c700 (LWP 2518))]
(gdb) bt full
#0  0x000000000efc2390 in ?? ()
No symbol table info available.
#1  0xffffffff00000000 in ?? ()
No symbol table info available.
#2  0x0000000000000000 in ?? ()
No symbol table info available.

martinetd commented 4 years ago

That's already after the stack has been messed up (the 0xffffffff00000000 really doesn't make any sense), there might be something in the valgrind report on stderr though?

... well, I was sure valgrind did memory bounds access checks, but I just ran a simple test program through it and it didn't catch a basic overflow, so I can probably go back to bed. Sorry for the bad suggestion. If ASan doesn't catch it, then I'm not sure what would help. You can try compiling with -fstack-protector-strong maybe? But I'm not sure it'll catch much more than ASan. If someone else has a better idea, feel free to suggest something else; I pass :(

d4g commented 4 years ago

Strace output directly before the coredump:

sysinfo({uptime=43, loads=[91072, 22816, 7648], totalram=33518514176, freeram=32192102400, sharedram=38318080, bufferram=618496, totalswap=34359734272, freeswap=34359734272, procs=349, totalhigh=0, freehigh=0, mem_unit=1}) = 0
brk(0x308e000)                          = 0x308e000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1bf42ed000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1bf42ec000
mprotect(0x7f1bf42ed000, 4096, PROT_READ|PROT_EXEC) = 0
mprotect(0x7f1bf42ec000, 4096, PROT_READ|PROT_EXEC) = 0
mprotect(0x7f1bf42ec000, 4096, PROT_READ|PROT_EXEC) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1bf42eb000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1bf42ea000
mprotect(0x7f1bf42eb000, 4096, PROT_READ|PROT_EXEC) = 0
mprotect(0x7f1bf42ea000, 4096, PROT_READ|PROT_EXEC) = 0
mprotect(0x7f1bf42ea000, 4096, PROT_READ|PROT_EXEC) = 0
ioctl(15, DRM_IOCTL_MODE_MAP_DUMB, 0x7ffc25bd2fd0) = 0
mmap(NULL, 16777216, PROT_READ, MAP_SHARED, 15, 0x101035000) = 0x7f1b77000000
mmap(NULL, 10485760, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1b76600000
brk(0x30af000)                          = 0x30af000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1bf42e9000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1bf42e8000
mprotect(0x7f1bf42e9000, 4096, PROT_READ|PROT_EXEC) = 0
mprotect(0x7f1bf42e8000, 4096, PROT_READ|PROT_EXEC) = 0
mprotect(0x7f1bf42e8000, 4096, PROT_READ|PROT_EXEC) = 0
brk(0x30d4000)                          = 0x30d4000
getpid()                                = 2995
getpid()                                = 2995
futex(0x243c8b8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243c868, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243ca18, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243cb78, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243cb28, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243ccd8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243ce38, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243cde8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243cf98, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243d0f8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243d0a8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243d258, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243d3b8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243d368, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243d518, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243d678, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243d628, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243d7d8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243c918, FUTEX_WAIT_PRIVATE, 0, NULL) = ?
+++ killed by SIGBUS (core dumped) +++

d4g commented 4 years ago

If I start the binary with the -fstack-protector-strong parameter, I get the following output:

00:00:00.640 [sway/config/output.c:717] execvp failed: No such file or directory
WARNING: Kernel has no file descriptor comparison support: Function not implemented

d4g commented 4 years ago

Just for clarification: Running only on the iGPU still works, that is how I actually pasted the message.

Geo25rey commented 4 years ago

Does commenting out any output commands from your config file change the behavior?

I believe this has something to do with loading the background image.

d4g commented 4 years ago

@Geo25rey I changed it to the default config. The execvp message disappears, which is still weird. There is only one output line output * bg ~/pictures/wallapaper/XMG_GRID_Wallpaper_2017_IMAGE_4K.jpg stretch which also works if I start sway on the iGPU only.

The error message WARNING: Kernel has no file descriptor comparison support: Function not implemented stays. It still core dumps btw.

d4g commented 4 years ago

The part that we are not discussing right now is that if I start sway with the dGPU as the first entry in WLR_DRM_DEVICES, sway does indeed start up and kind of works, at least as long as I only use programs on the iGPU-connected display. But it lags when moving the mouse cursor from the iGPU display to the dGPU display. It also starts to take up all CPU resources as soon as I start opening windows on the dGPU-connected display. This does not make sense to me either. As the dGPU is handling all rendering in this scenario, why would it lag when output is generated on the display that is directly connected to the dGPU?

emersion commented 4 years ago

As the dGPU is handling all rendering in this scenario, why would it lag when output is generated on the display that is directly connected to the dGPU?

The rendering happens on the iGPU, but OpenGL (llvmpipe) is still used to perform buffer copies on the dGPU for scan-out.

d4g commented 4 years ago

@emersion I thought the 1st item in WLR_DRM_DEVICES does the rendering?

So WLR_DRM_DEVICES=$iGPU:$dGPU sway would render on the iGPU and use llvmpipe to copy the buffer to the dGPU. This is the scenario that’s dumping the core.

Meanwhile WLR_DRM_DEVICES=$dGPU:$iGPU sway should render on the dGPU and copy the buffer to the iGPU. This scenario has extreme performance issues when running programs on the display connected to the dGPU, while everything on the iGPU-connected display works.
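
The two scenarios can be written down as a small sketch, assuming the semantics described in this thread: the first entry of the colon-separated list renders, and displays on every other device need a buffer copy. The function name `drm_device_roles` and the device paths in the sample are placeholders; which card is the iGPU or the dGPU depends on the system.

```python
def drm_device_roles(value: str) -> dict:
    """Split a WLR_DRM_DEVICES value into the renderer and the copy targets."""
    devices = [device for device in value.split(":") if device]
    if not devices:
        return {"renderer": None, "copy_targets": []}
    # First entry renders; outputs on the rest receive copied buffers.
    return {"renderer": devices[0], "copy_targets": devices[1:]}


# Placeholder paths, standing in for $iGPU:$dGPU or $dGPU:$iGPU.
print(drm_device_roles("/dev/dri/card1:/dev/dri/card0"))
print(drm_device_roles("/dev/dri/card0:/dev/dri/card1"))
```

Swapping the order changes which GPU renders, but in both orders one device ends up as a copy target.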

emersion commented 4 years ago

Yes, but we still need to copy the rendered buffer to the scanout GPU.

emersion commented 4 years ago

Ideas to improve this: https://github.com/swaywm/wlroots/issues/1347

d4g commented 4 years ago

@emersion I still don’t get why this would cause the slowdown on the scenario 2 of my last comment. Can you explain, using this example, why and how the slowdown occurs only on the dGPU display if the dGPU does all the rendering?

d4g commented 4 years ago

Does anybody have any more ideas how to debug this?

emersion commented 4 years ago

I still don’t get why this would cause the slowdown on the scenario 2 of my last comment.

I don't know either.

Geo25rey commented 4 years ago

@d4g Can you describe your hardware configuration?

Laptop or desktop? Built in dedicated graphics or external dedicated graphics? If external, thunderbolt or pcie? CPU? If not custom built, make and model?

d4g commented 4 years ago

Laptop or desktop?

Laptop

Built in dedicated graphics or external dedicated graphics?

Built-in internal graphics with dGPU via NVIDIA Optimus. The Thunderbolt3/USB-C port is connected directly to the dgpu as well as the hdmi port

If external, thunderbolt or pcie?

nope

CPU?

Intel Core i7-9750H

If not custom built, make and model?

Intel-TongFang QC7 / XMG Fusion 15 with RTX2070 https://www.xmg.gg/en/xmg-fusion-15

Geo25rey commented 4 years ago

@d4g Maybe try using the switcheroo driver mentioned here

Geo25rey commented 4 years ago

The Thunderbolt3/USB-C port is connected directly to the dgpu as well as the hdmi port

@d4g What does this mean?

Geo25rey commented 4 years ago

What happens if you don't set the environment variable "WLR_DRM_DEVICES"?

d4g commented 4 years ago

The Thunderbolt3/USB-C port is connected directly to the dgpu as well as the hdmi port

@d4g What does this mean?

That both connectors for external displays are connected to the dGPU and the internal display is connected to the iGPU.

d4g commented 4 years ago

@d4g Maybe try using the switcheroo driver mentioned here

Sadly, this won’t help, as the laptop does not have a hardware mux and the external ports are connected to the dGPU.

Geo25rey commented 4 years ago

Perhaps the nouveau driver just isn't there yet. Have you opened an issue there?

d4g commented 4 years ago

What happens if you don't set the environment variable "WLR_DRM_DEVICES"?

Just tried. It behaves as if I specify the iGPU first.

I still don't know where this error message comes from:

WARNING: Kernel has no file descriptor comparison support: Function not implemented

d4g commented 4 years ago

Perhaps the nouveau driver just isn't there yet. Have you opened an issue there?

Not yet. Every time I look at their web presence, I get scared.

https://nouveau.freedesktop.org/wiki/ last had news posted in January 2019.

I think that might be the right place to submit a bug, after searching for 30 minutes: https://gitlab.freedesktop.org/drm/nouveau/-/issues

If that's the case, there are only 6 bugs and they have 0 comments. Is there another repo?

Geo25rey commented 4 years ago

The nouveau driver was last updated 28 days ago in the Linux kernel. The libdrm portion of the driver is included with the Mesa driver, last updated 4 months ago. And, although irrelevant, the xf86-video-nouveau driver was last updated 3 weeks ago. So, you shouldn't have anything to worry about activity-wise.

Geo25rey commented 4 years ago

Do you have a recent version of Mesa installed?

d4g commented 4 years ago

mesa-20.1.4 and libdrm-2.4.102

d4g commented 4 years ago

In which repo would you suggest to submit the issue?

Geo25rey commented 4 years ago

I would assume the Mesa repo I linked

d4g commented 4 years ago

But what should I post there? There is no dedicated nouveau error message. The only thing we see is that the stack of sway gets garbled.

Probably swaywm/wlroots#1347 would also eliminate the error on the sway side.

d4g commented 4 years ago

Submitted issue https://gitlab.freedesktop.org/mesa/mesa/-/issues/3486

Geo25rey commented 4 years ago

I'm not sure if that specific Mesa repo is what I had in mind, but it can't hurt. You probably also want to use this link which I've shared previously. You can probably just copy the issue you've created in the first Mesa repo as a starting point.

lovesegfault commented 3 years ago

I can reproduce this exact problem, but only since moving to kernel 5.11 and not with nouveau, but with amdgpu + i915. I have a laptop with an intel iGPU, and I connect an eGPU (AMD) via TB3.

If I set WLR_DRM_DEVICES=/dev/dri/card0 things work and I have video out through my eGPU. If I set it to /dev/dri/card1 things work and I have video out through my iGPU. If, however, I do not set it at all, or I set it to /dev/dri/card0:/dev/dri/card1, then I can reproduce this exact issue.

I see the [main.c:521] Missing a required Wayland interface error, and sway segfaults.

jakobrs commented 3 years ago

I can reproduce this using a Raspberry Pi 4 as a display adapter, using the gud driver (source).

This appears in journalctl:

Apr 10 11:59:53 growlithe kernel: llvmpipe-4[6344]: segfault at 0 ip 00007fc7e402a387 sp 00007fc710ff8520 error 4
Apr 10 11:59:53 growlithe kernel: Code: cd fe f3 c5 d5 fe fb c5 dd fe db c5 cd fe ed c5 cd fe e4 c4 c1 7d 6f 37 c4 62 75 00 e6 c4 e2 6d 00 d6 c5 f1 ef c9 c5 cd 76 f6 <c4> e2 4d 90 0c 3b c5 c5 76 ff c5 c9 ef f6 c4 e2 45 90 34 1b c5 e1

My primary GPU is an AMD RX 5500.
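
The kernel line above can be decoded field by field. The sketch below parses it and interprets the error code using the x86 page-fault bits (bit 0: page was present, bit 1: write access, bit 2: user mode, bit 4: instruction fetch). The function name `decode_segfault` and the dictionary keys are chosen for this example only.

```python
import re

# Format printed by the x86 kernel; address, ip, sp and error are hexadecimal.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<error>[0-9a-f]+)"
)


def decode_segfault(line: str) -> dict:
    """Parse a kernel 'segfault at' line and decode its page-fault error code."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        raise ValueError("not a kernel segfault line")
    error = int(match.group("error"), 16)
    return {
        "thread": match.group("comm"),
        "pid": int(match.group("pid")),
        "fault_address": int(match.group("addr"), 16),
        "access": "write" if error & 0x2 else "read",
        "page_present": bool(error & 0x1),
        "user_mode": bool(error & 0x4),
        "instruction_fetch": bool(error & 0x10),
    }


# Line copied from the journalctl output above.
sample = (
    "Apr 10 11:59:53 growlithe kernel: llvmpipe-4[6344]: segfault at 0 "
    "ip 00007fc7e402a387 sp 00007fc710ff8520 error 4"
)
print(decode_segfault(sample))
```

For this line the result is a user-mode read of address 0 on a page that is not mapped, in other words a NULL pointer read inside an llvmpipe worker thread.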