Open d4g opened 4 years ago
I get the error 21:08:15 - [main.c:521] Missing a required Wayland interface
This error is coming from a Wayland client, not from Sway. Sway trips after:
00:00:01.124 [DEBUG] [backend/drm/drm.c:685] Initializing renderer on connector 'DP-1'
Are you sure there isn't any segfault happening? Can you check coredumpctl -r?
You are right, there is a core dump. I generated a new one.
We can't do a lot with a coredump, since we don't have the exact same executables and libraries as your system. Can you do coredumpctl gdb and then bt full?
(gdb) bt full
#0 0x00007f08a0021374 in ?? ()
No symbol table info available.
#1 0xffffffffffffffff in ?? ()
No symbol table info available.
#2 0xffffffffffffffff in ?? ()
No symbol table info available.
#3 0xffffffffffffffff in ?? ()
No symbol table info available.
#4 0xffffffffffffffff in ?? ()
No symbol table info available.
#5 0xffffffffffffffff in ?? ()
No symbol table info available.
#6 0xffffffffffffffff in ?? ()
No symbol table info available.
#7 0xffffffffffffffff in ?? ()
No symbol table info available.
#8 0xffffffffffffffff in ?? ()
No symbol table info available.
#9 0x3a360b6100000000 in ?? ()
No symbol table info available.
#10 0x0000000000000000 in ?? ()
No symbol table info available.
Program terminated with signal SIGBUS, Bus error.
We'll need you to rebuild with debug info to get anything useful out of it; this isn't much better than the core dump :)
This will take some time. I will look into it.
Here are some instructions to compile from source: https://github.com/swaywm/sway/wiki/Development-Setup#compiling-as-a-subproject
So I built sway myself using nix. I set the mesonbuildtype="debugoptimized" and the Ddebug=true flag.
The resulting binary is actually now correct:
sway: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /nix/store/2pi6zgkwnr3zdslvlv16nixpzvbyjx1n-glibc-2.31/lib/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, with debug_info, not stripped
not stripped and with debug_info. However, I cannot get gdb to actually correctly display the symbols.
I also get the following message in gdb:
Reading symbols from /nix/store/1lmhq04w17ccn0g8hw15w3il6ybm7ydw-sway-unwrapped-1.5/bin/sway...
warning: core file may not match specified executable file.
I have no clue what goes wrong.
The warning is because I specified a symlink instead of the binary in the launchscript. Using the binary itself results in the same issue without the warning.
I think it’s because the shared libs are stripped. I'll try to go down the rabbit hole.
Actually hold it. I didn't take the time earlier but looking at the stack now it's full of garbage, it's not a missing symbols problem.
Can you rebuild with -Db_sanitize=address as well? Hopefully that should suffice. If not, try to run with valgrind.
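For anyone reading along, the "full of garbage" observation can be checked mechanically. The following is only a rough sketch, assuming x86-64 Linux with the usual 47-bit user address space; the function name and the limit constant are made up for illustration and are not part of gdb or sway. It flags frames whose program counter cannot point at user-space code.

```python
import re

# Matches gdb frame lines such as "#1 0xffffffffffffffff in ?? ()".
FRAME_RE = re.compile(r"^#(\d+)\s+(0x[0-9a-fA-F]+)\s+in\s+(\S+)")

# Assumption: x86-64 Linux with 4-level paging, where user-space
# addresses are below 2**47. Other setups need a different limit.
USER_SPACE_LIMIT = 1 << 47


def suspicious_frames(backtrace):
    """Return the frame numbers whose program counter cannot be user-space code."""
    bad = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if not match:
            continue  # skip "No symbol table info available." and similar lines
        number = int(match.group(1))
        pc = int(match.group(2), 16)
        if pc == 0 or pc >= USER_SPACE_LIMIT:
            bad.append(number)
    return bad


# A few frames copied from the backtrace posted earlier in this thread.
sample = """\
#0 0x00007f08a0021374 in ?? ()
#1 0xffffffffffffffff in ?? ()
#9 0x3a360b6100000000 in ?? ()
#10 0x0000000000000000 in ?? ()
"""

print(suspicious_frames(sample))
```

If most frames are flagged, as here, missing symbols are probably not the real problem and rebuilding more libraries with debug info is unlikely to help.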
I did as you suggested, @martinetd. It still dumps, but coredumpctl says that no core dump file was created. Was this expected? Under the column “corefile” it says none.
The ASAN information is contained in the debug log.
Do you mean the regular sway debug log?
I created a new one with the new build: https://gist.github.com/d4g/1ebda4ce87d9e827cc00cff28f0cd5c2
I don't see much difference.
It's normal that core dumps get disabled when asan is turned on (the dumps get really huge, so they are disabled), but you should get something like what was pasted in #5325, for example (==2335==ERROR: AddressSanitizer: etc.).
If you don't get anything, then it just wasn't a crash asan caught; that can happen if the corruption is inside a lib or something else that hasn't been instrumented.
In this case I would recommend going back to the previous rebuild (without asan, with debug infos) and just run sway through valgrind -- it will be slower but since you can reproduce almost immediately it should work well enough for this case.
Thanks for going through all these hoops
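A quick way to check a long debug log for the kind of ASan header described above is a small script. This is a minimal sketch: the function name is invented here, and the sample log text is hypothetical apart from the `==2335==ERROR: AddressSanitizer:` prefix quoted in this thread.

```python
import re

# ASan report headers look like "==<pid>==ERROR: AddressSanitizer: <kind> ...".
ASAN_RE = re.compile(r"==(\d+)==ERROR: AddressSanitizer: (\S+)")


def find_asan_reports(log_text):
    """Return (pid, error kind) for every ASan report header found in the log."""
    return [(int(pid), kind) for pid, kind in ASAN_RE.findall(log_text)]


# Hypothetical log excerpt: only the "==2335==ERROR: AddressSanitizer:" prefix
# comes from this thread; the error kind after it is made up for illustration.
sample_log = """\
00:00:01.124 [DEBUG] [backend/drm/drm.c:685] Initializing renderer on connector 'DP-1'
==2335==ERROR: AddressSanitizer: heap-use-after-free on address (elided)
"""

print(find_asan_reports(sample_log))
```

An empty list means ASan reported nothing in that log, which matches the case where the corruption happens in code that was not instrumented.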
I don't see any useful output using the vgcore from valgrind with gdb:
gdb /nix/store/1lmhq04w17ccn0g8hw15w3il6ybm7ydw-sway-unwrapped-1.5/bin/sway vgcore.2386
[...]
Reading symbols from /nix/store/1lmhq04w17ccn0g8hw15w3il6ybm7ydw-sway-unwrapped-1.5/bin/sway...
[...]
Core was generated by `'.
Program terminated with signal SIGBUS, Bus error.
#0 0x000000000efc2390 in ?? ()
[Current thread is 1 (Thread 0x1382c700 (LWP 2518))]
(gdb) bt full
#0 0x000000000efc2390 in ?? ()
No symbol table info available.
#1 0xffffffff00000000 in ?? ()
No symbol table info available.
#2 0x0000000000000000 in ?? ()
No symbol table info available.
That's already after the stack has been messed up (the 0xffffffff00000000 really doesn't make any sense), there might be something in the valgrind report on stderr though?
... well, I was sure valgrind did memory bounds access checks, but I just ran through a simple test program and it didn't catch a basic overflow so I can probably go back to bed. Sorry for the bad suggestion.
If asan doesn't catch it, then I'm not sure what would help. You can try compiling with -fstack-protector-strong maybe? But I'm not sure it'll catch much more than asan.
If someone else has a better idea feel free to suggest something else, I pass :(
Strace output directly before the coredump:
sysinfo({uptime=43, loads=[91072, 22816, 7648], totalram=33518514176, freeram=32192102400, sharedram=38318080, bufferram=618496, totalswap=34359734272, freeswap=34359734272, procs=349, totalhigh=0, freehigh=0, mem_unit=1}) = 0
brk(0x308e000) = 0x308e000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1bf42ed000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1bf42ec000
mprotect(0x7f1bf42ed000, 4096, PROT_READ|PROT_EXEC) = 0
mprotect(0x7f1bf42ec000, 4096, PROT_READ|PROT_EXEC) = 0
mprotect(0x7f1bf42ec000, 4096, PROT_READ|PROT_EXEC) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1bf42eb000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1bf42ea000
mprotect(0x7f1bf42eb000, 4096, PROT_READ|PROT_EXEC) = 0
mprotect(0x7f1bf42ea000, 4096, PROT_READ|PROT_EXEC) = 0
mprotect(0x7f1bf42ea000, 4096, PROT_READ|PROT_EXEC) = 0
ioctl(15, DRM_IOCTL_MODE_MAP_DUMB, 0x7ffc25bd2fd0) = 0
mmap(NULL, 16777216, PROT_READ, MAP_SHARED, 15, 0x101035000) = 0x7f1b77000000
mmap(NULL, 10485760, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1b76600000
brk(0x30af000) = 0x30af000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1bf42e9000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1bf42e8000
mprotect(0x7f1bf42e9000, 4096, PROT_READ|PROT_EXEC) = 0
mprotect(0x7f1bf42e8000, 4096, PROT_READ|PROT_EXEC) = 0
mprotect(0x7f1bf42e8000, 4096, PROT_READ|PROT_EXEC) = 0
brk(0x30d4000) = 0x30d4000
getpid() = 2995
getpid() = 2995
futex(0x243c8b8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243c868, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243ca18, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243cb78, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243cb28, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243ccd8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243ce38, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243cde8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243cf98, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243d0f8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243d0a8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243d258, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243d3b8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243d368, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243d518, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243d678, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243d628, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243d7d8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243c918, FUTEX_WAIT_PRIVATE, 0, NULL) = ?
+++ killed by SIGBUS (core dumped) +++
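One thing the tail of this trace seems to show: the traced thread was still blocked in a futex wait (strace prints `= ?` for a call that never returned) when the process was killed. Assuming strace was run without -f, that would suggest the fault happened in a different thread than the one being traced. A small sketch that pulls this summary out of strace output follows; the helper name is made up for illustration.

```python
import re

# "name(args) = ret", e.g. "brk(0x30d4000) = 0x30d4000".
SYSCALL_RE = re.compile(r"^(\w+)\((.*)\)\s+=\s+(\S+)")
# Final line strace prints when the process dies from a signal.
KILLED_RE = re.compile(r"^\+\+\+ killed by (\w+)")


def summarize_trace_tail(trace):
    """Return (fatal signal, last syscall name, its return value)."""
    last_call = (None, None)
    signal = None
    for line in trace.splitlines():
        line = line.strip()
        call = SYSCALL_RE.match(line)
        if call:
            last_call = (call.group(1), call.group(3))
            continue
        killed = KILLED_RE.match(line)
        if killed:
            signal = killed.group(1)
    return signal, last_call[0], last_call[1]


# Last three lines of the strace output posted above.
sample_trace = """\
futex(0x243d7d8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x243c918, FUTEX_WAIT_PRIVATE, 0, NULL) = ?
+++ killed by SIGBUS (core dumped) +++
"""

print(summarize_trace_tail(sample_trace))
```

A return value of `?` on the last call means the syscall itself did not fail; the process was killed while it was still waiting.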
If I start the binary with the -fstack-protector-strong parameter, I get the following output:
00:00:00.640 [sway/config/output.c:717] execvp failed: No such file or directory
WARNING: Kernel has no file descriptor comparison support: Function not implemented
Just for clarification: Running only on the iGPU still works, that is how I actually pasted the message.
Does commenting out any output commands from your config file change the behavior?
I believe this has something to do with loading the background image.
@Geo25rey I changed it to the default config. The execvp message disappears, which is still weird.
There is only one output line: output * bg ~/pictures/wallapaper/XMG_GRID_Wallpaper_2017_IMAGE_4K.jpg stretch, which also works if I start sway on the iGPU only.
The error message WARNING: Kernel has no file descriptor comparison support: Function not implemented stays. It still core dumps btw.
What we are not discussing right now is that if I start sway with the dGPU as the first entry in WLR_DRM_DEVICES, sway does start up and kind of works, at least as long as I only use programs on the display connected to the iGPU. But it lags when moving the mouse cursor from the iGPU display to the dGPU display. It also starts to take up all CPU resources as soon as I open windows on the display connected to the dGPU. This does not make sense to me either. As the dGPU is handling all rendering in this scenario, why would it lag when output is generated on the display that is directly connected to the dGPU?
As the dGPU is handling all rendering in this scenario, why would it lag when output is generated on the display that is directly connected to the dGPU?
The rendering happens on the iGPU, but OpenGL (llvmpipe) is still used to perform buffer copies on the dGPU for scan-out.
@emersion I thought the 1st item in WLR_DRM_DEVICES does the rendering?
So WLR_DRM_DEVICES=$iGPU:$dGPU sway would render on the iGPU and use llvmpipe to copy the buffer to the dGPU. This is the scenario that’s dumping the core.
Meanwhile, WLR_DRM_DEVICES=$dGPU:$iGPU sway should render on the dGPU and copy the buffer to the iGPU. This scenario has the extreme performance issues when running programs on the display connected to the dGPU, while everything on the display connected to the iGPU works.
Yes, but we still need to copy the rendered buffer to the scanout GPU.
Ideas to improve this: https://github.com/swaywm/wlroots/issues/1347
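To make the two scenarios above concrete, here is a minimal sketch of how a WLR_DRM_DEVICES value splits into roles as described in this thread: the first entry renders, and every other entry receives copied buffers for scan-out. This is an illustration of the convention only, not wlroots code, and the function name is invented.

```python
def split_drm_devices(value):
    """Split a WLR_DRM_DEVICES value into (render device, scan-out-only devices)."""
    # The variable is a colon-separated list; empty entries are ignored here.
    devices = [entry for entry in value.split(":") if entry]
    if not devices:
        raise ValueError("WLR_DRM_DEVICES is empty")
    return devices[0], devices[1:]


# Device paths as used later in this thread; which card is the iGPU or dGPU
# differs from machine to machine.
renderer, copy_targets = split_drm_devices("/dev/dri/card0:/dev/dri/card1")
print("renders on:", renderer)
print("buffers copied to:", copy_targets)
```

Swapping the order of the two paths swaps the roles, which is the difference between the crashing scenario and the slow one.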
@emersion I still don’t get why this would cause the slowdown on the scenario 2 of my last comment. Can you try to explain to me, using this example, why and how the slowdown occurs only on the dGPU display if the dGPU does all the rendering?
Does anybody have any more ideas how to debug this?
I still don’t get why this would cause the slowdown on the scenario 2 of my last comment.
I don't know either.
@d4g Can you describe your hardware configuration?
Laptop or desktop? Built in dedicated graphics or external dedicated graphics? If external, thunderbolt or pcie? CPU? If not custom built, make and model?
Laptop or desktop?
Laptop
Built in dedicated graphics or external dedicated graphics?
Built-in integrated graphics with a dGPU via NVIDIA Optimus. The Thunderbolt3/USB-C port is connected directly to the dgpu as well as the hdmi port
If external, thunderbolt or pcie?
nope
CPU?
Intel Core i7-9750H
If not custom built, make and model?
Intel-TongFang QC7 / XMG Fusion 15 with RTX2070 https://www.xmg.gg/en/xmg-fusion-15
The Thunderbolt3/USB-C port is connected directly to the dgpu as well as the hdmi port
@d4g What does this mean?
What happens if you don't set the environment variable "WLR_DRM_DEVICES"?
The Thunderbolt3/USB-C port is connected directly to the dgpu as well as the hdmi port
@d4g What does this mean?
That both connectors for external displays are connected to the dGPU and the internal display is connected to the iGPU.
@d4g Maybe try using the switcheroo driver mentioned here
Sadly, this won’t help, as the laptop does not have a hardware mux and the external ports are connected to the dGPU.
Perhaps the nouveau driver just isn't there yet. Have you opened an issue there?
What happens if you don't set the environment variable "WLR_DRM_DEVICES"?
Just tried. It behaves as if I specify the iGPU first.
I still don't know where this error message comes from:
WARNING: Kernel has no file descriptor comparison support: Function not implemented
Perhaps the nouveau driver just isn't there yet. Have you opened an issue there?
Not yet. Every time I look at their web presence, I get scared.
https://nouveau.freedesktop.org/wiki/ last had news posted in January 2019.
I think that might be the right place to submit a bug, after searching for 30 minutes: https://gitlab.freedesktop.org/drm/nouveau/-/issues
If that's the case, there are only 6 bugs and they have 0 comments. Is there another repo?
The nouveau driver was last updated 28 days ago in the Linux kernel. The libdrm portion of the driver is included with the Mesa driver, last updated 4 months ago. And, although irrelevant, the xf86-video-nouveau driver was last updated 3 weeks ago. So, you shouldn't have anything to worry about activity-wise.
Do you have a recent version of Mesa installed?
mesa-20.1.4 and libdrm-2.4.102
In which repo would you suggest to submit the issue?
I would assume the Mesa repo I linked
But what should I post there? There is no dedicated nouveau error message. The only thing we see is that the stack of sway gets garbled.
Probably #1347 would also eliminate the error on the sway side.
Submitted issue https://gitlab.freedesktop.org/mesa/mesa/-/issues/3486
I'm not sure if that specific Mesa repo is what I had in mind, but it can't hurt. You probably also want to use this link which I've shared previously. You can probably just copy the issue you've created in the first Mesa repo as a starting point.
I can reproduce this exact problem, but only since moving to kernel 5.11, and not with nouveau but with amdgpu + i915. I have a laptop with an Intel iGPU, and I connect an eGPU (AMD) via TB3.
If I set WLR_DRM_DEVICES=/dev/dri/card0, things work and I have video out through my eGPU. If I set it to /dev/dri/card1, things work and I have video out through my iGPU. If, however, I do not set it at all, or I set it to /dev/dri/card0:/dev/dri/card1, then I can reproduce this exact issue.
I see the [main.c:521] Missing a required Wayland interface error, and sway segfaults.
I can reproduce this using a Raspberry Pi 4 as a display adapter, using the gud driver (source).
This appears in journalctl:
Apr 10 11:59:53 growlithe kernel: llvmpipe-4[6344]: segfault at 0 ip 00007fc7e402a387 sp 00007fc710ff8520 error 4
Apr 10 11:59:53 growlithe kernel: Code: cd fe f3 c5 d5 fe fb c5 dd fe db c5 cd fe ed c5 cd fe e4 c4 c1 7d 6f 37 c4 62 75 00 e6 c4 e2 6d 00 d6 c5 f1 ef c9 c5 cd 76 f6 <c4> e2 4d 90 0c 3b c5 c5 76 ff c5 c9 ef f6 c4 e2 45 90 34 1b c5 e1
My primary GPU is an AMD RX 5500.
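The kernel line above can be decoded a little further. The sketch below assumes the x86 page-fault error code layout (bit 0: page was present, bit 1: write access, bit 2: user mode) and that the kernel prints the code in hex; the helper name and the result keys are made up for illustration.

```python
import re

# Shape of the kernel's "segfault at ..." message as seen in the journal above.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_segfault(line):
    """Decode a kernel segfault log line, or return None if it does not match."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        return None
    err = int(match.group("err"), 16)
    address = int(match.group("addr"), 16)
    return {
        "thread": match.group("comm"),
        "pid": int(match.group("pid")),
        "address": address,
        "address_is_null": address == 0,
        "page_present": bool(err & 0x1),  # bit 0: fault on a present page
        "access": "write" if err & 0x2 else "read",  # bit 1
        "mode": "user" if err & 0x4 else "kernel",  # bit 2
    }


# The journal line posted above.
sample_line = (
    "Apr 10 11:59:53 growlithe kernel: llvmpipe-4[6344]: segfault at 0 "
    "ip 00007fc7e402a387 sp 00007fc710ff8520 error 4"
)

for key, value in decode_segfault(sample_line).items():
    print(key, "=", value)
```

Read this way, the line describes a user-mode read of address 0 in a thread named llvmpipe-4, so the crash appears to happen inside one of llvmpipe's worker threads rather than on sway's main thread.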