swaywm / sway

i3-compatible Wayland compositor
https://swaywm.org
MIT License

Sway starts slowly and disturbs shutdown #2784

Closed. lephe closed this issue 6 years ago.

lephe commented 6 years ago

Whenever I start recent sway versions on my laptop, it freezes for 18 seconds before showing the background (normally this is instantaneous), and htop shows that it actually uses the CPU for the whole 18 seconds. I can't even switch TTYs during that time, although everything works fine once sway has started.

At shutdown, sway never seems to exit; the TTY from which I try to stop it (TTY1 if I use shift-mod-e, or another one if I use kill -15 or kill -9) freezes. I am unable to shut down my system unless I hold down the power button for a few seconds.

I confirmed that it happens in 1.0-alpha.6 and 1.0-alpha.5; I could not build earlier 1.0 versions because I did not manage to enable the wlroots experimental flag.

I am willing to dive into the code to fix the issue, but I have never contributed to sway, so I'll need some advice on locating the exact version and the problem.

emersion commented 6 years ago

Can you provide debug logs? Profiling with perf might help.

lephe commented 6 years ago

OK, so "debuglog on" is rejected (sway claims the option doesn't exist); using it produces an error message in swaynag and breaks other things.

With sway -d I got a debug log containing a possible clue to the CPU-time issue:

2018-10-07 12:15:29 - [backend/drm/drm.c:81] Found 0 DRM planes
nvc0_screen_create:983 - Error allocating PGRAPH context for M2MF: -16
2018-10-07 12:15:47 - [render/egl.c:145] Using EGL 1.4
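The 18-second jump between the "Found 0 DRM planes" and "Using EGL 1.4" timestamps above lines up with the startup freeze. In a full sway -d log such a gap is easy to miss, so here is a minimal sketch of a helper that reports consecutive timestamped lines at least N seconds apart. The name log_gaps is hypothetical, and the sketch assumes the "YYYY-MM-DD HH:MM:SS - ..." prefix shown above; it uses only POSIX awk (no regex intervals, so older mawk works too).

```shell
# Hypothetical helper: log_gaps LOGFILE [THRESHOLD_SECONDS]
# Prints every pair of consecutive timestamped lines that are at least
# THRESHOLD seconds apart (default 5). Lines without a timestamp (such as
# the nouveau message above) are skipped. A gap that crosses midnight is
# not handled, because only the time of day is compared.
log_gaps() {
  awk -v threshold="${2:-5}" '
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9] / {
      split($2, hms, ":")
      now = hms[1] * 3600 + hms[2] * 60 + hms[3]
      if (seen && now - prev >= threshold) {
        printf "%ds gap:\n  %s\n  %s\n", now - prev, prevline, $0
      }
      prev = now
      prevline = $0
      seen = 1
    }
  ' "$1"
}
```

For example, log_gaps sway.log 10 run on the three lines above reports an 18s gap followed by the two surrounding timestamped lines.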

There are no other errors besides xkbcomp messages. At shutdown, everything goes smoothly after the "Shutting down sway" message, except for these two errors (I suspect the second one appears when I hold down the power button):

(EE) failed to write to XWayland fd: Broken pipe
...
Gdk-Message: 12:18:33.222: Error reading events from display: Broken pipe
Exiting due to signal.
XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
      after 438 requests (437 known processed) with 0 events remaining.

I attached the log file: sway.log

I should also mention (not sure if it's related) that the change that introduced this problem also broke a few GTK dialogs in my favorite software, but only when the window opening the dialog is in a stacked or tabbed layout; the split layout works fine. (I don't have the slightest idea what's going on, and I hope I didn't mess up the install.)

emersion commented 6 years ago

It seems you have two graphics cards: one Intel GPU and one NVIDIA GPU with nouveau. It seems there's an issue with the NVIDIA GPU: it doesn't detect any CRTC. This looks like a nouveau bug, so it may be useful to report it ("nvc0_screen_create:983 - Error allocating PGRAPH context for M2MF: -16"). One other weird thing is that the GL context for the NVIDIA GPU is initialized with "GL vendor: VMware, Inc.".

If you don't have outputs connected to your NVIDIA GPU, can you try exporting WLR_DRM_DEVICES=/dev/dri/card0 before starting sway? This disables the NVIDIA GPU and only uses the Intel one.
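A small wrapper makes it easy to retry with a different device list. This is only a sketch: the function name sway_on_devices and the SWAY_LOG variable are hypothetical, and it relies on wlroots reading WLR_DRM_DEVICES as a colon-separated list whose first existing entry becomes the primary GPU (as described in the wlroots environment-variable documentation). It also passes sway -d, which writes the debug log to stderr.

```shell
# Hypothetical wrapper: sway_on_devices /dev/dri/cardN [/dev/dri/cardM ...]
# Starts sway with wlroots restricted to the given DRM device nodes and
# writes the sway -d debug log to $SWAY_LOG (default: ~/sway.log).
sway_on_devices() {
  local dev devices
  if [ "$#" -eq 0 ]; then
    echo "usage: sway_on_devices /dev/dri/cardN [/dev/dri/cardM ...]" >&2
    return 2
  fi
  for dev in "$@"; do
    # Fail early instead of starting sway with a mistyped node.
    if [ ! -e "$dev" ]; then
      echo "no such device: $dev" >&2
      return 1
    fi
  done
  # Join the arguments with ':' (IFS is only changed inside the subshell).
  devices=$(IFS=:; printf '%s' "$*")
  WLR_DRM_DEVICES=$devices sway -d 2>"${SWAY_LOG:-$HOME/sway.log}"
}
```

For the setup described here, sway_on_devices /dev/dri/card0 keeps only the Intel GPU; listing a second node after it would keep both, with the first one as primary.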

emersion commented 6 years ago

> the change that introduced this problem

Can you do a git bisect to know exactly which commit has introduced this issue?

lephe commented 6 years ago

Yes, I have hybrid graphics. My laptop is recent and I have not yet tried to configure it in detail because I need to study the topic first. Setting WLR_DRM_DEVICES works very well; I got my normal boot and shutdown back. The issue with dialogs not working seems independent; I will submit a report or a patch if I manage to isolate it (later).

I'm actually quite happy using only the Intel card because I care about battery life, but I will report the nouveau problem upstream. Is CRTC detection done entirely by nouveau?

> Can you do a git bisect to know exactly which commit has introduced this issue?

I will try this, although it's somewhat difficult because I have to bisect wlroots as well; many of my earlier attempts did not build.

lephe commented 6 years ago

The thing is, 1.0-alpha.4 only detects one GPU (sway.log). I'm still working out how to bisect between these, since no single wlroots version builds both.

ddevault commented 6 years ago

> wlroots=4ed6ee0, sway=1.0-alpha.5 fails
> wlroots=2a58d44, sway=1.0-alpha.4 works

This isn't a bisect and isn't very useful. The problem probably lies in wlroots - use alpha.6 and bisect wlroots between 2a58d44 and 4ed6ee0.

https://git-scm.com/docs/git-bisect

This is a really wide range of commits, so you might be better off trying to find the first "bad" commit manually by hopping back from master a few dozen commits at a time.
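If the good/bad check can be scripted, git bisect run does the hopping automatically. A sketch under that assumption: first_bad_commit is a hypothetical helper, and the test command you pass to it (for wlroots, something that builds and starts rootston and detects the startup stall) is up to you.

```shell
# Hypothetical helper: first_bad_commit REPO GOOD BAD CMD [ARGS...]
# Bisects REPO between GOOD and BAD with `git bisect run CMD ARGS...` and
# prints the full hash of the first bad commit. CMD's exit status decides
# each step: 0 = good, 125 = skip (e.g. the commit does not build),
# any other code from 1 to 127 = bad.
first_bad_commit() {
  local repo=$1 good=$2 bad=$3
  shift 3
  (
    cd "$repo" || exit 1
    git bisect start "$bad" "$good" >/dev/null || exit 1
    git bisect run "$@" >/dev/null 2>&1
    status=$?
    # After a successful run, refs/bisect/bad points at the first bad commit.
    first_bad=$(git rev-parse refs/bisect/bad)
    git bisect reset >/dev/null 2>&1
    [ "$status" -eq 0 ] || exit 1
    printf '%s\n' "$first_bad"
  )
}
```

In a wlroots checkout this would be called with GOOD=2a58d44 and BAD=4ed6ee0. When every bad step requires a forced reboot, automation is probably not practical; the same search can be walked by hand with git bisect good and git bisect bad after each test, and git bisect skip for commits that do not build.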

ddevault commented 6 years ago

If you run into build failures, try building against a sway commit near the date of the wlroots commit you're testing.
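One way to pick a sway commit "near the date" of a wlroots commit is to ask git for the newest sway commit that is not newer than it. A sketch assuming local clones of both repositories; sway_commit_near is a hypothetical helper, and a date-matched pair may still fail to build.

```shell
# Hypothetical helper: sway_commit_near WLROOTS_REPO WLROOTS_COMMIT SWAY_REPO [SWAY_REF]
# Prints the newest commit reachable from SWAY_REF (default HEAD) whose
# committer date is not later than the committer date of WLROOTS_COMMIT.
sway_commit_near() {
  local wlr_repo=$1 wlr_commit=$2 sway_repo=$3 sway_ref=${4:-HEAD}
  local when
  # %ci prints the committer date as "YYYY-MM-DD HH:MM:SS +ZZZZ".
  when=$(git -C "$wlr_repo" show -s --format=%ci "$wlr_commit") || return 1
  git -C "$sway_repo" rev-list -1 --before="$when" "$sway_ref"
}
```

For example, sway_commit_near ~/src/wlroots cb42e16f ~/src/sway master (the paths are hypothetical) prints a sway commit to check out before building against that wlroots commit.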

emersion commented 6 years ago

You can also just try rootston, the compositor that ships with wlroots, so you don't have to compile sway.

ddevault commented 6 years ago

Here are some likely suspects:

~/s/wlroots > git log --oneline 2a58d44..4ed6ee0 backend/drm/
2d8f53af Check for DRM prime
1a2b3445 Remove unused data from gbm_bo userdata
4bee710c Fix hardware cursor on secondary GPU
e547e55b multi-gpu: do not flip screens on secondary GPU
15dacebc multi-backend: do not expose internal renderers
364afced backend/drm: remove unnecessary casts
2ebecb67 backend/drm: allow to pass empty gamma ramp to reset it
a149c237 Implement wlr-gamma-control-unstable-v1

lephe commented 6 years ago

> This isn't a bisect and isn't very useful.

I know... sorry for not being very useful. I honestly tried my best this morning, but almost 10 combinations in a row failed, let alone a real bisect. Hopefully I can improve!

> The problem probably lies in wlroots - use alpha.6 and bisect wlroots between 2a58d44 and 4ed6ee0.

So I did this by hand using rootston to avoid version conflicts (thanks emersion!), and I found that the bug was introduced by commit cb42e16f:

session: load GPU devices even if they have zero crtcs/connectors/encoders

On some systems (most notably laptops with two GPUs) there are GPUs that
don't have attached outputs. However, we still want to load those GPUs
because they could still be used by the compositor for rendering.

This is very close to what emersion pointed out.

ddevault commented 6 years ago

> I know... sorry for not being very useful. I honestly tried my best this morning, but almost 10 combinations in a row failed, let alone a real bisect. Hopefully I can improve!

You probably would have found this much easier if you just ran a bisect!

lephe commented 6 years ago

Well, bisecting sway (I first assumed the bug was there) would have left me with the problem of choosing matching wlroots commits to build against. Am I completely wrong here? Also, whenever I hit a "bad" commit I have to force a hard reboot, and the laptop doesn't seem to like that much (UEFI randomly takes twice as long to boot?). Anyway.

So wlroots obviously loads my NVIDIA card on purpose, but then fails because the card has no CRTC. I'm not sure whether the bug is in wlroots or in nouveau because I'm not familiar with graphics cards. Any thoughts?

lephe commented 6 years ago

OK, so I've been looking at my graphics card configuration, and I must admit it doesn't work well at all. Currently no application is able to use the NVIDIA card, so it's unlikely that anything is wrong with sway.

I'll close this issue. Thanks for helping me fix the situation; I can keep working with sway without worries. :)

nimarb commented 6 years ago

Hey @lephe, I have the same issue (Lenovo T480s with an Intel IGP and an NVIDIA graphics card). The issue persists even if I set WLR_DRM_DEVICES=/dev/dri/card0 before starting sway.

How have you solved the issue?

Perhaps wlroots should only load a graphics card if it has a CRTC? EDIT: I just saw that emersion already created an issue for that: #2816

emersion commented 6 years ago

#2816 only happens if you have plugged in more outputs than your GPU supports.

Logs?

lephe commented 6 years ago

> Hey @lephe, I have the same issue (Lenovo T480s with an Intel IGP and an NVIDIA graphics card). The issue persists even if I set WLR_DRM_DEVICES=/dev/dri/card0 before starting sway.
>
> How have you solved the issue?

In my case setting WLR_DRM_DEVICES was enough. The startup latency was caused by some operation on the GPU, which I could reproduce with Bumblebee's optirun, so using only the Intel card swiftly solved it.

Your logs will tell, but you may want to check that the /dev/dri nodes are numbered the same way on your system, i.e. that card0 really is the Intel GPU.
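The card numbers under /dev/dri can differ between machines, so card0 is not guaranteed to be the Intel GPU. A sketch that maps each card node to the kernel driver bound to it through sysfs (i915 for Intel, nouveau for NVIDIA in this thread). The function list_drm_cards and the DRM_SYSFS_ROOT override are hypothetical; the override exists only so the function can be exercised without real hardware.

```shell
# Hypothetical helper: list_drm_cards
# Prints one line per DRM card node with its kernel driver, for example
# "/dev/dri/card0 i915". Reads /sys/class/drm unless DRM_SYSFS_ROOT is set.
list_drm_cards() {
  local root=${DRM_SYSFS_ROOT:-/sys/class/drm}
  local card driver
  for card in "$root"/card[0-9]*; do
    case ${card##*/} in
      *-*) continue ;;   # skip connector entries such as card0-eDP-1
    esac
    # Skip entries without a bound driver (or an unmatched glob).
    [ -e "$card/device/driver" ] || continue
    # The driver link points at .../drivers/<name>; keep only <name>.
    driver=$(basename "$(readlink "$card/device/driver")")
    printf '/dev/dri/%s %s\n' "${card##*/}" "$driver"
  done
}
```

Whichever node is listed with i915 is the one to put in WLR_DRM_DEVICES on that machine.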

> #2816 only happens if you have plugged in more outputs than your GPU supports.

Is it relevant when the GPU has no CRTC?