pop-os / cosmic-comp

Compositor for the COSMIC desktop environment
GNU General Public License v3.0
475 stars 85 forks source link

Surface-local render threads #546

Closed Drakulix closed 3 months ago

Drakulix commented 3 months ago

Here is an unreasonable large PR, that really got out of hand.

Depends on https://github.com/Smithay/smithay/pull/1393

The main purpose of all these changes is to render each output on a separate thread. However it does a lot more (list in no particular order or claim of completeness):

This does run unreasonably stable for the amount of testing done, but is absolutely not recommended for production. I think I catched the most egregious deadlocks, but expect crashes and deadlocks. For some reason Xwayland seems to have even more problems with this build in my testing.

Almost all deadlocks so far are in parts of the code that try to lock the shell and also either access the pointer or keyboard handles, which use locks internally. Conceptionally seats are part of the shell, but since they need the full state accessible for grabs, they are frequently cloned out of the shell, which can lead to cases, where both locks are taken in a different order by a surface thread and the main thread, locking both indefinitely (and thus also all other surface-threads).

While I have been able to work around most of these cases with relative ease, I am not convinced yet, that this is a problem worth ignoring. I made attempts to fix this in smithay (and be able to pass a subtype of State to the input-logic), but this spawned a new discussion on structuring wayland-rs. If this problem remains it might introduce unreasonable debugging-churn for development, as deadlocks might be frequently accidentally introduced while developing. I don't know how we can figure this out without just giving it a shot.

Another solution to bypass this problem rather than fixing this in smithay would be to keep generation of Elements for rendering in the main thread (and still drive the whole scheduling logic from the main thread) and then pass those off to a surface-specific render+commit thread. Element-generation is a major contributor to time spend on a single frame, so this doesn't really give us the same benefits to be able to quite easily drive multiple high-refresh rate outputs. On the other side it would make it much more realistic to make the render+commit-thread realtime in the future.

Before merge this needs a better git history for reviewing (which I'll attempt tomorrow) and a lot more testing.

Drakulix commented 3 months ago

And of course just as I am about to sign off, I notice that xdg-popups (and it looks like override-redirect X11 window as well) at least don't render correctly. Another thing for the todo list for tomorrow.

Drakulix commented 3 months ago

Git history cleaned up as good as possible.

ids1024 commented 3 months ago

Almost all deadlocks so far are in parts of the code that try to lock the shell and also either access the pointer or keyboard handles, which use locks internally. Conceptionally seats are part of the shell, but since they need the full state accessible for grabs, they are frequently cloned out of the shell, which can lead to cases, where both locks are taken in a different order by a surface thread and the main thread, locking both indefinitely (and thus also all other surface-threads).

Even without a change like this, Smithay has had several issues with deadlocks in the KeyboardHandle. So some improvement to how locks are used there could definitely be welcome.

ids1024 commented 3 months ago

Using cosmic-workspaces on this branch seems to result in an unresponsive desktop other than cursor movement (can't interact with or close cosmic-workspaces, and opening a terminal doesn't show any change either).

Drakulix commented 3 months ago

Using cosmic-workspaces on this branch seems to result in an unresponsive desktop other than cursor movement (can't interact with or close cosmic-workspaces, and opening a terminal doesn't show any change either).

Should be fixed by latest commit.

Quackdoc commented 3 months ago

I've been trying this for a couple hours and I have found it to crash a fair amount. I did provide logs but Im not sure they show anything

ran cosmic-comp > /tmp/cosmic-comp-stdout.txt 2> /tmp/cosmic-comp-stderr.txt directly from cli and these are the logs from it

A) first issues i've been having are crashes when using gamescope to play games, cosmic-comp-stdout.txt cosmic-comp-stderr.txt

B) cosmic-comp will sometimes fail to load outright, causing a locked up black screen that requires the user to turn the keyboard into RAW mode using sysrq, then swap tty, kill the server, and try again, no logs get spat out (and it doesnt really happen when redirecting the output)

C) Displays would sometimes act extremely strange, wrong resolution and location cosmic-comp-stdout.txt cosmic-comp-stderr.txt

D) crash when using wlr to toggle second display off and on cosmic-comp-stdout.txt cosmic-comp-stderr.txt

Drakulix commented 3 months ago

Appreciating the detailed feedback, but no the logs aren't super helpful.

For one, for these experimental branches it is worth enabling debug logging, by increasing the default log level to debug: https://github.com/pop-os/cosmic-comp/blob/master/Cargo.toml#L49 (release_max_level_debug).

And to capture crashes you also want to set RUST_BACKTRACE=1. (You probably also want to do debug builds to get better backtraces, though that might impact performance quite a bit. As a workaround we have a fastdebug profile, so building with --profile fastdebug is also an option.)

A) first issues i've been having are crashes when using gamescope to play games

Looking into it.

B) cosmic-comp will sometimes fail to load outright, causing a locked up black screen that requires the user to turn the keyboard into RAW mode using sysrq, then swap tty, kill the server, and try again, no logs get spat out (and it doesnt really happen when redirecting the output)

I am testing this branch on my dev system for a couple of days and have definitely noticed this. In my testing it happened more consistently, if I had multiple monitors attached during launch, can you confirm this?

C) Displays would sometimes act extremely strange, wrong resolution and location

Could you elaborate this more with some examples? It might also be worth while to upload your output-configuration from .local/state/cosmic-comp/outputs.ron and compare the current state with the config file and what you would expect.

D) crash when using wlr to toggle second display off and on.

Great, this should be easy to reproduce, looking into it.

Drakulix commented 3 months ago

Also I am pretty sure there is a memory-leak somewhere causing crashes after just running this branch for a while.

Quackdoc commented 3 months ago

can't recompile atm but will when I can

when using gamescope getting the crashes only occurred when actually resizing gamescope I did go and verify that it was indeed stable enough when just playing but repeatedly resizing the window (I was able to replicate it using multiple wev windows for tiling to occur) I was playing genshin impact when it occured, I will try other applications when I get the chance.

as for the strange resolution issues, I was actually able to replicate it by just running cosmic-comp on old versions too, they stopped occurring when I used cosmic-settings to reset the resolution, the error never occurred when using cosmic-session to start so I was mistaken about the cause.

ids1024 commented 3 months ago

Testing this branch on one computer, I noticed drag and drop surfaces not showing (in cosmic-workspaces, nautilus, etc.).

Drakulix commented 3 months ago

Testing this branch on one computer, I noticed drag and drop surfaces not showing (in cosmic-workspaces, nautilus, etc.).

fixed.

ids1024 commented 3 months ago

Looking at 845e376d3df982032ded363f1140f0470663c6aa, the way insert/insert_if_missing doesn't make data visible in other threads does seem error prone (not that that's a new issue)...

I guess we're still looking for a better way to manage data like that in Smithay and wayland-rs.

XV-02 commented 3 months ago

With the 9fc1372 build, I was seeing issues when interacting with tabs in Firefox (from https://github.com/pop-os/packaging-firefox/commit/5a2511492b335bfce0b827104e061f5d88027201).

The behaviour seen was as follows:

  1. The content within Firefox itself would flicker, turning black and then showing again.
  2. The entire user session would become sluggish. I would describe the experience as laggy: long delays between updating, but ultimately all user inputs seemed to be processed. This means that, if I were typing in a console, I would type a string of text, wait a few seconds, and then the entire string would appear.
  3. The first time I saw this, it seemed to recover when I switched away from Firefox and shutdown an unrelated VM; the second time, I had to kill Firefox as the session was growing increasingly unstable

The behaviour seemed to be triggered at anytime tabs were interacted with: switching tabs, closing tabs, looking through all the tabs in the drop-down that appears when tabs over flow.

In terms of hardware, I'm using a Pang12. The system has Ryzen 7 6800U with Radeon 680M integrated graphics. The laptop did not have any external displays, and was running cosmic-term, virtual machine manager (without any active VMs), and slack alongside Firefox. I am using tiling, and I am using stacks.

XV-02 commented 3 months ago

With da1f597, attempting to switch to a virtual console and back to a running cosmic-comp session, I am met with a black screen. cosmic-comp itself is still running. This compares with 254c583 where I can successfully cycle through virtual consoles and return to the composited session without issue.