Closed wb9688 closed 2 years ago
What is errno set to when this happens? As the comment says this is probably fatal, but I would be interested to know what killed the connection.
The errno is 32 (broken pipe).
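For readers unfamiliar with errno 32: it is EPIPE on Linux, which a process gets when it writes to a socket or pipe whose other end has already been closed. A minimal sketch of that situation, using a plain Unix socket pair rather than a real Wayland connection (the function name is made up for illustration):

```python
import errno
import socket


def write_after_peer_closed() -> int:
    """Return the errno raised when writing to a socket whose peer is gone."""
    ours, theirs = socket.socketpair()  # connected AF_UNIX stream sockets
    theirs.close()                      # the peer drops the connection
    try:
        ours.send(b"request")
    except OSError as exc:
        return exc.errno                # EPIPE: nobody is left to read this
    finally:
        ours.close()
    return 0                            # only reached if the send succeeded


if __name__ == "__main__":
    code = write_after_peer_closed()
    print(code, errno.errorcode.get(code))
```

So the client seeing a broken pipe only says that the compositor side hung up first; it does not say why.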
Oh, this sounds very much like #4389. But now we have a repro case! Why the pipe is broken I have absolutely no clue, maybe WAYLAND_DEBUG=1 will tell us?
Here is the full output from SuperTuxKart with WAYLAND_DEBUG=1:
Note that errno: 32 is the output of my printf.
Huh, I'm actually not entirely sure what's happening - no errors/warnings are printed, just about the only thing I can see is a gap of time between the last presentation and the moment it disconnects (after all the motion events). Whatever this is, we can at least narrow it down to wlroots since mutter/kwin don't appear to be affected. This might be a violation on our part, could also just be an aggressive timeout or something.
I am also able to reproduce it in Weston.
Hm, I can't personally reproduce with SuperTuxKart on Wayfire (it loads very quickly for me in any case).
Too many unread buffered Wayland events breaking the pipe is definitely a possibility, but I've only encountered it with e.g. FreeCAD doing heavy calculations for many minutes.
In any case, hanging on the main thread is no good. Toolkits like SDL can't/shouldn't really do anything about that. Except… something silly with threads I guess? which might worsen the latency in the good fast-running loop case.
This one might need the attention of someone who knows the protocol and/or libwayland-client a little better... anybody know a compositor dev we can bother for this one?
Is there anything printed in the compositor logs? Is it "error in client communication"?
libwayland-server will disconnect clients after their send buffer fills up. The buffer can hold 4096 bytes + the internal socket buffer in the kernel.
Can you strace the game? If it fails in a sendmsg syscall, that's likely the cause.
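To make the "send buffer fills up" part concrete, here is a rough model of a sender that must not block, writing fixed-size events to a peer that never reads. It only exercises the kernel socket buffer of a Unix socket pair; the 4096-byte libwayland buffer and the disconnect itself are not modeled, and the 20-byte event size is an assumption (an 8-byte message header plus three 32-bit arguments for a pointer motion event).

```python
import socket

MOTION_EVENT_SIZE = 20  # assumed size of one pointer motion event on the wire


def events_until_full(event_size: int = MOTION_EVENT_SIZE) -> int:
    """Count how many events a sender can queue while the receiver never reads."""
    server, client = socket.socketpair()
    server.setblocking(False)  # like a compositor, the sender must not block
    payload = b"\0" * event_size
    sent = 0
    try:
        while True:
            written = server.send(payload)
            if written < event_size:
                break          # partial write: the buffer is effectively full
            sent += 1
    except BlockingIOError:
        pass                   # EAGAIN: the kernel socket buffer is full
    finally:
        server.close()
        client.close()
    return sent


if __name__ == "__main__":
    print("events queued before the buffer filled:", events_until_full())
```

The number printed depends on the kernel's socket buffer settings, so treat it as an order-of-magnitude figure for one machine. At the point where this loop stops, libwayland-server is described above as disconnecting the client instead of waiting.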
I am indeed seeing [11:03:34.477] libwayland: error in client communication (pid 6694).
Well, that explains that... the question is, do we have a means of preventing that overflow? The part that worries me is that in this case it's the mouse, which makes me wonder if high-resolution mice cause the limit to get hit a lot faster.
(Also, similar to hidden surfaces, you don't have to look far to find major examples of games that load without presenting/pumping events - OpenGL got people in a nasty habit of doing that since threading was nontrivial.)
As I said, a toolkit can only do something silly with threads…
Actually maybe not that silly, the description of wl_display_read_events seems to imply that it's thread-safe and will just read from the pipe, putting all the events into client-side queues when other threads are not reading. Maybe just create a background thread that calls that function in a loop?
The background thread makes sense for reading during a block at least... we have small, low-priority threads for stuff like hotplugging events too. We also don't have any strict rules on what thread events come from, so this wouldn't break any rules on the SDL side (as far as I know). We just have to make sure it's low priority and that it doesn't spin super hard. Maybe we just have to move the IOReady block to a thread? That would also let us move the SwapBuffers hack too, if I'm not mistaken.
With just wl_display_read_events you wouldn't even change "what thread events come from" as that function only moves the stuff from the pipe to libwayland's client-side queues, your code would drain libwayland queues as usual. The only question here is whether libwayland queues are bounded. But yeah, if you can make even deeper changes to how events are handled, that's also good.
A wl_display_read_events thread shouldn't spin at all, IIUC it would always be either waiting for the pipe to be ready to read OR waiting on locks to make sure other threads aren't ready to read.
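The background-reader idea can be modeled without libwayland at all. The sketch below uses a Unix socket pair and an in-process queue as stand-ins: a helper thread blocks in recv and moves bytes into the queue, while the "main loop" does nothing until the sender is finished. All names are hypothetical, and the unbounded queue.Queue is an assumption standing in for libwayland's client-side queues, whose bounds are exactly the open question raised above.

```python
import queue
import socket
import threading


def start_reader(sock: socket.socket, pending: queue.Queue) -> threading.Thread:
    """Move bytes from the socket into an in-process queue without handling them."""
    def drain() -> None:
        while True:
            chunk = sock.recv(4096)  # blocks until data arrives, so no busy loop
            if not chunk:            # empty read: the peer closed the connection
                return
            pending.put(chunk)

    thread = threading.Thread(target=drain, daemon=True)
    thread.start()
    return thread


def stalled_client_survives(total_bytes: int = 4 * 1024 * 1024) -> int:
    """Send far more than a socket buffer holds while the main loop is idle."""
    server, client = socket.socketpair()
    server.settimeout(10)            # a stuck reader would show up as a timeout
    pending: queue.Queue = queue.Queue()  # unbounded, unlike a socket buffer
    reader = start_reader(client, pending)

    server.sendall(b"\0" * total_bytes)   # the main thread never calls recv
    server.close()
    reader.join()
    client.close()

    # The main loop "wakes up" and processes everything that was buffered.
    received = 0
    while not pending.empty():
        received += len(pending.get())
    return received


if __name__ == "__main__":
    print("bytes buffered while stalled:", stalled_client_survives())
```

The cost moves from the socket to process memory: nothing is dropped, but a client that stalls forever would keep growing the queue.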
It's not clear this issue should be fixed in clients. Here's a discussion about it: https://gitlab.freedesktop.org/wayland/wayland/-/issues/159
Compositors could detect stalled clients and stop sending them input events.
Oof, so my mouse concern was real:
Unfortunately, this is a very real issue. With a short simulated freeze of any Wayland client along with a high-resolution input device (1000Hz mouse in my case), it's fairly easy to cause libwayland-server to terminate the client's connection and cause its process to exit. This applies to Xwayland as a Wayland client as well, which can cause the compositor itself to exit such as in the case of Mutter.
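A back-of-the-envelope estimate shows why a 1000Hz device matters. The 4096-byte figure comes from the comment earlier in this thread; the per-report size (a 20-byte motion event plus an 8-byte frame event) and the kernel buffer sizes are assumptions for illustration, and real kernels account for small messages with extra overhead, so actual headroom is likely smaller.

```python
def seconds_until_overflow(buffer_bytes: int, event_bytes: int, events_per_second: int) -> float:
    """Time a stalled client has before `buffer_bytes` of queued events is used up."""
    events_that_fit = buffer_bytes // event_bytes
    return events_that_fit / events_per_second


if __name__ == "__main__":
    LIBWAYLAND_BUFFER = 4096    # bytes, as stated earlier in this thread
    MOTION_PLUS_FRAME = 20 + 8  # assumed bytes per pointer report (motion + frame)
    for kernel_buffer in (0, 64 * 1024, 208 * 1024):  # hypothetical kernel buffer sizes
        total = LIBWAYLAND_BUFFER + kernel_buffer
        seconds = seconds_until_overflow(total, MOTION_PLUS_FRAME, 1000)
        print(f"kernel buffer {kernel_buffer:>6} bytes: about {seconds:.2f} s at 1000 Hz")
```

With the libwayland buffer alone, the window is well under a second; the kernel buffer extends it, but only by a few seconds under these assumptions.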
I'm open to helping out with this, but yeah let's put the SDL side on hold in favor of the server side.
From a recent GNOME blog...
https://blogs.gnome.org/shell-dev/2021/12/08/an-eventful-instant/
The showstopper was probably what you would suspect the least: applications that are not handling events. If an application is not reading events in time (is temporarily blocking the main loop, frozen, slow, in a breakpoint, …), these events will queue up.
But this queue is not infinite, the client would eventually be shutdown by the compositor. With these input devices that could take a long… less than half a second. Clearly, there had to be a solution in place before we rolled this in.
There’s been some back and forth here, and several proposed solutions. The applied fix is robust, but unfortunately still temporary, a better solution is being proposed at the Wayland library level but it’s unlikely to be ready before GNOME 42. In the mean time, users can happily shake their input devices without thinking how many times a second is enough.
Ignoring the part where apps are definitely taking events and 1000+Hz mice are just ridiculous: does anyone know where the better proposal is? I'm aware of Stoeckl's resizable buffer workaround but I haven't seen anything outside of what's in the report in emersion's link.
The better solution is https://gitlab.freedesktop.org/wayland/wayland/-/merge_requests/188
At this point it seems like the consensus is that this is something that should (and will) be fixed in the IPC library and not the application - users who run into this should follow:
So on SDL's end we're going to depend on the fixes mentioned above. If this gets flip-flopped and becomes an application issue this can be reopened.
When I move my mouse while SuperTuxKart is starting with SDL_VIDEODRIVER=wayland, SDL_SendQuit() gets called at https://github.com/libsdl-org/SDL/blob/402b86f2a88b5ee3e112dbde2ec232ee5f36572f/src/video/wayland/SDL_waylandevents.c#L245 for some reason. This should obviously not happen. It happens in at least Wayfire and Sway.

Edit: When I put a breakpoint on SDL_SendQuit(), the following is the backtrace:

Edit 2: I am sure the SDL_SendQuit() is from the line I linked, because when I add a printf the line before it, it prints stuff.