ppy / osu

rhythm is just a *click* away!
https://osu.ppy.sh
MIT License
15.13k stars 2.25k forks source link

Multiplayer should be resilient to websocket disconnections #14365

Open rednir opened 3 years ago

rednir commented 3 years ago

Describe the bug: People seem to get kicked from multiplayer rooms whilst in the middle a playing a match, followed with the notification "Connection to multiplayer server was lost."

Happened about 3-ish times within 40 minutes (and it wasn't just me getting disconnected)

osu!lazer version: 2021.815.0

Logs: runtime.log network.log

peppy commented 3 years ago

From logs:

2021-08-17 16:44:12 [error]: System.AggregateException: A Task's exception(s) were not observed either by Waiting on the Task or accessing its Exception property. As a result, the unobserved exception was rethrown by the finalizer thread. (An unexpected error occurred invoking 'ChangeUserMods' on the server. InvalidStateException: Incompatible mods were selected: HR,PF,SO,CL,SI,GR,BR,MU)
frenzibyte commented 3 years ago

Looks to be OsuModGrow somehow got in the mix with OsuModSpinIn, which are both incompatible. I'll look into this further.

frenzibyte commented 3 years ago

The mod incompatibility portion would be a duplicate of #12874

https://user-images.githubusercontent.com/22781491/129829793-4bab7e3b-bef9-4360-8af0-fd8efb58ebc3.mp4

[runtime] 2021-08-18 02:57:24 [error]: System.AggregateException: A Task's exception(s) were not observed either by Waiting on the Task or accessing its Exception property. As a result, the unobserved exception was rethrown by the finalizer thread. (An unexpected error occurred invoking 'ChangeUserMods' on the server. InvalidStateException: Incompatible mods were selected: GR,TC)
[runtime] 2021-08-18 02:57:24 [error]: ---> Microsoft.AspNetCore.SignalR.HubException: An unexpected error occurred invoking 'ChangeUserMods' on the server. InvalidStateException: Incompatible mods were selected: GR,TC
[runtime] 2021-08-18 02:57:24 [error]: at Microsoft.AspNetCore.SignalR.Client.HubConnection.InvokeCoreAsyncCore(String methodName, Type returnType, Object[] args, CancellationToken cancellationToken)
[runtime] 2021-08-18 02:57:24 [error]: at System.Threading.Tasks.ForceAsyncAwaiter`1.GetResult()
[runtime] 2021-08-18 02:57:24 [error]: at Microsoft.AspNetCore.SignalR.Client.HubConnection.InvokeCoreAsync(String methodName, Type returnType, Object[] args, CancellationToken cancellationToken)
[runtime] 2021-08-18 02:57:24 [error]: --- End of inner exception stack trace ---
[runtime] 2021-08-18 02:57:24 [error]: An unobserved error has occurred. (inner)
[runtime] 2021-08-18 02:57:24 [error]: Microsoft.AspNetCore.SignalR.HubException: An unexpected error occurred invoking 'ChangeUserMods' on the server. InvalidStateException: Incompatible mods were selected: GR,TC
[runtime] 2021-08-18 02:57:24 [error]: at Microsoft.AspNetCore.SignalR.Client.HubConnection.InvokeCoreAsyncCore(String methodName, Type returnType, Object[] args, CancellationToken cancellationToken)
[runtime] 2021-08-18 02:57:24 [error]: at System.Threading.Tasks.ForceAsyncAwaiter`1.GetResult()
[runtime] 2021-08-18 02:57:24 [error]: at Microsoft.AspNetCore.SignalR.Client.HubConnection.InvokeCoreAsync(String methodName, Type returnType, Object[] args, CancellationToken cancellationToken)

However, I'm not sure what exactly has got you kicked from the multiplayer room as that seems different from the mod incompatibility failure which doesn't affect anything.

Looking at the logs again, it seems you've indeed got booted out while in gameplay:

021-08-17 16:46:02 [verbose]: Polling adjusted (listing: 0, selection: 0)
2021-08-17 16:46:02 [verbose]: Screen changed → PlayerLoader
2021-08-17 16:46:02 [verbose]: Polling adjusted (listing: 0, selection: 0)
2021-08-17 16:46:02 [verbose]: Polling adjusted (listing: 0, selection: 0)
2021-08-17 16:46:06 [verbose]: Screen changed → MultiplayerPlayer
2021-08-17 16:46:22 [important]: Connection to multiplayer server was lost.
2021-08-17 16:46:22 [verbose]: Screen changed ← PlayerLoader
2021-08-17 16:46:22 [verbose]: Screen changed ← Multiplayer
2021-08-17 16:46:22 [verbose]: Polling adjusted (listing: 0, selection: 0)
2021-08-17 16:46:22 [verbose]: Polling adjusted (listing: 15000, selection: 15000)
2021-08-17 16:46:23 [verbose]: Focus contention triggered by NotificationOverlay.

And the network logs show that the connection was closed without any exceptions thrown, oddly:

2021-08-17 16:46:22 [verbose]: OnlineMultiplayerClient disconnected
2021-08-17 16:46:22 [verbose]: OnlineMultiplayerClient connecting...
2021-08-17 16:46:22 [verbose]: Performing request osu.Game.Online.Rooms.SubmitRoomScoreRequest
2021-08-17 16:46:22 [verbose]: Request to https://osu.ppy.sh/api/v2/rooms/97272/playlist/296848/scores/1257860 successfully completed!
2021-08-17 16:46:22 [verbose]: Performing request osu.Game.Online.Rooms.PartRoomRequest
2021-08-17 16:46:22 [verbose]: OnlineMultiplayerClient connected!
2021-08-17 16:46:22 [verbose]: Request to https://osu.ppy.sh/api/v2/rooms/97272/users/17204559 successfully completed!
2021-08-17 16:46:22 [verbose]: Performing request osu.Game.Online.Rooms.GetRoomsRequest
2021-08-17 16:46:23 [verbose]: Request to https://osu.ppy.sh/api/v2/rooms successfully completed!
rednir commented 3 years ago

I played multi again with one of the players that were having issues, and nobody seemed to be disconnecting while that player did several times. They had Connection to multiplayer server was lost in their notifications but also API connection was lost, can't continue with online play. Unfortunately I don't have their logs, but this just sounds like their internet connection is dropping for a couple seconds.

So I don't think this as much of an issue as I made it out to be, that's my bad. But I do find it strange that when I disconnected, in my network.log file above there are no timeout errors. Either way you might want to consider closing this.

peppy commented 3 years ago

Would definitely help to get more logs of these issues. I think we do need to better handle this situation, but first need to understand what's happening. SignalR should be able to recover from disconnections itself.

TheThirdDave commented 2 years ago

A friend and I were experiencing this issue on a semi-regular basis this week. Is there still a need for logs on the issue? We did notice it seemed to be only happening to the host of the multiplayer room from our observations.

I can try to reproduce the issue and attach the logs here.

smoogipoo commented 2 years ago

Yes, we do still need logs.

Doxxz commented 2 years ago

Regular occurence to me. I've had it happen when in game, when spectating, and on song select. I suspected it was a Linux only issue, but I asked around and it seems Windows 11 users also get it, not sure about Win10. I use no proxy servers or vpn. I even tried turning my firewall off but I still got kicked eventually.

Logs: network.log runtime.log

peppy commented 2 years ago

For anyone reporting these issue, it would be useful to know your network setup.

TheThirdDave commented 2 years ago

Sorry for delay.

No proxy or vpn setup. Friend that it happened to is on wifi and I'm hardwired to the router.

Issue occurred mid song while we were playing.

It's happened to both of us pretty evenly. Both are Win11.

These logs are my friends logs that I was playing with to try and recreate the issue. I was able to track down in the logs to the exact time of the issue timestamp: 2022-03-31 00:23:59

runtime.log network.log

Let me know if you need anymore info or would like us to try and do specific things to try and reproduce!

peppy commented 2 years ago

@TheThirdDave to confirm, did both you and your friend get disconnected at the same time?

TheThirdDave commented 2 years ago

@peppy no just my friend was disconnected.

wooster0 commented 2 years ago

@peppy This has also happened to me thrice now in the exact same room, while playing ingame, no spectator, with only Connection to multiplayer server was lost in my notifications. I play with LAN/Ethernet, no WiFi, and no VPN or anything. And I don't think I use any proxy or my router does any ad blocking stuff or anything. Here are my logs from the same session. I haven't closed osu! since the disconnects. First it happened to me as the room host and subsequently as a guest of that room. Also, it was always just me who got disconnected. We never used any mods and it also happened within about 40 minutes.

database.log network.log performance.log runtime.log

I have to say I don't really remember experiencing this issue this often. I was playing with one other person who lives very far away from me and we were playing low-difficulty beatmaps. I don't know if it has anything to do with that. Maybe it's also somehow more common with only 2 people?

smoogipoo commented 2 years ago

@r00ster91 Your logs are even more concerning because there's no error in there, just disconnections.

Enabling low-level SignalR logging doesn't look like an option because the docs mention that logging is unsafe, so I'm weary of doing that in production. If any of the users can repro the issue on dev.ppy.sh (via using a debug build of the game), then maybe we can do something there.

The weird part is in all of these cases, the reconnection happens almost immediately without issue. Simply increasing the timeout time (which is already 20 secs) is unlikely to help us, especially not in the cases where there is no timeout (two of the above cases).

I've also tested that blocking the SignalR thread has no effect client-side, so unless this is a server-side blocking issue (which I believe to have previously confirmed can cause timeouts) I don't think it can be that.

I've gone pretty deep into SignalR docs + source at this point and can't really find anything to go by...

peppy commented 2 years ago

As mentioned on discord in followup, the logs show that this is a websocket disconnection which is something we actually need to account for

2022-04-11 13:58:15 [verbose]: OnlineMultiplayerClient connection error: System.InvalidOperationException: Websocket closed with error: EndpointUnavailable.
2022-04-11 13:58:15 [verbose]: at Microsoft.AspNetCore.Http.Connections.Client.Internal.WebSocketsTransport.StartReceiving(WebSocket socket)
2022-04-11 13:58:15 [verbose]: at System.IO.Pipelines.Pipe.GetReadResult(ReadResult& result)
2022-04-11 13:58:15 [verbose]: at System.IO.Pipelines.Pipe.GetReadAsyncResult()
2022-04-11 13:58:15 [verbose]: at Microsoft.AspNetCore.SignalR.Client.HubConnection.ReceiveLoop(ConnectionState connectionState)

SignalR doesn't provide any kind of built-in connection resilience (best documentation i can find is at https://docs.microsoft.com/en-us/aspnet/core/signalr/dotnet-client?view=aspnetcore-6.0&tabs=visual-studio#handle-lost-connection), so we need both client and server-side refactoring to ensure that a dropped connection does not cause the user to immediately lose state.

Actual cause of lost connections is not too relevant, if the user and server connection are stable it's highly likely to be a cloudflare level implementation detail. Websocket connections aren't guaranteed to be persistent forever.

Client-side will require some kind of queueing, similar to what SpectatorClient is doing. Server-side will require decoupling user state from connections, and also some kind of message queue, ordering guarantee, or equivalent path to recovery (ie. is it fine for any/all messages to not reach the client? if not, how do we guarantee that the client is in the correct state on reconnection?).

Quite a bit to consider.

smoogipoo commented 2 years ago

the logs show that this is a websocket disconnection

There are 3 different logs above, only one of which has the exception there. Hard to say if the others are also websocket disconnection or something else low-level in the TCP stack (OS/hardware).

Client-side will require some kind of queueing

I would probably not do queueing. I think it is unexpected from a UX point of view to suddenly have 20 actions firing off because the button wasn't clicking for the past 10 seconds.

how do we guarantee that the client is in the correct state on reconnection

Not having queueing also makes this relatively simple - JoinRoom already puts the room in the correct state.

peppy commented 2 years ago

Not having queueing also makes this relatively simple - JoinRoom already puts the room in the correct state.

Yeah, for now that's the case server-side. We'll need to consciously keep that true going forward, if we want to be sure that's the case.

peppy commented 2 years ago

As an update, as of the last release the multiplayer endpoint has been rerouted in a way such that disconnections to be about as rare as one would expect from osu!(stable) / bancho.

Going to remove the priority from this issue for now. We may eventually want to revisit this in the future.