lichess-org / lila

♞ lichess.org: the forever free, adless and open source chess server ♞
https://lichess.org
GNU Affero General Public License v3.0
15.61k stars 2.28k forks source link

websocket disconnects after pairing #11989

Open nowords123 opened 1 year ago

nowords123 commented 1 year ago

https://lichess.org/forum/lichess-feedback/lag-spike-at-game-start-unable-to-move-and-game-is-aborted-after-time-limit-is-reached?

We are trying to see if there is a socket disconnection issue and which side it is on pls can we get some help

ddugovic commented 1 year ago

Indeed, I tried briefly looking at this; hopefully other developers looking at issues can help put it into perspective since at the moment it seems to be over my head. From that forum thread:

I play a bunch of 1/0 bullet, and every 3 to 8 games I'll start a game, but I can't make my first move.

I believe this is the result of a lag spike that usually occurs at the beginning of games, as this is the only time I ever have problems with connection.

Sometimes at the start of games, and no other times, there will be a massive disconnect. It will often say reconnecting right after I realize that although my piece has been moved, it doesn't yet say that it is my opponents turn. It will then do one of a few things.

  1. It may get back connection after 5 or so seconds and I can play as if nothing happened.
  2. It will go down to the wire and last second I'll get a connection
  3. It won't ever reconnect between game start and abort timer end. In other words, I can move and the game is aborted.

Of the three above, it seems like 3 is the most likely. 2 almost never happens, and 1 happens almost as much as 3.

When starting a game, the first step seems to be to connect two players together, once connected, I believe that is when the abort timer begins, in my mind, there is an additional network step between when I can move and when the timer begins, and I think it is this step where the problem arises. I imagine it would also have to occur in an operation which would be expected to take almost no time at all and would be very consistent and thus would be difficult to find. This is all my speculation, I don't really have a clue about the systems involved, but I still hope my ponderings can provide some inspiration into the problem.

It may be a me problem instead of a server problem as well, as when I play bullet on my phone with low net speed and on app, I don't have any problems, but when I play on my computer sometimes I get a major lag spike (a temporary total disconnection of computer or server) only at the start of games. This I imagine would make the problem much harder to find, so here is a bit of computer info for you.

Windows 10, lotsa ram, Brave browser (like chrome but with ad-blockers and data tracker blockers), I'm pretty far from France I think and I don't use a VPN.

... and finally:

OK, Toadofsky's post made me think, what else can i see from client-side.

Overall, today is bad diagnosis day, as this "lagging" happened really few compared to 3 previous days. Only some tournaments were affected from 9-15 EET+2

Also turns out, that recording whole session with Mozilla developer tools performance analyzer requests too much memory (or overflows memory), to get result of one tournament saved as profile.

But, i was able to catch several interesting findings, while running Lichess in special "focus mode" with developer tools network analyzer turned on First i hoped, that i found culprit, but not so sure now.

Findings so far. Quite often there is blockage on webpage loading request, reason i can not tell, but once it took 14,98 secs and this tournament game is was timed out. get socket is sometimes quite slow, often 300ms but also reached 7-14 seconds in worst cases, but looks like it does not depend, witch of the sockets is free (from socket 1 -to 5)

But, there are some get TLS session problems, and those mach more with "disconnect feature", as they are quite long -from 5-7 secs following with message - reconnecting in 3500 ms. One game i lost exactly while this setup TLS timed out - with no answer or reason to request.

All those things happened with Firefox, only one tab open. (maybe should try plenty of tab open too, to compare.

For fact-hungry persons i can provide screenshots from those timeouts too.

schlawg commented 1 year ago

possible ws connection status & lila onlineUserSet out of sync issue

ddugovic commented 1 year ago

I am relaying comments from https://lichess.org/forum/lichess-feedback/lag-spike-at-game-start-unable-to-move-and-game-is-aborted-after-time-limit-is-reached?page=3#21 which may or may not apply in this context:

Next example is hard to read, not nesccesarily spam. Those requests make most of the problems, as they get often stalled:

You can also see timings on message reguests themselves.

WebSocketv5?sri=2EM8hafMx4G1&v=22 /tournament/WIFzR85V/socket/v5 wss://socket1.lichess.org/tournament/WIFzR85V/socket/v5?sri=2EM8hafMx4G1&v=22 GET 101 websocket [socket1.lichess.org](https://socket1.lichess.org/) websocket 0 B 15.88 s upgrade nginx WebSocketv5?sri=3c7px7DBhxA0&v=23 /tournament/WIFzR85V/socket/v5 wss://[socket1.lichess.org](https://socket1.lichess.org/)/tournament/WIFzR85V/socket/v5?sri=3c7px7DBhxA0&v=23 GET 101 websocket [socket1.lichess.org](https://socket1.lichess.org/) websocket 0 B 16.97 s upgrade nginx WebSocketv5?sri=6jCyQ6EY4KIp&v=23 /tournament/WIFzR85V/socket/v5 wss://socket1.lichess.org/tournament/WIFzR85V/socket/v5?sri=6jCyQ6EY4KIp&v=23 GET 101 websocket socket1.lichess.org websocket 0 B 6.79 s upgrade nginx WebSocketv5?sri=A9Hits96plAP&v=23 /tournament/WIFzR85V/socket/v5 wss://[socket5.lichess.org](https://socket5.lichess.org/)/tournament/WIFzR85V/socket/v5?sri=A9Hits96plAP&v=23 GET 101 websocket socket5.lichess.org websocket 0 B Pending upgrade nginx WebSocketv5?sri=aaG81qGJMxsk&v=7 /tournament/WIFzR85V/socket/v5 wss://[socket1.lichess.org](https://socket1.lichess.org/)/tournament/WIFzR85V/socket/v5?sri=aaG81qGJMxsk&v=7 GET 101 websocket [socket1.lichess.org](https://socket1.lichess.org/) websocket 0 B 8.89 s upgrade nginx WebSocketv5?sri=COZFmLr8Ys0v&v=11 /tournament/WIFzR85V/socket/v5 wss://[socket1.lichess.org](https://socket1.lichess.org/)/tournament/WIFzR85V/socket/v5?sri=COZFmLr8Ys0v&v=11 GET 101 websocket [socket1.lichess.org](https://socket1.lichess.org/) websocket 0 B 5.52 s upgrade nginx WebSocketv5?sri=Ff8qrmAMhix_&v=7 /tournament/WIFzR85V/socket/v5 wss://[socket1.lichess.org](https://socket1.lichess.org/)/tournament/WIFzR85V/socket/v5?sri=Ff8qrmAMhix_&v=7 GET 101 websocket [socket1.lichess.org](https://socket1.lichess.org/) websocket 0 B 13.24 s upgrade nginx WebSocketv5?sri=kysHtX_zn7at&v=17 /tournament/WIFzR85V/socket/v5 wss://[socket1.lichess.org](https://socket1.lichess.org/)/tournament/WIFzR85V/socket/v5?sri=kysHtX_zn7at&v=17 GET 101 websocket [socket1.lichess.org](https://socket1.lichess.org/) websocket 0 B 5.63 s upgrade nginx WebSocketv5?sri=n19sdQv0TIIX&v=10 /tournament/WIFzR85V/socket/v5 wss://[socket1.lichess.org](https://socket1.lichess.org/)/tournament/WIFzR85V/socket/v5?sri=n19sdQv0TIIX&v=10 GET 101 websocket [socket1.lichess.org](https://socket1.lichess.org/) websocket 0 B 6.10 s upgrade nginx WebSocketv5?sri=P2nYXG54CrIe&v=23 /tournament/WIFzR85V/socket/v5 wss://[socket1.lichess.org](https://socket1.lichess.org/)/tournament/WIFzR85V/socket/v5?sri=P2nYXG54CrIe&v=23 GET 101 websocket [socket1.lichess.org](https://socket1.lichess.org/) websocket 0 B 11.50 s upgrade nginx WebSocketv5?sri=QC5odHRfYcHc&v=11 /tournament/WIFzR85V/socket/v5 wss://[socket1.lichess.org](https://socket1.lichess.org/)/tournament/WIFzR85V/socket/v5?sri=QC5odHRfYcHc&v=11 GET 101 websocket [socket1.lichess.org](https://socket1.lichess.org/) websocket 0 B 12.06 s upgrade nginx WebSocketv5?sri=u9Jc6MsUgth4&v=13 /tournament/WIFzR85V/socket/v5 wss://[socket1.lichess.org](https://socket1.lichess.org/)/tournament/WIFzR85V/socket/v5?sri=u9Jc6MsUgth4&v=13 GET 101 websocket [socket1.lichess.org](https://socket1.lichess.org/) websocket 0 B 1.27 s upgrade nginx WebSocketv5?sri=W6eoM283P65J&v=23 /tournament/WIFzR85V/socket/v5 wss://socket1.lichess.org/tournament/WIFzR85V/socket/v5?sri=W6eoM283P65J&v=23 GET 101 websocket socket1.lichess.org websocket 0 B 8.59 s upgrade nginx WebSocketv5?sri=Wjr4t3SbfG_7&v=23 /tournament/WIFzR85V/socket/v5 wss://socket1.lichess.org/tournament/WIFzR85V/socket/v5?sri=Wjr4t3SbfG_7&v=23 GET 101 websocket socket1.lichess.org websocket 0 B 5.75 s upgrade nginx WebSocketv5?sri=xD9_HLZBXdvd&flag=tournament /socket/v5 wss://socket1.lichess.org/socket/v5?sri=xD9_HLZBXdvd&flag=tournament GET 101 websocket socket1.lichess.org websocket 0 B 1.05 s upgrade nginx WebSocketv5?sri=yyn05E1aiLLJ&v=20 /tournament/WIFzR85V/socket/v5 wss://socket1.lichess.org/tournament/WIFzR85V/socket/v5?sri=yyn05E1aiLLJ&v=20 GET 101 websocket socket1.lichess.org websocket 0 B 12.17 s upgrade nginx WebSocketv6?sri=68pkAmCEJvNf&v=0 /play/Q3Buj6WRPAyb/v6 wss://socket1.lichess.org/play/Q3Buj6WRPAyb/v6?sri=68pkAmCEJvNf&v=0 GET 101 websocket socket1.lichess.org websocket 0 B 47.43 s upgrade nginx WebSocketv6?sri=6jtBCCIvZ_FA&v=0 /play/RDSuYehNV2Ql/v6 wss://socket1.lichess.org/play/RDSuYehNV2Ql/v6?sri=6jtBCCIvZ_FA&v=0 GET 101 websocket socket1.lichess.org websocket 0 B 2.1 min upgrade nginx WebSocketv6?sri=9Ua6X_y8uf_n&v=0 /play/1mCYlL0zriwn/v6 wss://socket1.lichess.org/play/1mCYlL0zriwn/v6?sri=9Ua6X_y8uf_n&v=0 GET 101 websocket socket1.lichess.org websocket 0 B 46.02 s upgrade nginx WebSocketv6?sri=_qeSficwWqsK&v=0 /play/oz9XiRFQq898/v6 wss://socket1.lichess.org/play/oz9XiRFQq898/v6?sri=_qeSficwWqsK&v=0 Finished socket1.lichess.org websocket 0 B (unknown) WebSocketv6?sri=_qeSficwWqsK&v=0 /play/oz9XiRFQq898/v6 wss://[socket5.lichess.org](https://socket5.lichess.org/)/play/oz9XiRFQq898/v6?sri=_qeSficwWqsK&v=0 GET 101 websocket socket5.lichess.org websocket 0 B 2.2 min upgrade nginx WebSocketv6?sri=_XLeRQvaRRNI&v=0 /play/Yh0S1s5jEc1M/v6 wss://[socket1.lichess.org](https://socket1.lichess.org/)/play/Yh0S1s5jEc1M/v6?sri=_XLeRQvaRRNI&v=0 GET 101 websocket [socket1.lichess.org](https://socket1.lichess.org/) websocket 0 B 2.0 min upgrade nginx WebSocketv6?sri=b7j1OzgxJ0lJ&v=0 /play/weKduVYiOIr4/v6 wss://[socket1.lichess.org](https://socket1.lichess.org/)/play/weKduVYiOIr4/v6?sri=b7j1OzgxJ0lJ&v=0 GET 101 websocket socket1.lichess.org websocket 0 B 2.1 min upgrade nginx WebSocketv6?sri=EN4reiGByWp_&v=0 /play/f32V2syXU0oZ/v6 wss://socket1.lichess.org/play/f32V2syXU0oZ/v6?sri=EN4reiGByWp_&v=0 GET 101 websocket socket1.lichess.org websocket 0 B 1.4 min upgrade nginx

looks like those previous messages contain mainly chat presence, public presense in chat or tournament and etc. Next example message -got stalled for 8,5 seconds:

4
18:31:33.207
{"t":"startWatching","d":"aIfPXkp0"} 36
18:31:33.207
{"t":"reload"} 14
18:31:33.314
0 1
18:31:33.356
{"t":"fen","d":{"id":"aIfPXkp0","lm":"e7e6","fen":"r1bqkbnr/pp1p1ppp/2n1p3/2p5/4P3/1Q6/PPPP1PPP/RNB1KBNR w","wc":59,"bc":57}} 125
18:31:33.356
{"t":"fen","d":{"id":"aIfPXkp0","lm":"b3f3","fen":"r1bqkbnr/pp1p1ppp/2n1p3/2p5/4P3/5Q2/PPPP1PPP/RNB1KBNR b","wc":58,"bc":57}} 125
18:31:33.430
{"t":"crowd","d":{"nb":20,"users":["cadetblue","pirkarpenter","Shaurya1303","tom230","BuDDhInIAM","dimoss","ppermas_3_0","bozkurtfrkn","GenaBerezov","Diksha_Alluri","kudzu12","basile73","Basico2k","lestrand","Emmanuelv7","chumbichess","Manova-jeu","theindiansniper","Ishanth43","Gleb_6_let"],"anons":0}} 303
18:31:33.954
{"t":"fen","d":{"id":"aIfPXkp0","lm":"g7g6","fen":"r1bqkbnr/pp1p1p1p/2n1p1p1/2p5/4P3/5Q2/PPPP1PPP/RNB1KBNR w","wc":58,"bc":56}} 127
18:31:34.215
{"t":"reload"} 14
18:31:34.446
{"t":"fen","d":{

There are also some random errors like this:

lichess.min.js:36 WebSocket connection to 'wss://socket1.lichess.org/play/oz9XiRFQq898/v6?sri=_qeSficwWqsK&v=0' failed: WebSocket is closed before the connection is established.
I.disconnect @ lichess.min.js:36
I.destroy @ lichess.min.js:36
I.connect @ lichess.min.js:36
(anonymous) @ lichess.min.js:36
setTimeout (async)
I.scheduleConnect @ lichess.min.js:36
I.connect @ lichess.min.js:36
I @ lichess.min.js:36
q @ round.min.js:1
(anonymous) @ oz9XiRFQ:9
Promise.then (async)
(anonymous) @ oz9XiRFQ:9

And some cosmetic bug -in every page load:

Error with Permissions-Policy header: Origin trial controlled feature not enabled: 'interest-cohort'.

22lichess.min.js:33 Uncaught TypeError: Cannot read properties of undefined (reading 'cancel')
at HTMLDivElement.mouseenter (lichess.min.js:33:3639)
at HTMLDivElement.ya (lichess.min.js:21:306)
mouseenter @ lichess.min.js:33
ya @ lichess.min.js:21
22lichess.min.js:33 Uncaught TypeError: Cannot read properties of undefined (reading 'hide')
at HTMLDivElement.mouseleave (lichess.min.js:33:3706)
at HTMLDivElement.ya (lichess.min.js:21:306)
mouseleave @ lichess.min.js:33
ya @ lichess.min.js:21
Error with Permissions-Policy header: Origin trial controlled feature not enabled: 'interest-cohort'.
Kudzu12 commented 1 year ago

Here i can provide several client side logs .har format, you can use them via browser developer tools feature -import to see all saved internals. Logs are created by Firefox and Vivaldi (chromium based) You can see dates by file creating time and usually there is also included time played in EET+2 time format in file name itself.

I try to provide those more problematic ones. Allalaadimised.zip more har files.zip

If looking inside logs, you can consider my ping very stable 48-49 milliseconds. No packet-loss or whatsoever lagging against ISP and lichess.org itself while those tournaments are recorded

Everything else (media, fonts, board items, audio, java-script) looks like running very fast micro and milliseconds. Fetches and WS (websocket), its connection upgrade process are most problematic areas.

Some ideas i currently have:

i do not think, it's software. Too random are those socket stalls, drops. I can imagine something like that happening on overload or congestion of network, something that congests cache, CPU for short periods.

There is need to see what happens on and between lichess virtual servers those times.

I do not know, if they did change anything on server level yesterday or today, but today most tournaments i was present ran almost perfectly without hiccups whatsoever.

If any questions or need something more specific data, please ask.

Kudzu12 commented 1 year ago

Some thoughts, do not know if this is right place to post them or who may be interested if anyone at all.

For being successful in more complicated issue tracking, specially those like socket errors, happening possibly in server software level or on database interactions level and/or networking stack congestion's should be taken into account also on par with servers overall load monitoring, with timestamps if needed.

So, if servers are in located OVH hosted environment, do anybody from server administering persons or developing teems have access to virtual machinery's load monitoring? Usually hosting providers are not too keen to make this kind access directly available. If direct monitoring is possible, it would be easier to track down client-side events regarding to server internals.

If OVH is not allowing access to servers performance monitoring or graphs, then there are possibility to add something like Nagios system or equivalent monitoring on hosting system by admins.

Another way to achieve such task is to run some simple scripts like atop or similar, paying attention to CPU, Memory and filesystem IO wait and stats and create alerts by findings.

As for client-side, isn't it possible to make some small add-on script with on/off "knob"t o client interface, who will volunteer in client-side monitoring, activating performance analyzer or network-tools logging then by script in their browsers and automatic sending of logs after time intervals. There may be also some trigger like if socket is lost closed or something, then log and send it with timestamp.

So, just ideas... maybe that wheel is allready invented and i just do not know it :)

schlawg commented 1 year ago

As for client-side, isn't it possible to make some small add-on script with on/off "knob"t o client interface, who will volunteer in client-side monitoring, activating performance analyzer or network-tools logging then by script in their browsers and automatic sending of logs after time intervals. There may be also some trigger like if socket is lost closed or something, then log and send it with timestamp.

That is very much the goal of #11991. We would start small with lichess log statement persistence and collection but once that is working, data from any number of analysis tools can be added if they opt in. Regarding event triggers such as unexpected socket errors, the current plan is to give users the option to review everything they'd be sending at the point of submission. This is mainly to reinforce that we're not harvesting their grocery lists and christmas card labels. Implementations may always be revised, but this is my current plan.

Kudzu12 commented 1 year ago

.... Regarding event triggers such as unexpected socket errors, the current plan is to give users the option to review everything they'd be sending at the point of submission. This is mainly to reinforce that we're not harvesting their grocery lists and christmas card labels. Implementations may always be revised, but this is my current plan.

Last one shouldn't be too big deal. I just tried as experiment both with Mozilla and Vivaldi browser - network tools log runs by default only in boundaries of one tab it was initially started on, so browsing on other pages does not affect log collected as long network analyzer is not started on that tab too.

But, for security reasons or for paranoid people - overview of sent material may be safer if they commit log sending by themselves. That of cause limits sending process automation.

schlawg commented 9 months ago

Hey all, we've got the client log delivery mechanism in place (finally). Can anyone still reproduce this though?

Kudzu12 commented 9 months ago

I do not see implementation on lichess clients web interface. So how could one invoke this log delivery?

schlawg commented 9 months ago

https://lichess.org/#debug but we are very sparse with logging there. Log statements need to be tailored to each issue. Currently we have statements in place for general lila-ws troubles but nothing centered around pairing. That's why I wanted to reach out and see if you guys could still reproduce this specific issue. I could add some statements and try to gather us useful data after the next asset deployment.

Kudzu12 commented 9 months ago

Hi there.

For me it was never really pairing or lag issue. I will touch few ideas here, but there is big time since... i really payed attention. And for most of the time (last months), server performance has got much better.

As i do not see server workloads in realtime or in logs, it's a guess, based on my own experience with servers.

It's propably java VM problem. I have seen similar thing on SAP enterprice production services. On electronic factory production, we never found total final solution for java issues, as production server was needed to be restarted after some 2-3 days, even when java VM parameters were fine tuned. And when it (java VM) got to it's limits - it slowed down all server processes by overhouling servers farm memory. From that point - restart and everything was fine till next time.

I suggest you to compare perfomance on basis of clients gateway to server. Mobile platform does much better, for whatever reason it does. Looked from here (and yeah by - guessing) and measuring performance with web developer tools, it was mostly websockets timeouts, closes and stalls inside server farm.

As answer, today there was no longer websocket times than 1,3 second. No socket drops yet. Issue, if still there, may be daytime dependant.

We will see in close furure. Unfortunately, i can not measure things on the evenings.


Saatja: Jonathan Gamble @.> Saadetud: esmaspäev, 22. jaanuar 2024 12:02 Adressaat: lichess-org/lila @.> Koopia: Kudzu12 @.>; Comment @.> Teema: Re: [lichess-org/lila] websocket disconnects after pairing (Issue #11989)

https://lichess.org/#debug but we are very sparse with logging there. Log statements need to be tailored to each issue. Currently we have statements in place for general lila-ws troubles but nothing centered around pairing. That's why I wanted to reach out and see if you guys could still reproduce this specific issue. I could add some statements and try to gather us useful data after the next asset deployment.

— Reply to this email directly, view it on GitHubhttps://github.com/lichess-org/lila/issues/11989#issuecomment-1903642810, or unsubscribehttps://github.com/notifications/unsubscribe-auth/A4SSC2BXOMRBROASUIRBYRTYPY2JTAVCNFSM6AAAAAASSZICMWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMBTGY2DEOBRGA. You are receiving this because you commented.

schlawg commented 9 months ago

So you experience poor websockets performance/uptime in general? There is an alternate IP with different routing we can try. PM me on lichess if you're interested.