Closed: patrik-cihal closed this issue 7 months ago.
There was one set of performance improvements that touched the internals that deal with input/output. I remember I checked some of them (the `tokio-tungstenite` part); the rest were verified by the original author of `tungstenite`. The changes were quite large, so before we invest much time in the investigation, I'd like to make sure that the problem exists and that it's not a bug in user code (95% of reported "bugs" that we get are bugs in user code, not in `tungstenite`).
Would you provide a simple example (or examples) that reproduces/demonstrates the issue? That would help us verify the code and see if/how the bug is triggered.
PS: Also, would you mind explaining what you mean by "it always got stuck and they stop communicating"? What exactly gets stuck? Sending? Receiving?
I have two versions that are identical except for the `tokio-tungstenite` version; one works, the other doesn't. I don't see how this could be my fault.

This is the client part: https://github.com/stellar-bit/client-basic/blob/main/src/network/desktop_network_client.rs
This is the server part: https://github.com/stellar-bit/server-template/blob/main/src/client_handle.rs

The await never completes at roughly the 120th time the client flushes a message, even though the server is still receiving at that point.
Can you please `git bisect`?
> I have two versions that are identical except for the `tokio-tungstenite` version; one works, the other doesn't. I don't see how this could be my fault.
It may easily be a classical race condition. The changes I mentioned in my previous comment did affect the library's performance in a positive way. That's why I'm asking for more information to understand if that's the case.
> This is the client part: https://github.com/stellar-bit/client-basic/blob/main/src/network/desktop_network_client.rs
> This is the server part: https://github.com/stellar-bit/server-template/blob/main/src/client_handle.rs
>
> The await never completes at roughly the 120th time the client flushes a message, even though the server is still receiving at that point.
I'm sorry, but this does not answer my question: what exactly "gets stuck"? Is it reading (`await` never returns)? Is it sending? What exactly hangs?
I briefly checked the client part, and I still don't know what you mean by "it never awaits"; my only assumption is that you refer to the fact that `ws_receiver.next().await` never gets called after processing one of the messages. If that's what you're saying, then I very much believe it, but that would indicate a bug in the client code. To be completely honest, the code is tough to read, and I see several red flags: for example, async code mixed with non-async CPU-bound or I/O-bound code, and many non-async synchronization primitives from the standard library; there are many places that would lock up the Tokio runtime if the acquisition/release does not happen in a timely manner.

To put it short, I'd be glad to help, but I need more information, specifically example code showing what hangs and answers to my original questions. Now, after seeing the code, I'm more inclined to believe it's a race in the user code (this does not exclude the possibility of a bug in tungstenite, but I have not seen anything that would substantiate the claims).
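As a general illustration of the kind of red flag mentioned above (made-up code, not taken from the project), holding a `std::sync::Mutex` guard across an `.await` point is one way non-async synchronization primitives can stall a Tokio worker thread:

```rust
use std::sync::{Arc, Mutex};
use std::time::Duration;

// Any other task that calls state.lock() while this task is parked in the
// sleep will block its worker thread on the non-async mutex.
async fn update_state(state: Arc<Mutex<Vec<u8>>>) {
    let mut guard = state.lock().unwrap();                // std (blocking) lock
    tokio::time::sleep(Duration::from_millis(10)).await;  // guard held across .await
    guard.push(1);
    // Prefer tokio::sync::Mutex here, or drop the guard before awaiting.
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let state = Arc::new(Mutex::new(Vec::new()));
    update_state(state).await;
}
```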
Hi! Sorry that I'm not putting much effort into this. I gave you the information that `tokio-tungstenite` 0.19 magically fixes an error that just so happens to occur at around the 120th request between client and server. More specifically, I'm 'awaiting' the flushing of messages in the client, and that await never finishes, even though the server should be receiving at that time.

I honestly don't need your help. I'm fine with using 0.19.

Additionally, it is really suspicious to me that it always gets stuck at around the 120th request; I don't see anything in my code that could correlate with this.

So from my observations I deduced that it is more likely to be an error in the tungstenite crate. If you think it is more likely an error on my part, you are free to ignore this.
Around 120? 128 maybe? That sounds very much like you're running into a side effect of https://tokio.rs/blog/2020-04-preemption
If that happens to be the case, in summary: you're blocking a tokio runtime thread somewhere. That would also match with @daniel-abramov's comment a bit earlier, and would mean that you were just lucky in the past.
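For context, a bare-bones illustration of what "blocking a tokio runtime thread" means (hypothetical code, not from the project): a task that never reaches an `.await` keeps its worker thread to itself, and whatever else is scheduled on that thread cannot make progress.

```rust
use std::time::Duration;

#[tokio::main]
async fn main() {
    let blocker = tokio::spawn(async {
        // Synchronous work with no .await in between: the worker thread that
        // polls this task is occupied for the whole duration and cannot poll
        // anything else scheduled on it.
        for _ in 0..10 {
            std::thread::sleep(Duration::from_millis(100));
        }
    });

    // Tasks like the websocket read/write loops may or may not stall here,
    // depending on which worker thread they land on, i.e. scheduling luck.
    blocker.await.unwrap();
}
```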
> Around 120? 128 maybe? That sounds very much like you're running into a side effect of https://tokio.rs/blog/2020-04-preemption
>
> If that happens to be the case, in summary: you're blocking a tokio runtime thread somewhere. That would also match with @daniel-abramov's comment a bit earlier, and would mean that you were just lucky in the past.
I see. Weird that 0.19 doesn't have this issue even at 50x the amount of requests.
I'll post here if that fixes the issue.
I dug deeper and with `tokio-console` I discovered this. The `receive_task` in all `tokio-tungstenite` versions is reporting that it is idle even though `ws_receiver.next().await` clearly has processed the next message:
```rust
let ws_stream = connect_async(server_addr)
    .await
    .map_err(|_e| NetworkError::WebsocketTrouble)?
    .0;

let (ws_sender, mut ws_receiver) = ws_stream.split();
let (sync_response_sender, sync_response_receiver) = mpsc::channel();

let receive_task = {
    tokio::task::spawn(async move {
        while let Some(msg) = ws_receiver.next().await {
            println!("Clearly not idle {:?}", msg);
        }
    })
};
```
Then, in the newer versions, once it has been polled approximately 123 times, the polling frequency skyrockets and the program crashes.

In the older version 0.19 this doesn't happen and the polling continues at roughly the same speed; the task, however, still reports as if it were idle...

I am confused as to what could be the issue.

The snippet is from: https://github.com/stellar-bit/client-basic/blob/main/src/network/desktop_network_client.rs.
Screenshot for reference:
Alright, it gets stuck exactly after 127 messages have been sent between the client and server.
> I dug deeper and with `tokio-console` I discovered this. The `receive_task` in all `tokio-tungstenite` versions is reporting that it is idle even though `ws_receiver.next().await` clearly has processed the next message:

This doesn't seem to be relevant, as it gets stuck even without the `receive_task`.
Tungstenite does not count messages. If it is after exactly 127 messages, it must be somewhere else. Do you have any `i8` somewhere? (127 is `i8::MAX`.)
0.19 allows me to send infinitely many messages; with 0.20 and 0.21, flushing gets stuck around the 128th message and never finishes.

Btw I tried recreating it in isolation but failed; maybe it is something more specific to my project, but I don't know exactly what.

One more thing I noticed is that after the 127th message, the polling rate of the receive task in the tokio task scheduler increases rapidly, in all versions.
Things getting stuck after exactly 128 operations with tokio is, with 99% certainty, the problem I mentioned earlier. I've had to debug that quite often in other people's code by now.

Make a minimal testcase for your problem. Remove whatever is unnecessary from your current code until the problem disappears or you have something very minimal. That should make it quite clear what happens here.
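For reference, a reduced testcase could be stripped down to roughly this shape (assumptions: a websocket server listening at `ws://127.0.0.1:8080`, `tokio-tungstenite` 0.21, and the `futures-util` crate for `StreamExt`/`SinkExt`). If the hang still reproduces with something this small, it points at the library; if it only appears once the game-engine code is added back, it points at that code.

```rust
use futures_util::{SinkExt, StreamExt};
use tokio_tungstenite::{connect_async, tungstenite::Message};

#[tokio::main]
async fn main() {
    let (ws_stream, _) = connect_async("ws://127.0.0.1:8080")
        .await
        .expect("failed to connect");
    let (mut sender, mut receiver) = ws_stream.split();

    // Drain incoming messages in the background.
    tokio::spawn(async move {
        while let Some(msg) = receiver.next().await {
            println!("received: {:?}", msg);
        }
    });

    // Send well past the ~128 mark to see whether a send/flush ever stalls.
    for i in 0..1000u32 {
        sender
            .send(Message::Text(format!("message {i}")))
            .await
            .expect("send failed");
        println!("sent {i}");
    }
}
```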
Why would that be a problem only with the newer versions of tungstenite? I have tried recreating it, deleting everything, even the receive task, but as long as I'm sending messages the error persists.

I already spent 6 hours on this and would rather just use 0.19 than investigate further...
What's the most minimal testcase you came up with that reproduces this problem? The one you linked in https://github.com/snapview/tokio-tungstenite/issues/324#issuecomment-2025475125 is a lot of unrelated code, including the whole stellar-bit-core crate.
This one is a pretty minimal client test: https://github.com/stellar-bit/client-network-test. The server part is this: https://github.com/stellar-bit/server-template; the server shouldn't be the issue, however.
Both functions you call from your `main` are blocking the tokio runtime that is wrapped around `main` via the `tokio::main` attribute macro. That's most likely the cause here. Don't do that, e.g. make both functions `async` and just `await` inside them instead of using `futures::executor::block_on()`.
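A minimal before/after sketch of that suggestion (all function names made up; the `futures` crate is assumed for `block_on`):

```rust
use std::time::Duration;

// Stand-in for whatever the real code awaits when flushing a message.
async fn flush_message() {
    tokio::time::sleep(Duration::from_millis(1)).await;
}

// Before: a "synchronous" wrapper that drives the future with a nested
// executor. Called from inside #[tokio::main], this parks a worker thread.
#[allow(dead_code)]
fn flush_message_blocking() {
    futures::executor::block_on(flush_message());
}

// After: stay async all the way down and just .await.
#[tokio::main]
async fn main() {
    flush_message().await;
}
```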
The function must be synchronous because of the way the game engine is designed. I tried using `futures::executor::block_on()` so that I could still keep it synchronous... Is there a better way, then?

Btw I've just noticed that adding `tokio::time::sleep().await` after the print also fixes the issue.
> Btw I've just noticed that adding `tokio::time::sleep().await` after the print also fixes the issue.

Not in my main project, however.
> The function must be synchronous because of the way the game engine is designed. I tried using `futures::executor::block_on()` so that I could still keep it synchronous... Is there a better way, then?
That's a problem then :)
Don't use `#[tokio::main]` around your main function, or spawn all this onto a different thread where blocking is allowed (via `std::thread::spawn()` or `tokio::task::spawn_blocking()`, for example).
That won't work though, because the futures you're using there depend on running inside the tokio runtime (`tokio-tungstenite` does because it uses tokio), so you have some kind of circular problem here: you must call this code from inside the tokio runtime, but you also require it to actually block, which is forbidden inside the tokio runtime (or any async runtime, for that matter).

In your case it's even worse, because you're currently running a second, minimal runtime inside the tokio runtime (via `futures::executor::block_on()`).

You'll have to design your code differently one way or another. If your game engine requires synchronous, blocking code, then you could for example run that on a completely different thread, have the network I/O run on the tokio runtime, and communicate between the game engine thread and the tokio runtime via channels.
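A sketch of that design (channel and task names are illustrative, not from the project): the game loop stays synchronous on its own thread and exchanges messages with the tokio side through channels, so nothing ever blocks a runtime worker.

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Outgoing messages: game thread -> network task (non-blocking sync send).
    let (to_net_tx, mut to_net_rx) = mpsc::unbounded_channel::<String>();
    // Incoming messages: network task -> game thread (blocking recv is fine there).
    let (to_game_tx, to_game_rx) = std::sync::mpsc::channel::<String>();

    // Network task: would own the websocket; it never blocks.
    tokio::spawn(async move {
        while let Some(outgoing) = to_net_rx.recv().await {
            // ws_sender.send(Message::Text(outgoing)).await would go here.
            let _ = to_game_tx.send(format!("echo of {outgoing}"));
        }
    });

    // Game engine thread: synchronous and allowed to block.
    let game = std::thread::spawn(move || {
        to_net_tx.send("hello".to_string()).ok();
        // Blocking recv is fine here: this is not a tokio worker thread.
        if let Ok(incoming) = to_game_rx.recv() {
            println!("game thread got: {incoming}");
        }
    });

    // Joining a thread blocks, so do it on the blocking pool.
    let _ = tokio::task::spawn_blocking(move || game.join()).await;
}
```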
> Btw I've just noticed that adding `tokio::time::sleep().await` after the print also fixes the issue.

When/how/if this happens is all a bit of luck, depending on the surrounding code.
If you're using synchronous blocking functions within Tokio, it won't work correctly. Use `await` if you want to block. If your synchronous functions don't support this, use a separate thread.

Tokio and Tungstenite can't receive anything while blocked. Synchronous functions aren't just blocking a single function, they block everything. Tungstenite loses messages simply because at the time a message arrives there is no Tungstenite code running.

There is a major difference between blocking and `await`. The latter does not block! `await` effectively means "do something else while waiting", and that's when Tungstenite does its job. If the code doesn't spend enough time in `await`s, especially if there is no `await` at all, Tokio can't work correctly. It is cooperative multitasking; cooperative means "tasks have to be nice to each other", and "being nice" is `await` in Rust. Synchronous blocking functions are just too greedy.
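To illustrate the "use a separate thread" part (hypothetical example): `tokio::task::spawn_blocking` hands the synchronous work to a thread pool that is allowed to block, and the async caller just awaits the result.

```rust
use std::time::Duration;

#[tokio::main]
async fn main() {
    let result = tokio::task::spawn_blocking(|| {
        // Blocking or CPU-bound synchronous code is fine here: this closure
        // runs on tokio's dedicated blocking pool, not on an async worker.
        std::thread::sleep(Duration::from_millis(50));
        42
    })
    .await
    .expect("blocking task panicked");

    println!("got {result}");
}
```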
I know; I just wasn't careful with using the `futures::executor::block_on` function. Mainly, I wasn't aware that it was still running inside the tokio runtime, since the function itself was 'synchronous'. The fact that this error appeared only with tokio-tungstenite versions newer than 0.19 didn't help.

Thanks for the time, everyone.

Btw I ended up fixing it by creating the tokio runtime separately, so that only the things I intend to run inside it actually do.
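For completeness, a rough sketch of what that fix can look like (hypothetical structure, not the actual project code): build the runtime explicitly, keep the synchronous game loop on the plain main thread, and hand only the networking work to the runtime.

```rust
use std::time::Duration;

fn main() {
    // Explicit runtime: nothing runs inside it unless we put it there.
    let rt = tokio::runtime::Runtime::new().expect("failed to build runtime");

    // Network I/O lives on the runtime's worker threads.
    rt.spawn(async {
        // connect_async(..), the read/write loops, etc. would go here.
    });

    // The synchronous game loop stays outside the runtime, so blocking here is fine.
    for _frame in 0..3 {
        std::thread::sleep(Duration::from_millis(16));
        // When a network result is needed synchronously, rt.block_on(..) is
        // safe from this thread because it is not a runtime worker.
    }
}
```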
2 hours wasted on an error that randomly appears in versions 0.20 and 0.21. I'm sending a lot of requests between client and server, and around the 120th request it always got stuck and they stop communicating.

Absolutely random error.

I look at my old client that worked and hmm, it is using the 0.19 version. I try it and it works! 2 hours trashed, yay.