snapview / tokio-tungstenite

Future-based Tungstenite for Tokio. Lightweight stream-based WebSocket implementation
MIT License

EXC_BAD_ACCESS on MacOS in StartedHandshakeFuture #346

Closed (tmpfs closed 2 months ago)

tmpfs commented 3 months ago

Hi,

I am running into a bad memory access (EXC_BAD_ACCESS) error on macOS and wonder if you have any ideas about the root cause. I am using tokio-tungstenite@0.23.1 on the client (the server is using 0.21.0, but I doubt that is relevant).

The relevant line from the crash log stack trace is:

0   sos_bindings                           0x10da64f44 _$LT$tokio_tungstenite..handshake..StartedHandshakeFuture$LT$F$C$S$GT$$u20$as$u20$core..future..future..Future$GT$::poll::h7b86609b850870e1 + 24 (handshake.rs:142)

I have a few of these crash logs now and attach one for reference. In all cases the code appears to be trying to access memory in the stack guard, as documented here.

Some useful context is that my code implements automatic reconnect logic (with exponential backoff - you can see the calls to WebSocketChangeListener::delay_connect in the stack trace) and to achieve that I am using the async_recursion crate.

Any pointers or ideas would be much appreciated 🙏

sos-crashlog-21-08-2024.txt

daniel-abramov commented 2 months ago

So let me start by saying that the error does not originate from tokio-tungstenite; we don't even have unsafe code that could have triggered this behavior within the library.

While I did not have time to go through the details of the crash report, I noticed that the crashing thread had a suspiciously large number of calls to WebSocketChangeListener in its call stack, indicating that it might have been called recursively. I'd suggest checking this part: it may contain endless recursion or something similar that indirectly leads to the crash you mentioned. This is my primary hypothesis so far. I would be cautious with async_recursion (chances are you don't really need it for your use case, and incorrect usage may cause problems).

I hope this helps with further investigations!
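
For illustration, a reconnect with exponential backoff can usually be written as a plain loop, so a failed attempt never calls into the next one and the call stack stays flat however many retries happen. The sketch below is not code from this issue or from tokio-tungstenite: `backoff_delay`, `connect_with_retry`, the 500 ms base and the 30 s cap are all hypothetical, and it is synchronous and std-only to stay self-contained. In async code the same shape should apply, with the `sleep` callback replaced by something like `tokio::time::sleep(..).await`.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): `base * 2^attempt`, capped at `max`.
/// Hypothetical helper, not part of tokio-tungstenite.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // checked_pow and saturating_mul keep large attempt counts from overflowing.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.saturating_mul(factor).min(max)
}

/// Calls `connect` in a loop until it succeeds or `max_attempts` have failed.
/// `sleep` is passed in so the caller decides how to wait between attempts.
fn connect_with_retry<T, E>(
    mut connect: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
    max_attempts: u32,
) -> Result<T, E> {
    let base = Duration::from_millis(500); // assumed base delay
    let max = Duration::from_secs(30); // assumed upper bound
    let mut attempt = 0;
    loop {
        match connect() {
            Ok(conn) => return Ok(conn),
            // Out of attempts: hand the last error back to the caller.
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                // Wait, then go around the loop again; no recursion involved.
                sleep(backoff_delay(attempt, base, max));
                attempt += 1;
            }
        }
    }
}
```

Passing `std::thread::sleep` as the `sleep` argument gives a blocking version; taking the wait as a parameter also makes the backoff schedule easy to check in a test.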

tmpfs commented 2 months ago

Thanks @daniel-abramov for taking a look. I had come to the same conclusion that the recursion is the issue. I will refactor to a supervisor task and send disconnect events over a channel, so that I can avoid the recursion. Thanks for clarifying that there is no unsafe code that would be triggering this, that's useful.
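
A rough sketch of what such a supervisor could look like, assuming each connection reports its own disconnect over a channel and a single loop decides whether to start the next one. This is not the actual refactor from this issue: `Event`, `run_connection` and `supervise` are hypothetical names, and it uses `std::thread` with `std::sync::mpsc` so that it runs without extra crates. With Tokio the equivalent pieces would presumably be `tokio::spawn` and `tokio::sync::mpsc`.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;

/// Event a connection worker sends when its connection ends (hypothetical type).
#[derive(Debug)]
enum Event {
    Disconnected { id: u32 },
}

/// Stand-in for one connection's lifetime: it "runs", then reports the disconnect.
fn run_connection(id: u32, events: Sender<Event>) {
    // A real worker would drive the WebSocket here until it closes or errors.
    let _ = events.send(Event::Disconnected { id });
}

/// Starts connection 0 and restarts after each disconnect, at most `max_restarts` times.
/// Returns the ids of every connection that was started, in order.
fn supervise(max_restarts: u32) -> Vec<u32> {
    let (tx, rx) = mpsc::channel();
    let mut started = vec![0];

    let events = tx.clone();
    thread::spawn(move || run_connection(0, events));

    // The loop, not the worker, decides whether to reconnect, so no call ever nests.
    while let Ok(Event::Disconnected { id }) = rx.recv() {
        if id >= max_restarts {
            break; // restart budget used up
        }
        // A real supervisor would wait for its backoff delay here.
        let next = id + 1;
        started.push(next);
        let events = tx.clone();
        thread::spawn(move || run_connection(next, events));
    }
    started
}
```

Because every worker returns before the next one is spawned, stack depth does not depend on how many reconnects have happened.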