machinezone / IXWebSocket

websocket and http client and server library, with TLS support and very few dependencies
BSD 3-Clause "New" or "Revised" License

Memory leak in proxy code #490

Open martyharris opened 8 months ago

martyharris commented 8 months ago

Hi,

Firstly, thanks for your great library.

After building a web socket proxy based on your IXWebSocketProxyServer.cpp code, I believe I have found a memory leak. If you add a print statement to the ~ProxyConnectionState() destructor you will see that it is never called. I believe this is the result of capturing the shared pointer to this derived state in the lambda functions, which keeps the reference count from ever reaching zero.
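
A minimal sketch (my own names, not the library's actual proxy code) of how I believe the cycle forms: the state created by the connection factory owns the upstream WebSocket, and the callback installed on that WebSocket captures a shared_ptr back to the state.

#include <ixwebsocket/IXWebSocket.h>
#include <memory>

struct ProxyConnectionState // stand-in for the factory-created state
{
    ix::WebSocket upstreamWs; // upstream connection owned by the state
    ~ProxyConnectionState() { /* never runs while the cycle exists */ }
};

void wireUpstream(std::shared_ptr<ProxyConnectionState> state)
{
    // The lambda stores a copy of the shared_ptr inside the WebSocket's callback.
    // Since that WebSocket is a member of the state, the state indirectly keeps
    // itself alive: state -> upstreamWs -> callback -> state.
    state->upstreamWs.setOnMessageCallback(
        [state](const ix::WebSocketMessagePtr& msg)
        {
            // forward msg to the downstream side ...
        });
}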

In my code I was therefore forced to 'uncapture' the state in the Close sections, by setting new empty handlers using:

setOnMessageCallback([](const ix::WebSocketMessagePtr&) { });

This is unsatisfying, as swapping a std::function at runtime is neither atomic nor thread-safe, and I do not know the best moment to do it. I have encountered 'bad function call' errors and other strange faults (on an older version of the library). I have also tried capturing only weak pointers to avoid cycles, but the result is still not perfect.

What I also do not understand is the undocumented state->setTerminated() used instead of close(). It seems a close code cannot be passed this way.

Could close() itself be a problem, as a close event on the downstream side of the proxy uses close() for the upstream websocket? If this is non-blocking, would there be enough time to close the upstream before the downstream socket is destructed? In general, I fail to grasp which parts of the library are blocking, non-blocking, buffered or unbuffered and in which thread it is safe to call close(), stop(), state->setTerminated() or the destructor ~WebSocket.

Thanks in advance. If there is anything I can test for you, I will be glad to assist.

bsergean commented 7 months ago

Thanks for the good words.

Did you check if you can use server.setOnClientMessageCallback ?

iirc I had created that to avoid a memory leak.
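
From memory it looks roughly like this (double-check the exact signature against IXWebSocketServer.h, I may be off on a detail):

#include <ixwebsocket/IXWebSocketServer.h>

int main()
{
    ix::WebSocketServer server(8080);

    // One callback for all clients; the per-client WebSocket comes in by
    // reference, so nothing has to capture a shared_ptr to it.
    server.setOnClientMessageCallback(
        [](std::shared_ptr<ix::ConnectionState> connectionState,
           ix::WebSocket& webSocket,
           const ix::WebSocketMessagePtr& msg)
        {
            if (msg->type == ix::WebSocketMessageType::Message)
            {
                webSocket.send(msg->str); // echo, just to show the shape
            }
        });

    server.listen();
    server.start();
    server.wait();
    return 0;
}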

bsergean commented 7 months ago

ps: I think I had tried to use setOnClientMessageCallback in that proxy code, but it was not working ...

martyharris commented 7 months ago

I chose the old API for the following reason: I required a weak_ptr to the WebSocket instead of a reference, so that I could asynchronously send periodic messages to the client outside the callback, in another thread. Previously, when storing a copy of the reference, the WebSocket would (of course) suddenly disappear due to a Close event, and there was no good way to mutex lock this between the threads*. The weak_ptr to the WebSocket solved my problem, since it can be checked before use.
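
To illustrate, my periodic sender thread looks roughly like this (the names are mine; the weak_ptr is the one the old connection callback hands out):

#include <ixwebsocket/IXWebSocket.h>
#include <atomic>
#include <chrono>
#include <memory>
#include <thread>

// Sketch of the periodic sender; clientWs is the weak_ptr received from the
// old-style connection callback and stored for this client.
void periodicSender(std::weak_ptr<ix::WebSocket> clientWs, std::atomic<bool>& running)
{
    while (running)
    {
        std::this_thread::sleep_for(std::chrono::seconds(5));

        // lock() tells us whether the WebSocket still exists; a stored
        // reference could dangle after a Close event.
        if (auto ws = clientWs.lock())
        {
            ws->sendBinary("periodic payload");
        }
        else
        {
            break; // the connection is gone, stop sending
        }
    }
}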

Next is the problem of the 'upstream' and 'downstream' WebSockets capturing the connection state generated by the factory. I have double-checked: capturing the state, where the state contains the WebSocket for the upstream, produces a memory leak; the state is never destructed. Again I solved this by first making a weak_ptr to the connection state and capturing that in both callback lambdas. The state is now destructible, as there are no longer any cycles.
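
In code the capture now looks roughly like this (simplified, same stand-in state as in my first comment):

#include <ixwebsocket/IXWebSocket.h>
#include <memory>

struct ProxyConnectionState // stand-in for the factory-created state
{
    ix::WebSocket upstreamWs;
};

void wireUpstream(std::shared_ptr<ProxyConnectionState> state)
{
    // Capture only a weak_ptr so the callback no longer extends the state's lifetime.
    std::weak_ptr<ProxyConnectionState> weakState = state;

    state->upstreamWs.setOnMessageCallback(
        [weakState](const ix::WebSocketMessagePtr& msg)
        {
            auto s = weakState.lock();
            if (!s) return; // state already destroyed, drop the message

            // forward msg to the downstream side ...
        });
}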

Unfortunately I was now presented with frequent crashes due to std::system_error 'Resource deadlock avoided'. Digging further, I traced the problem: since the connection state was now properly destructed (unlike in your proxy server), the WebSocket member contained in the state was also destructed, and ~WebSocket() joins its worker thread with the caller through stop(). In the stop() code I found this pattern:

if (_thread.joinable()) _thread.join();

It took me at least ten reads of the std::thread docs on joinable() to realise that this pattern does not prevent you from joining yourself. joinable() is true if the thread object owns a thread that has been started and not yet joined or detached, nothing more. It does not check whether you are calling from the very thread you are about to join. You might have misread the docs just as I did. The check performed by the function is:

get_id() != std::thread::id()

This reads as if two threads' IDs are being compared, but in fact the member thread's own id is only compared against the default-constructed id that represents a non-running thread. In other words, the check only tells you whether the thread is running or has run, not whether the caller is that thread.
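
A check that would actually prevent the self-join has to compare against the calling thread, something along these lines (a sketch, not the library's current code):

#include <thread>

// Sketch: only join if the worker thread is running AND we are not that thread.
void safeJoin(std::thread& worker)
{
    if (worker.joinable() && worker.get_id() != std::this_thread::get_id())
    {
        worker.join();
    }
    // If we *are* the worker thread, join() would throw std::system_error
    // ("Resource deadlock avoided"); the thread then has to be reaped from
    // somewhere else, e.g. another thread or detach().
}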

The problem is that, with a shared_ptr to the connection state, whichever thread drops the use count to zero is the one that runs the destructor. I found that sometimes the upstream websocket's thread was the last to do so, which triggered 'Resource deadlock avoided' because it would then join itself.

My current work-around is to introduce a new garbage collector thread that also holds a reference to each connection state in a std::list. This thread periodically checks whether only one reference to a state is left, in which case it destructs the connection by erasing it from the list. This always works because it is a different thread.
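
Sketched out, the garbage collector looks roughly like this (simplified; names are mine and the real version also handles shutdown):

#include <atomic>
#include <chrono>
#include <iterator>
#include <list>
#include <memory>
#include <mutex>
#include <thread>

// The reaper keeps its own shared_ptr to every connection state. When its copy
// is the last one left (use_count() == 1), it erases the entry, so the state's
// destructor always runs on this thread and never on a WebSocket worker thread.
template <typename State>
class ConnectionReaper
{
public:
    void add(std::shared_ptr<State> state)
    {
        std::lock_guard<std::mutex> lock(_mutex);
        _states.push_back(std::move(state));
    }

    void run(std::atomic<bool>& stop)
    {
        while (!stop)
        {
            {
                std::lock_guard<std::mutex> lock(_mutex);
                for (auto it = _states.begin(); it != _states.end();)
                {
                    // Note: use_count() is only reliable here because the other
                    // references are only ever dropped once the callbacks finish.
                    it = (it->use_count() == 1) ? _states.erase(it) : std::next(it);
                }
            }
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    }

private:
    std::mutex _mutex;
    std::list<std::shared_ptr<State>> _states;
};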

I would like to hear from you to learn if there is a more elegant solution.

*) My attempt to have the Close event wait for a sendBinary() in progress in another thread was complicated when I found out that a failing sendBinary() can also invoke Close directly from the same call stack/thread, which makes the locking very ugly if yet more deadlocks are to be avoided.

bsergean commented 7 months ago

Here's one more complex example I had in this repo at some point: https://github.com/machinezone/IXWebSocket/tree/v11.0.0/ixsnake/ixsnake

It was possibly suffering from the same problem you had; I had it working with thread sanitizer, but it was possibly still leaking.

I know it took me a while to get something relatively stable with this garbage collection thread, so I believe that your extra garbage collecting thread is a good idea. If you'd like to make a PR to add it to this repo I would consider it.

The caveat is that managing a std::list isn't ideal for some operations, but this server already has constraints, so I think it's acceptable.

The code that makes things more complicated is the automatic reconnection thing, but it's a very useful feature.