Closed: rjwills28 closed this issue 2 years ago.
**Debugging**

Debugging the application, I realised that after closing the page some of the 100 PVs do get unsubscribed, but then unsubscribing stalls and no more are unsubscribed. During this time memory increases. Eventually all remaining PVs are unsubscribed, presumably by some clean-up mechanism, and memory stops increasing. I ran further tests on both the client and server implementations. This issue does not occur when using the 'old' websocket protocol with subscriptions-transport-ws. Yet I could not find any problems in the actual Coniql server code or the updated cs-web-proto client...
**Issue**

I finally looked into the Strawberry source code specific to the new graphql-ws protocol and found the issue in https://github.com/strawberry-graphql/strawberry/blob/main/strawberry/subscriptions/protocols/graphql_transport_ws/handlers.py. As far as I can tell, this is what is happening:
When the page closes, a `ConnectionResetError` is raised as the thread continues to send messages to the websocket that has just been closed. This is not explicitly caught as a `ConnectionResetError` and is instead caught as a `BaseException` here. In doing so, that subscription is deleted from the internal dictionary of live subscriptions that tracks which subscriptions are active. When the code runs its clean-up, it cannot find this subscription task to close it properly, which leaves it untracked but still active. Eventually it is closed after some timeout.

**Solution**
We will require a fix in the Strawberry source code to solve this issue. The fix I implemented catches the `ConnectionResetError` and allows it to pass, since we know the subscription is still being managed and so will be cleaned up properly in that process. The other option is to not delete the subscriptions when a `BaseException` is caught, but I'm not sure what impact that might have on other websocket errors. The Strawberry developers may be able to give some input here.
```python
async def operation_task(
    self, result_source: AsyncGenerator, operation_id: str
) -> None:
    """
    Operation task top level method. Cleans up and de-registers the operation
    once it is done.
    """
    try:
        await self.handle_async_results(result_source, operation_id)
    except ConnectionResetError:  # <-- proposed fix
        pass                      # <-- proposed fix
    except BaseException:  # pragma: no cover
        # cleanup in case of something really unexpected
        del self.subscriptions[operation_id]
        del self.tasks[operation_id]
        raise
```
Also note that I think this issue only occurs when testing against PVs updating at 10Hz. The `ConnectionResetError` is thrown because there is a race condition between the subscription being shut down and the application trying to write a new value to the closed websocket. If all subscriptions could be closed before an update to one of the PVs came in, we would not see this problem.
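The race described above can be sketched with a toy asyncio script. This is not Strawberry or Coniql code; the `FakeWebSocket` class and `subscription` coroutine are hypothetical stand-ins used only to show how a fast producer can attempt a send after the connection has already been torn down:

```python
import asyncio

results = []  # records what the subscription observed

class FakeWebSocket:
    """Hypothetical stand-in for a real websocket connection."""
    def __init__(self):
        self.closed = False

    async def send(self, msg):
        # A real websocket raises once the peer has gone away.
        if self.closed:
            raise ConnectionResetError("socket already closed")

async def subscription(ws):
    # At 10Hz there is almost always a send in flight when the page
    # closes, so some sends land on the already-closed socket.
    try:
        for value in range(5):
            await ws.send(value)
            await asyncio.sleep(0)  # yield, like awaiting the next PV update
    except ConnectionResetError:
        results.append("reset")

async def main():
    ws = FakeWebSocket()
    task = asyncio.create_task(subscription(ws))
    await asyncio.sleep(0)  # let the first send go through
    ws.closed = True        # the page closes mid-stream
    await task

asyncio.run(main())
```

With a slow producer, the shutdown would win the race and no `ConnectionResetError` would be seen, which matches why the problem only shows up at a high update rate.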
Just to note what we discussed today.
I think you should proceed with a PR submission to strawberry-graphql. It would be most effective if you can make a minimal reproducer. The key characteristic is the high rate of events on a large number of subscriptions that triggers the race condition.
While I was trying to create a simplified case of this issue to give to the Strawberry contributors, I actually realised what the core problem is. When the `ConnectionResetError` exception gets thrown, the Strawberry handler just deletes the generator from its internal dictionary and forgets about it. Deleting the generator without explicitly closing it means the `finally` block in the try-finally never runs. In our Coniql code, this is where we unsubscribe from the PV monitors. The `finally` block is eventually executed when the garbage collector finalizes the generator, which is why we see these unsubscribes happen later than expected.
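The deferred-`finally` behaviour can be demonstrated in isolation. A minimal sketch, assuming nothing about Strawberry internals: `pv_updates` is a hypothetical stand-in for a Coniql subscription generator, and the contrast is between explicitly closing the generator with `aclose()` versus merely dropping the reference:

```python
import asyncio

events = []  # records when cleanup actually ran

async def pv_updates():
    try:
        while True:
            yield "value"
    finally:
        # In Coniql, this is where the PV monitor is unsubscribed.
        events.append("unsubscribed")

async def main():
    gen = pv_updates()
    await gen.__anext__()  # start the generator; runs up to the first yield
    assert events == []    # the finally block has not run yet
    await gen.aclose()     # explicit close runs the finally block immediately
    assert events == ["unsubscribed"]
    # By contrast, `del gen` without aclose() leaves the finally block to
    # run whenever the garbage collector finalizes the generator, which
    # is why the unsubscribes appear long after the page closes.

asyncio.run(main())
```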
I have raised an issue on the Strawberry repo and provided them with a simple demo server and client.
This issue has been fixed with the following PR in the Strawberry repo: https://github.com/strawberry-graphql/strawberry/pull/2141.
The fix is available in the 0.127.4 release on PyPI.
I discovered an issue when testing the cs-web-proto performance page against the new Strawberry Coniql implementation using the new websocket protocol. When closing the cs-web-proto webpage, I noticed that the Coniql task's CPU usage does not drop to 0 immediately, and during this time memory usage continues to rise. Eventually CPU drops to 0 and memory stops increasing.
Details to reproduce: