rjwills28 / coniql_strawberry

A test implementation of the Coniql application using Strawberry
0 stars 0 forks source link

Memory leak when unsubscribing from PVs #2

Closed rjwills28 closed 2 years ago

rjwills28 commented 2 years ago

I discovered an issue when testing the cs-web-proto performance page against the new strawberry Coniql implementation using the new websocket protocol. When closing the cs-web-proto webpage I noticed that the Coniql task's CPU does not drop to 0 immediately and in this time the memory usage continues to go up. Eventually CPU goes to 0 and memory stops increasing.

Details to reproduce:

rjwills28 commented 2 years ago

Debugging Debugging the application I realised that after closing the page, some of the 100 PVs do get unsubscribed, but then it stops and no more get unsubscribed. In this time memory increases. Eventually all remaining PVs are unsubscribed, presumably in some clean-up mechanism, and memory stops increasing. I ran further tests on both client and server implementations. This issue does not occur when using the 'old' websocket protocol with subscription-transport-ws. Yet I could not find any issues with the actual Coniql server code or the updated cs-web-proto client...

Issue I finally looked into the Strawberry source code that is specific to the new graphql-ws protocol. I found the issue in https://github.com/strawberry-graphql/strawberry/blob/main/strawberry/subscriptions/protocols/graphql_transport_ws/handlers.py. This is what is happening as far as I can tell:

Solution We will require a fix in the Strawberry source code to solve this issue. The fix I implemented involved catching the ConnectionResetError and allowing it to pass as we know that the subscription is still being managed and so will be properly cleaned up in this process. The other option is to not delete the subscriptions when a BaseException is caught but I'm not sure what impact this might have on other websocket errors. Developers of Strawberry may be able to give some input here.

async def operation_task(
    self, result_source: AsyncGenerator, operation_id: str
) -> None:
    """
    Operation task top level method.  Cleans up and de-registers the operation
    once it is done.
    """
    try:
        await self.handle_async_results(result_source, operation_id)
->  except ConnectionResetError:
->      pass
    except BaseException:  # pragma: no cover
        # cleanup in case of something really unexpected
        del self.subscriptions[operation_id]
        del self.tasks[operation_id]
        raise
rjwills28 commented 2 years ago

Also note that I think this issue only occurs when testing against PVs updating at 10Hz. The ConnectionResetError is thrown because there is a race condition between the subscription being shut down and the application trying to write the new value to the closed websocket. If all subscriptions can be closed before an update to one of the PVs comes in then we would not see this problem.

aawdls commented 2 years ago

Just to note what we discussed today.

I think you should proceed with a PR submission to strawberry-graphql. It would be most effective if you can make a minimal reproducer. The key characteristic is the high rate of events on a large number of subscriptions that triggers the race condition.

rjwills28 commented 2 years ago

While I was trying to create a simplified case of this issue to give to the Strawberry contributors, I actually realised what the core problem is. When the ConnectionResetError exception gets thrown, the Strawberry handler just deletes the generator from it's internal dictionary and forgets about it. Deleting the generator and not explicitly closing it means that the finally statement in the try-finally never gets called. In our Coniql code, this is where we unsubscribe from the PV monitors. The finally statement will eventually be called when the garbage collector comes along, which is why we see these unsubscribes happen later than expected.

I have raised an issue on the Strawberry repo and provided them with a simple demo server and client.

rjwills28 commented 2 years ago

This issue has been fixed with the following PR in the Strawberry repo: https://github.com/strawberry-graphql/strawberry/pull/2141.

The fix is available in the 0.127.4 release on PyPi.