openwallet-foundation / acapy

ACA-Py is a foundation for building decentralized identity applications and services running in non-mobile environments.
https://aca-py.org
Apache License 2.0
419 stars 512 forks source link

🐛 DIDRotate Hangup can sporadically raise 500 error #3306

Closed ff137 closed 4 weeks ago

ff137 commented 1 month ago

The hangup method in the DIDRotateManager is defined as follows:

    async def hangup(self, conn: ConnRecord) -> Hangup:
        """Hangup the connection.

        Args:
            conn (ConnRecord): The connection to hangup.
        """

        hangup = Hangup()

        async with self.profile.session() as session:
            await conn.delete_record(session)

        responder = self.profile.inject(BaseResponder)
        await responder.send(hangup, connection_id=conn.connection_id)

        return hangup

As you can see, we call await conn.delete_record(session) before we submit the hangup request to the Responder. As far as I can tell, deleting the record before hanging up is the cause of the following unhandled exception:

Handler error: BaseConnectionManager.get_connection_targets
Handler error: upgrade_middleware
Handler error with exception: Record not found: connection/df4accad-96d9-41fa-b534-a6400966d270
=================
Traceback (most recent call last):
  File "/home/aries/.local/lib/python3.12/site-packages/aries_cloudagent/admin/server.py", line 141, in ready_middleware
    return await handler(request)
    ^^^^^^^^^^^^^^^^^^^^^^
...
  File "/home/aries/.local/lib/python3.12/site-packages/aries_cloudagent/protocols/did_rotate/v1_0/routes.py", line 101, in hangup
    msg = await did_rotate_mgr.hangup(conn)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aries/.local/lib/python3.12/site-packages/aries_cloudagent/protocols/did_rotate/v1_0/manager.py", line 86, in hangup
    await responder.send(hangup, connection_id=conn.connection_id)
  File "/home/aries/.local/lib/python3.12/site-packages/aries_cloudagent/messaging/responder.py", line 104, in send
    return await self.send_outbound(
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aries/.local/lib/python3.12/site-packages/aries_cloudagent/admin/server.py", line 108, in send_outbound
    return await self._send(profile, message)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aries/.local/lib/python3.12/site-packages/aries_cloudagent/core/conductor.py", line 665, in outbound_message_router
    raise StorageNotFoundError(f"Record not found: {record_type}/{record_id}")
aries_cloudagent.storage.error.StorageNotFoundError: Record not found: connection/df4accad-96d9-41fa-b534-a6400966d270
Error handling request

This internal server error occurs sporadically / inconsistently, and so it appears to be a race condition based on how quickly the deleted record reflects in storage, after which it cannot be used in outbound_message_router to complete the hangup.

A subsequent retry of the hangup, after the 500, will then yield a 404, as record has been deleted.

The method should be made resilient so that exceptions are handled, and retrying is possible if necessary.

My guess is that await conn.delete_record(session) should just be moved after the hangup is sent to the responder, but that may be naive, as I'm not sure. I'll test a solution in the next couple days, but feedback is welcome in the meantime.

Edit: first observed with v1.0.0, but also with latest release