openwallet-foundation / acapy

ACA-Py is a foundation for building decentralized identity applications and services running in non-mobile environments.
https://aca-py.org
Apache License 2.0
412 stars 512 forks source link

Race condition in issue-credential v1.0 leads to credential exchange switching to abandoned state #2000

Closed markuskreusch closed 8 months ago

markuskreusch commented 1 year ago

We have a test system running that connects two ACA-Py instances and triggers issuance of credentials using the issue-credential v1.0 protocol. During load tests we have around 1% of all credential exchanges fail.

Some more details:

We found out that this problem can be mitigated by waiting some seconds before doing the issue http call and assumed thus, that this is a race condition. Further analysis seems to confirm this:

swcurran commented 1 year ago

FYI - @ianco -- this sounds familiar.

ianco commented 1 year ago

@swcurran ... Yes, similar to the mediator issue we uncovered ... we need to look at where we're sending messages vs committing updates to the exchange records, and also what error checking we're doing on the state when we receive a message (or invoke one of the admin api endpoints) ...

ianco commented 1 year ago

OK not exactly like the mediator issue, but also looks like a general problem across basically all protocols.

In the base_record (https://github.com/hyperledger/aries-cloudagent-python/blob/main/aries_cloudagent/messaging/models/base_record.py#L389) - the emit_event() happens before the transaction is committed (which I think is what eventually triggers the webhook).

Ideally any events that are triggered within a protocol shouldn't actually be "emitted" and acted upon until the protocol step completes.

Also the specific error of "exchange is in wrong state" shouldn't trigger the exchange to be pushed into an abandoned state or issue a problem report.

ianco commented 1 year ago

Labelling this as High Priority because I think this needs to be addressed for a 1.0.0 release

swcurran commented 8 months ago

Need to relook at this issue.

swcurran commented 8 months ago

@ianco -- could you look at this again, please? As issue-credential-1.0 is deprecated, an issue local to that particular concern is not that big a deal. However, if it is a broader issue, we need to know that, and to characterize both the problem and what actions we should take. Not looking for it to be fixed (yet) -- just a definition of the problem, its potential impact and suggestions.

Note that I didn't read through this full issue, so the information might already be all here. If so, summarizing it in one comment, or in an ACA-Pug presentation might be sufficient for this request. Let me know.

ianco commented 8 months ago

I'll take a look. Reading through the comments I think the solution is pretty straightforward. I'll fix the 1.0 protocol and double check the other protocols ...

ianco commented 8 months ago

However, if it is a broader issue, we need to know that ...

I believe this is a bigger problem. The emit() function (which triggers the webhook) happens as each record is added, updated or deleted, so in all cases the webhook will be called before the record is saved. So we have potential race conditions pretty much everywhere.

My suggestion ... the problem is in the following code, in https://github.com/hyperledger/aries-cloudagent-python/blob/main/aries_cloudagent/messaging/models/base_record.py:

    async def post_save(
        self,
        session: ProfileSession,
        new_record: bool,
        last_state: Optional[str],
        event: bool = None,
    ):
        """Perform post-save actions.

        Args:
            session: The profile session to use
            new_record: Flag indicating if the record was just created
            last_state: The previous state value
            event: Flag to override whether the event is sent
        """

        if event is None:
            event = new_record or (last_state != self.state)
        if event:
            await self.emit_event(session, self.serialize())

    async def delete_record(self, session: ProfileSession):
        """Remove the stored record.

        Args:
            session: The profile session to use
        """

        if self._id:
            storage = session.inject(BaseStorage)
            if self.state:
                self._previous_state = self.state
                self.state = BaseRecord.STATE_DELETED
                await self.emit_event(session, self.serialize())
            await storage.delete_record(self.storage_record)

    async def emit_event(self, session: ProfileSession, payload: Any = None):
        """Emit an event.

        Args:
            session: The profile session to use
            payload: The event payload
        """

        if not self.RECORD_TOPIC:
            return

        if self.state:
            topic = f"{self.EVENT_NAMESPACE}::{self.RECORD_TOPIC}::{self.state}"
        else:
            topic = f"{self.EVENT_NAMESPACE}::{self.RECORD_TOPIC}"

        if not payload:
            payload = self.serialize()

        await session.profile.notify(topic, payload)

(There are other scenarios as well in this code.)

Note that in the delete() scenario the emit() is called before the record is saved, however in the post_save() scenario it happens before. In either case, the commit() happens later on.

My suggestion is to move the emit() code into the commit() method in the SessionProfile class: https://github.com/hyperledger/aries-cloudagent-python/blob/main/aries_cloudagent/core/profile.py

This would ensure that all notifications happen after the transaction is committed (and database updates are updated) however it may have other side effects. I can give this a try but wanted to get some feedback ...

@dbluhm @swcurran @shaangill025 any thoughts?

ianco commented 8 months ago

My suggestion is to move the emit() code into the commit() method in the SessionProfile class: https://github.com/hyperledger/aries-cloudagent-python/blob/main/aries_cloudagent/core/profile.py

This would ensure that all notifications happen after the transaction is committed (and database updates are updated) however it may have other side effects. I can give this a try but wanted to get some feedback ...

swcurran commented 8 months ago

@ianco - could you prepare a short session on this for the ACA-Pug meeting tomorrow. I’d like this to have a higher profile, since you haven’t gotten feedback. This is out of my realm.

ianco commented 8 months ago

@ianco - could you prepare a short session on this for the ACA-Pug meeting tomorrow. I’d like this to have a higher profile, since you haven’t gotten feedback. This is out of my realm.

Yep will do. I have a fix so I'll open a PR. I'm having trouble duplicating the issue though so can't (yet) verify the fix.

PR #2760

swcurran commented 8 months ago

Fixed by #2760

ianco commented 8 months ago

@markuskreusch