hyperledger-archives / aries-framework-dotnet

Aries Framework .NET for building multiplatform SSI services
https://wiki.hyperledger.org/display/aries
Apache License 2.0

Mediator crashes on back to back getInboxItemsMessage calls #163

Open Alexis-Falquier opened 3 years ago

Alexis-Falquier commented 3 years ago

We recently hit an issue in our mediator deployment, which is built on the .NET framework: the mediator crashes, and even corrupts the wallet state, when two calls to fetch the inbox are made back to back.

The mediator breaks when it receives two consecutive synchronous requests of the same type (with response_type = all, so it has to send a response back), where the second request arrives before the mediator has handled the first one and sent its response. The failing sequence looks like this:

Request1 -> Request2 (1s delay) -> Response1 -> Response2 (Handling of this crashes the mediator and corrupts the wallet state)

Note: both calls come from the same agent, which polls its own inbox twice before receiving a response.

The call being handled is getInboxItemsMessage: https://github.com/HarshitaAggrawal/Hyperledger/blob/9f2809db889b5b317d35863c628e5be097d95d92/src/Hyperledger.Aries.Routing.Mediator/Handlers/RoutingInboxHandler.cs#L113

When the error first occurs, the mediator becomes unreachable. Once it is brought back up, no call to the mediator returns correctly. The wallet appears to have been corrupted, and the mediator consistently fails with this error:

Status code 500
Hyperledger.Indy.WalletApi.WalletNotFoundException: The wallet does not exist.
   at Hyperledger.Aries.Storage.DefaultWalletService.GetWalletAsync(WalletConfiguration configuration, WalletCredentials credentials) in /app/aries-framework-dotnet/src/Hyperledger.Aries/Storage/DefaultWalletService.cs:line 36
   at Hyperledger.Aries.Routing.RoutingInboxHandler.GetInboxItemsAsync(IAgentContext agentContext, ConnectionRecord connection, GetInboxItemsMessage getInboxItemsMessage) in /app/aries-framework-dotnet/src/Hyperledger.Aries.Routing.Mediator/Handlers/RoutingInboxHandler.cs:line 144
   at Hyperledger.Aries.Routing.RoutingInboxHandler.ProcessAsync(IAgentContext agentContext, UnpackedMessageContext messageContext) in /app/aries-framework-dotnet/src/Hyperledger.Aries.Routing.Mediator/Handlers/RoutingInboxHandler.cs:line 80
   at Hyperledger.Aries.Agents.AgentBase.ProcessMessage(IAgentContext agentContext, MessageContext messageContext) in /app/aries-framework-dotnet/src/Hyperledger.Aries/Agents/AgentBase.cs:line 143
   at Hyperledger.Aries.Agents.AgentBase.ProcessAsync(IAgentContext context, MessageContext messageContext) in /app/aries-framework-dotnet/src/Hyperledger.Aries/Agents/AgentBase.cs:line 113
   at Hyperledger.Aries.AspNetCore.AgentMiddleware.Invoke(HttpContext aHttpContext, IAgentProvider aAgentProvider) in /app/aries-framework-dotnet/src/Hyperledger.Aries.AspNetCore/AgentMiddleware.cs:line 67
   at Microsoft.AspNetCore.Diagnostics.DeveloperExceptionPageMiddleware.Invoke(HttpContext context)

The mediator only starts working again after it and its wallet have been reset. While trying to reproduce the bug, we found that increasing the performance of the mediator host made the error harder to trigger, which points to a possible stack overflow issue. We mitigated the issue in our client code by improving the polling logic so that it waits for a response between calls. As an added measure, we are switching the mediator wallet from SQLite to Postgres, which we hope will avoid possible wallet corruption. We also plan to stress test the mediator and add a backup. All of these are mitigations, though; the core of the issue is somewhere in the handling of getInboxItemsMessage.
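
For anyone who wants to apply the same client-side mitigation, here is a minimal sketch of a polling guard that refuses to start a second inbox fetch while the first is still waiting for its response. This is not our production code and not part of aries-framework-dotnet: `InboxPoller`, `TryPollAsync`, and the `fetchInbox` delegate are hypothetical names, and the delegate is assumed to wrap whatever call your client uses to send getInboxItemsMessage.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of aries-framework-dotnet.
public sealed class InboxPoller
{
    // Assumption: wraps the client's real inbox fetch call.
    private readonly Func<Task> _fetchInbox;

    // 0 = idle, 1 = a fetch has been sent and its response is still pending.
    private int _inFlight;

    public InboxPoller(Func<Task> fetchInbox) => _fetchInbox = fetchInbox;

    // Returns false without contacting the mediator when a previous
    // poll has not completed yet; returns true after a completed poll.
    public async Task<bool> TryPollAsync()
    {
        // Atomically claim the single in-flight slot.
        if (Interlocked.CompareExchange(ref _inFlight, 1, 0) != 0)
            return false;

        try
        {
            await _fetchInbox();
            return true;
        }
        finally
        {
            // Free the slot even if the fetch threw.
            Volatile.Write(ref _inFlight, 0);
        }
    }
}
```

A timer or loop can call `TryPollAsync` as often as it likes; overlapping ticks are skipped rather than queued, so the mediator never sees Request2 before Response1 from this client. It only hides the problem from one well-behaved client and does not fix the mediator itself.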

Is this a known issue?

Alexis-Falquier commented 3 years ago

@tmarkovski Do you have any insight on this? After we switch to Postgres, we will test thoroughly to try to reproduce the bug. We strongly suspect it was SQLite related, with the multiple calls creating some form of race condition that corrupted the wallet. I'll update this issue with our findings when we do.

acuderman commented 3 years ago

@tmarkovski Switching to a Postgres database didn't solve the issue, since a race condition occurs when opening the wallets. The issue was solved using semaphores. A PR is open and ready for review: #169
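
To illustrate the general idea (this is a simplified sketch, not the code from the PR), a `SemaphoreSlim` can serialize wallet opening so that two concurrent requests never try to open the same wallet at the same time. `SerializedWalletCache` and the `openAsync` delegate are hypothetical names; the delegate stands in for the real wallet-open call.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch; not the actual DefaultWalletService or the PR's code.
public sealed class SerializedWalletCache<TWallet> where TWallet : class
{
    // One permit: only one caller at a time may check the cache or open a wallet.
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);

    private readonly Dictionary<string, TWallet> _open = new Dictionary<string, TWallet>();

    // Assumption: stands in for the real asynchronous wallet-open call.
    private readonly Func<string, Task<TWallet>> _openAsync;

    public SerializedWalletCache(Func<string, Task<TWallet>> openAsync) => _openAsync = openAsync;

    public async Task<TWallet> GetAsync(string walletId)
    {
        await _gate.WaitAsync();
        try
        {
            // A concurrent caller that waited on the gate finds the wallet
            // already cached here instead of opening it a second time.
            if (_open.TryGetValue(walletId, out var cached))
                return cached;

            var wallet = await _openAsync(walletId);
            _open[walletId] = wallet;
            return wallet;
        }
        finally
        {
            _gate.Release();
        }
    }
}
```

Without the gate, two requests can both miss the cache and both call the open routine for the same wallet, which matches the back to back failure pattern described above. A single global gate is the simplest choice; one semaphore per wallet ID would reduce contention if opening unrelated wallets in parallel matters.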

juvebogdan commented 3 years ago

@Alexis-Falquier @tmarkovski Hello, can I ask you about a possibly similar issue I am facing? If I send several FetchInboxAsync requests in quick succession, I get duplicate messages on the edge client. It seems that the mediator takes some time to delete them from the inbox. Did you face something like this? Is this expected?