Proof requests failing - Githubissues

openwallet-foundation / acapy

ACA-Py is a foundation for building decentralized identity applications and services running in non-mobile environments.

https://aca-py.org

Apache License 2.0

Proof requests failing #684

Closed WadeBarnes closed 4 years ago

WadeBarnes commented 4 years ago

In the OrgBook we're seeing credential verification calls (https://orgbook.gov.bc.ca/api/credential/1/verify) fail every so often. This is causing downtime to be reported by the monitor watching the verification API.

At the API level this is reported as the server (the agent) disconnecting without a response. On the agent side the timestamps seem to line up with panic errors (thread '<unnamed>' panicked at 'called Result::unwrap() on an Err value: Error(None)', src/libcore/result.rs:1165:5), which result in the affected agent pods crashing or not responding.

The sample logs also indicate some errors calling back to the controller's (api's) /agentcb/topic/present_proof/ endpoint.

Logs from the affected pods for review: api-indy-cat-18-n8hq2.log agent-indy-cat-19-7tbld.log agent-indy-cat-20-4jh7p.log agent-indy-cat-20-5kmk8.log agent-indy-cat-20-9j2tn.log agent-indy-cat-20-pppgx.log agent-indy-cat-21-7ddzq.log

WadeBarnes commented 4 years ago

A couple more logs showing the agents crashing with the poison/panic error during a call to /agentcb/topic/present_proof/:

agent-indy-cat-22-nkt57.log agent-indy-cat-22-7r7ft.log

WadeBarnes commented 4 years ago

Agent log with a back-trace on a panic error and crash. Not a proof request call, but it may help. agent-indy-cat-23-g4x7h.log

ianco commented 4 years ago

Couple of thoughts:

could be related to the agent pod losing connection to the database pod (similar to the other gateway timeouts we've seen)?
could be some kind of race condition - the agent calls the api to issue a credential; the api does a call-back to the agent to tell the agent to save the credential in the wallet ... depending on timing the two requests could step on each other?

ianco commented 4 years ago

After digging into the Rust code a bit, I'm not sure there's much we can do on that side. "PoisonError" means the application has gotten into a corrupted state, so even if we try to handle the error all we can do is crash/restart.

One option is to upgrade the postgres and r2d2 connection pool libraries to the latest versions (they may have addressed an underlying condition that is causing the error). However, the latest libraries have introduced backward-incompatible changes, so there is some coding involved.

The recommendation is to update the retry logic in the controller to handle this condition (wait and retry) and let OpenShift recycle the offending pods.
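
A rough sketch of what "wait and retry" could look like on the controller side, assuming the controller is Python and that the agent dropping the connection surfaces as a connection or timeout exception. The helper name `call_with_retry`, its parameters, and the default delays are all hypothetical and not taken from the OrgBook code; the exception types would need to match whatever HTTP client the controller actually uses.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation(), retrying with exponential backoff on retry_on errors.

    Hypothetical helper: the defaults are illustrative, not tuned values.
    HTTP clients that raise their own exception classes need those classes
    passed in via retry_on.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                # Out of retries: surface the original error to the caller.
                raise
            # Wait 0.5s, 1s, 2s, ... capped at max_delay, which gives the
            # platform time to restart a crashed agent pod.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    # Simulated agent call that drops the connection twice, then succeeds.
    state = {"calls": 0}

    def flaky_verify() -> str:
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("agent disconnected without response")
        return "verified"

    print(call_with_retry(flaky_verify, base_delay=0.01))
```

The backoff only papers over the crash: it helps when a healthy pod is available or the crashed one comes back within the retry window, so the total wait should be sized against how long a pod restart takes in practice.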