Closed WadeBarnes closed 4 years ago
A couple more logs showing the agents crashing with the poison/panic error during a call to /agentcb/topic/present_proof/.
Agent log with a back-trace on a panic error and crash. Not a proof request call, but it may help. agent-indy-cat-23-g4x7h.log
Couple of thoughts:
After digging into the Rust code a bit, I'm not sure there's much we can do on that side. In Rust, a "PoisonError" means a thread panicked while holding a lock, so the data behind that lock may be in an inconsistent state. Even if we try to handle the error, about all we can do is crash/restart.
One option is to upgrade the postgres and r2d2 connection pool libraries to the latest versions (they may have addressed an underlying condition that is causing the error). However, the latest versions introduce backward-incompatible changes, so there is some coding involved.
Recommendation is to update the retry logic in the controller to handle this condition (wait and retry) and let OpenShift recycle the offending pods.
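To make the wait-and-retry recommendation concrete, here is a minimal sketch in Python. It is not the controller's actual code: the name `call_with_retry`, the attempt count, and the delays are all assumptions for illustration. It treats a dropped connection as retryable and backs off exponentially so a recycled pod has time to come back.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    call: Callable[[], T],
    attempts: int = 4,            # assumed value, tune for the deployment
    base_delay: float = 0.5,      # assumed value, in seconds
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying with exponential backoff on connection failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            # Wait base_delay, 2x, 4x, ... so a replacement pod can start.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Here `call` would wrap whatever request the controller sends to the agent. The built-in `ConnectionError` covers `http.client.RemoteDisconnected`; if the controller uses the `requests` library, pass `retry_on=(requests.exceptions.ConnectionError,)` instead, since that exception is a separate class.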
In the OrgBook we're seeing credential verification calls (https://orgbook.gov.bc.ca/api/credential/1/verify) fail every so often. This is causing downtime to be reported by the monitor watching the verification API.
At the API level this is reported as the server (the agent) disconnecting without a response. On the agent side the timestamps seem to line up with panic errors:
thread '<unnamed>' panicked at 'called Result::unwrap() on an Err value: Error(None)', src/libcore/result.rs:1165:5
These result in the affected agent pods crashing or not responding. The sample logs also indicate some errors calling back to the controller's (API's) /agentcb/topic/present_proof/ endpoint.
Logs from the affected pods for review: api-indy-cat-18-n8hq2.log agent-indy-cat-19-7tbld.log agent-indy-cat-20-4jh7p.log agent-indy-cat-20-5kmk8.log agent-indy-cat-20-9j2tn.log agent-indy-cat-20-pppgx.log agent-indy-cat-21-7ddzq.log