Closed by WadeBarnes 4 years ago
This issue was initially discovered yesterday. Overnight, under load, the agents (21 issuers, 17 receivers) got into the same state, which all but stopped credential processing.
@WadeBarnes The first error is fixed in the refactoring branch (bad path for the problem report module). The second one is odd: when ICIA doesn't get a response to a credential offer, does it send a message with the same thread ID?
@andrewwhitehead, @ianco may be able to confirm.
Summary
It appears that agents can get into a bad state and continue to run in that state, which stops traffic from reaching the receiving controller.
When agents get into a bad state, they need to either recover or stop processing traffic. If it's possible to implement a health check on the agents that monitors such conditions, we could wire it up in OpenShift and have the pods taken out of service.
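To make the health check idea concrete, here is a minimal sketch of an endpoint that an OpenShift readiness or liveness probe could poll. It is not part of aries-cloudagent: the `AgentHealth` class, the `/healthz` path, the port, and both thresholds are assumptions chosen for illustration, and the agent would need to call `record_success()` and `record_error()` from its own message handling.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


class AgentHealth:
    """Hypothetical tracker for signals that suggest an agent is in a bad state."""

    def __init__(self, max_idle_seconds=300.0, max_consecutive_errors=10,
                 clock=time.monotonic):
        self.max_idle_seconds = max_idle_seconds
        self.max_consecutive_errors = max_consecutive_errors
        self._clock = clock
        self._last_success = clock()
        self._consecutive_errors = 0

    def record_success(self):
        # Call after a message is processed successfully.
        self._last_success = self._clock()
        self._consecutive_errors = 0

    def record_error(self):
        # Call when message processing fails.
        self._consecutive_errors += 1

    def is_healthy(self):
        # Unhealthy if nothing has succeeded recently or errors keep repeating.
        idle = self._clock() - self._last_success
        return (idle <= self.max_idle_seconds
                and self._consecutive_errors < self.max_consecutive_errors)


def make_handler(health):
    """Build a request handler class bound to one AgentHealth instance."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":  # assumed probe path
                self.send_error(404)
                return
            ok = health.is_healthy()
            body = b"ok" if ok else b"unhealthy"
            # A 503 response makes an HTTP probe count the check as failed.
            self.send_response(200 if ok else 503)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep probe requests out of the agent logs

    return HealthHandler


def serve(health, port=8080):
    # Blocks forever; run in a background thread next to the agent.
    HTTPServer(("0.0.0.0", port), make_handler(health)).serve_forever()
```

With an `httpGet` readiness probe pointed at this path, a pod returning 503 would be removed from the service endpoints; a liveness probe would instead restart it. The idle threshold only makes sense under sustained load like the one described here, since a quiet agent would otherwise be reported as unhealthy.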
We can monitor:
Details
Using bcgovimages/aries-cloudagent:py36-1.11-1_0.4.0-alpha
Pod Configuration:
The following issues appear after several hours of sustained load.
Seeing a lot of these errors in the ICIA (Indy Catalyst Issuer Agent) agents:
Seeing a lot of these errors in the ICOB (Indy Catalyst OrgBook) agents: