Agents can get into a bad state and not recover

WadeBarnes commented 4 years ago

Summary

It would appear agents can get into a bad state and continue to run in that state, stopping traffic from reaching the receiving controller.

When agents get into a bad state, they either need to recover or stop processing traffic. If it's possible to implement a health check on the agents that monitors such conditions we could wire it up in OpenShift and have the pods taken out of service.

We can monitor:

readiness; the pod is ready to receive traffic. OpenShift will only route traffic to the pod when this check passes.
liveness; health condition of the pod (running, etc). OpenShift will kill the pod if this check fails a number of times (all configurable).

Details

Using bcgovimages/aries-cloudagent:py36-1.11-1_0.4.0-alpha

Pod Configuration:

1 - ICIA (Indy Catalyst Issuer Agent) Controller
21 - ICIA Agents
17 - ICOB (Indy Catalyst OrgBook) Agents
16 - ICOB Controllers
This configuration has been confirmed to support a maximum issuance rate of ~2100 credentials per minute.
Data set is 2,090,092‬ credentials.

The following issues appear after several hours of sustained load.

Seeing a lot of these errors in the ICIA (Indy Catalyst Issuer Agent) Agents;

ModuleNotFoundError: No module named 'aries_cloudagent.protocols.problem_report.messages'
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aries_cloudagent/transport/inbound/http.py", line 98, in inbound_message_handler
    await self.message_router(body, self._scheme, single_response=response)
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aries_cloudagent/conductor.py", line 318, in inbound_message_router
    parsed_msg, delivery, connection, self.outbound_message_router
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aries_cloudagent/dispatcher.py", line 65, in dispatch
    message = await self.make_message(parsed_msg)
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aries_cloudagent/dispatcher.py", line 132, in make_message
    message_cls = registry.resolve_message_class(message_type)
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aries_cloudagent/messaging/protocol_registry.py", line 91, in resolve_message_class
    msg_cls = ClassLoader.load_class(msg_cls)
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aries_cloudagent/classloader.py", line 141, in load_class
    raise ModuleLoadError(f"Unable to import module: {mod_path}")
aries_cloudagent.classloader.ModuleLoadError: Unable to import module: aries_cloudagent.protocols.problem_report.messages
2019-11-27 13:56:22,762 aries_cloudagent.transport.inbound.http ERROR Error handling message
Traceback (most recent call last):
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aries_cloudagent/classloader.py", line 139, in load_class
    mod = import_module(mod_path)
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked

Seeing a lot of these errors in the ICOB (Indy Catalyst OrgBook) Agents;

2019-11-27 14:03:15,210 asyncio ERROR Task exception was never retrieved
future: <Task finished coro=<Conductor.complete_dispatch() done, defined at /home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aries_cloudagent/conductor.py:322> exception=StorageDuplicateError('Multiple records located',)>
Traceback (most recent call last):
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aries_cloudagent/conductor.py", line 324, in complete_dispatch
    await dispatch
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aries_cloudagent/protocols/credentials/handlers/credential_issue_handler.py", line 35, in handle
    context.message
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aries_cloudagent/protocols/credentials/manager.py", line 302, in receive_credential
    self.context, credential_message._thread_id, "external"
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aries_cloudagent/protocols/credentials/models/credential_exchange.py", line 115, in retrieve_by_thread_and_initiator
    context, {"thread_id": thread_id}, {"initiator": initiator}
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aries_cloudagent/messaging/models/base_record.py", line 236, in retrieve_by_tag_filter
    raise StorageDuplicateError("Multiple records located")
aries_cloudagent.storage.error.StorageDuplicateError: Multiple records located
2019-11-27 14:03:15,666 aries_cloudagent.task_processor WARNING Task failed: <aries_cloudagent.task_processor.PendingTask object at 0x7f9108579ba8>
2019-11-27 14:03:15,666 asyncio ERROR Future exception was never retrieved
future: <Future finished exception=Exception('Unexpected response status',)>
Traceback (most recent call last):
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aries_cloudagent/task_processor.py", line 14, in delay_task
    return await task
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aries_cloudagent/transport/outbound/manager.py", line 224, in dispatch_message
    await transport.handle_message(message)
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aries_cloudagent/transport/outbound/http.py", line 55, in handle_message
    raise Exception("Unexpected response status")
Exception: Unexpected response status
2019-11-27 14:03:31,124 aiohttp.server ERROR Unhandled exception
Traceback (most recent call last):
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aiohttp/web_protocol.py", line 447, in start
    await resp.prepare(request)
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aiohttp/web_response.py", line 353, in prepare
    return await self._start(request)
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aiohttp/web_response.py", line 667, in _start
    return await super()._start(request)
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aiohttp/web_response.py", line 410, in _start
    await writer.write_headers(status_line, headers)
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aiohttp/http_writer.py", line 112, in write_headers
    self._write(buf)
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aiohttp/http_writer.py", line 67, in _write
    raise ConnectionResetError('Cannot write to closing transport')
ConnectionResetError: Cannot write to closing transport
2019-11-27 14:03:39,327 asyncio ERROR Task exception was never retrieved
future: <Task finished coro=<Conductor.complete_dispatch() done, defined at /home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aries_cloudagent/conductor.py:322> exception=StorageDuplicateError('Multiple records located',)>
Traceback (most recent call last):
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aries_cloudagent/conductor.py", line 324, in complete_dispatch
    await dispatch
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aries_cloudagent/protocols/credentials/handlers/credential_issue_handler.py", line 35, in handle
    context.message
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aries_cloudagent/protocols/credentials/manager.py", line 302, in receive_credential
    self.context, credential_message._thread_id, "external"
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aries_cloudagent/protocols/credentials/models/credential_exchange.py", line 115, in retrieve_by_thread_and_initiator
    context, {"thread_id": thread_id}, {"initiator": initiator}
  File "/home/indy/.pyenv/versions/3.6.9/lib/python3.6/site-packages/aries_cloudagent/messaging/models/base_record.py", line 236, in retrieve_by_tag_filter
    raise StorageDuplicateError("Multiple records located")
aries_cloudagent.storage.error.StorageDuplicateError: Multiple records located

WadeBarnes commented 4 years ago

This issue was initial discovered yesterday. Overnight, under load, the agents (21 Issuers, 17 receivers) got into the same state which all but stopped credential processing.

andrewwhitehead commented 4 years ago

@WadeBarnes The first error is fixed in the refactoring branch (bad path for the problem report module). The second one is odd, when ICIA doesn't get a response to a credential offer does it send a message with the same thread ID?

WadeBarnes commented 4 years ago

@andrewwhitehead, @ianco may be able to confirm.

hyperledger / aries-cloudagent-python

Agents can get into a bad state and not recover #280

Summary

Details