irods / irods_rule_engine_plugin_audit_amqp

BSD 3-Clause "New" or "Revised" License

Plugin behaviour when RabbitMQ target is down #108

Closed: kript closed this issue 1 year ago

kript commented 1 year ago

Hi folks,

We recently had a disk-full incident on our test RabbitMQ system, which meant that the audit plugin installed and configured on our dev zones couldn't reach a RabbitMQ system to report the audited PEPs.

We discovered the following:

  1. we were still able to iget and iput files 👍
  2. The logs were not helpful, as they contained little more than lots of lines like:
       send: Broken pipe
       send: Connection refused
       recv: Connection refused
       send: Broken pipe

This was happening very often:

$ grep -c recv /var/lib/irods/log/rodsLog.2022.09.26
6944247
$ grep -c "send:" /var/lib/irods/log/rodsLog.2022.09.26
7691342
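The two `grep` counts above lump every matching line together. A minimal Python sketch of a one-pass tally that breaks the flood down by distinct message might look like this; it assumes each error appears at the end of its own log line in the form quoted above (`send: Broken pipe` and so on), and `summarize_transport_errors` is a hypothetical helper, not part of iRODS or the plugin.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the transport errors quoted in the issue, e.g. "send: Broken pipe".
# Assumption: each error sits at the end of its own log line; any prefix
# before it (timestamp, pid, etc.) is ignored.
ERROR_RE = re.compile(r"\b(send|recv): (.+?)\s*$")


def summarize_transport_errors(lines: Iterable[str]) -> Counter:
    """Count each distinct 'send:'/'recv:' error message in a log."""
    counts: Counter = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            # Key on operation plus message, e.g. "recv: Connection refused".
            counts[f"{match.group(1)}: {match.group(2)}"] += 1
    return counts
```

To run it over a whole day's log, something like `with open("/var/lib/irods/log/rodsLog.2022.09.26", errors="replace") as log: print(summarize_transport_errors(log).most_common())` should work. Because it only matches `send:` or `recv:` followed by a message, its totals may differ slightly from `grep -c recv`, which counts any line containing `recv`.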

As soon as the RabbitMQ service was restored, the odd messages went away. However, the system did not have any queues, so it wasn't processing any of the messages. It would be helpful to have a way to know that a message was acknowledged by the RabbitMQ system (perhaps a debug setting in the options?).

Can we have the plugin log something more helpful, please? It took debugging from first principles to find the cause, as we also had the indexing and tiering plugins installed and had also made a minor database change...

N.B. This is tangentially related to #106

trel commented 1 year ago

Thanks for the report.

Yes, we need to make the plugin a lot more helpful / resilient in this scenario.

alanking commented 1 year ago

When I killed the message broker (in my case, ActiveMQ 5.14) with a 4.3.0 server and the audit plugin as of #118, I got a message like this in the logs:

{
  "error_condition::description": "Connection refused - on read from localhost:5672",
  "error_condition::name": "proton:io",
  "error_condition::what": "proton:io: Connection refused - on read from localhost:5672",
  "log_category": "rule_engine",
  "log_facility": "local0",
  "log_level": "error",
  "log_message": "Transport error in proton messaging handler",
  "request_api_name": "GENERAL_ADMIN_AN",
  "request_api_number": 701,
  "request_api_version": "d",
  "request_client_user": "rods",
  "request_host": "192.168.16.3",
  "request_proxy_user": "rods",
  "request_release_version": "rods4.3.0",
  "rule_engine_plugin": "audit_amqp",
  "server_host": "d6b557785867",
  "server_pid": 1063015,
  "server_timestamp": "2023-06-07T18:30:36.354Z",
  "server_type": "agent"
}

I think this provides all the information that was missing before (the server, the plugin name, more detailed messages, etc.). Perhaps this is resolved?
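Because each record is a flat JSON object with fields such as `rule_engine_plugin` and `log_level`, the audit plugin's errors can be pulled out of a busy log programmatically. The Python sketch below is an illustration under assumptions, not tooling shipped with iRODS: it assumes one JSON object per log line (the record above is pretty-printed for readability), possibly preceded by a syslog prefix, and `audit_amqp_errors` and `describe` are hypothetical names.

```python
import json
from typing import Iterable, Iterator


def audit_amqp_errors(lines: Iterable[str]) -> Iterator[dict]:
    """Yield error-level records logged by the audit_amqp rule engine plugin."""
    for line in lines:
        # Assumption: one JSON object per line, possibly after a syslog prefix,
        # so parsing starts at the first opening brace.
        start = line.find("{")
        if start == -1:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # not a structured record; skip it
        if (record.get("rule_engine_plugin") == "audit_amqp"
                and record.get("log_level") == "error"):
            yield record


def describe(record: dict) -> str:
    """Format the fields that say when, where and why the transport failed."""
    return "{} {} {}".format(
        record.get("server_timestamp", "?"),
        record.get("server_host", "?"),
        record.get("error_condition::what", record.get("log_message", "?")),
    )
```

Filtering on `rule_engine_plugin` is what would have saved the first-principles debugging described earlier: with the indexing and tiering plugins also installed, the plugin name is the field that points straight at the audit plugin.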

korydraughn commented 1 year ago

I believe this does resolve the issue.

Please assign the appropriate labels and developer to the issue before closing.

SwooshyCueb commented 1 year ago

Oh yeah, I did add a whole bunch more logging in my first refactor PR. I didn't know about this issue (or had forgotten about it) or I'd have tagged it in the commits.

trel commented 1 year ago

Looks like this was handled by https://github.com/irods/irods_rule_engine_plugin_audit_amqp/commit/2b3a39806d9967a703ad39f8739d68025ab57259 from https://github.com/irods/irods_rule_engine_plugin_audit_amqp/pull/105.