irods / irods_rule_engine_plugin_audit_amqp

BSD 3-Clause "New" or "Revised" License

Long-lived AMQP connection #107

Open SwooshyCueb opened 1 year ago

SwooshyCueb commented 1 year ago

At present, we make a new connection for every AMQP message we send. This is not ideal.

From the RabbitMQ documentation (emphasis theirs):

> Before an application can use RabbitMQ, it has to open a connection to a RabbitMQ node. The connection then will be used to perform all subsequent operations. Connections are **meant to be long-lived**. Opening a connection for every operation (e.g. publishing a message) would be very inefficient and is highly discouraged.

This comes from AMQP 0-9-1-centric and .NET-centric documentation, but it still applies to us.

trel commented 1 year ago

Very interesting ... and kind of obvious now that it's linked :)

Suggests the server has to hold a connection longer-lived than an individual Agent.

Hmm.....

SwooshyCueb commented 1 year ago

I've been pondering this for a bit, actually. It was something I thought about working on as part of #105, but there's already so much stuff in there that I think I'd rather put a bow on it and do further refactoring of our AMQP/Proton usage as part of a separate effort.

korydraughn commented 1 year ago

I agree. Handling that in a separate PR feels like the right thing to do.

korydraughn commented 1 year ago

Potential Solution

Instead of opening one RabbitMQ connection per AMQP message, we can store the messages and send them at a later time. This enables the iRODS server to batch messages. It also means the server is less likely to lose messages, since they have already been written down successfully before any send is attempted.

Plugin Configuration

The audit plugin would grow two new configuration options:

  - shared_memory_size_in_bytes: the size of the pre-allocated shared memory buffer that holds unsent messages
  - flush_interval_in_seconds: how much time must pass before the buffered messages are flushed

High-Level Algorithm

The audit plugin would be changed to do the following:

  1. Attempt to write the audit information into the pre-allocated shared memory buffer.
  2. If inserting the audit information would exceed shared_memory_size_in_bytes, flush the messages.
  3. If the message was inserted without error, check the flush_interval_in_seconds.
  4. If the flush_interval_in_seconds has been satisfied, flush the messages.
  5. If the flush_interval_in_seconds has not been satisfied, do nothing.

"Flush the messages" means opening a single RabbitMQ connection and using it to send all messages stored in shared memory. Only one agent is allowed to do this at a time.

This solution assumes access to the shared memory is synchronized across agents running on the same machine. This solution does not require zone-wide synchronization.
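
A minimal sketch of that algorithm, assuming Boost.Interprocess for the shared memory segment and mutex. Every name and value here is illustrative (the two constants stand in for the proposed configuration options), the AMQP publishing itself is elided, and a real implementation would also need to record message boundaries inside the buffer:

#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/sync/interprocess_mutex.hpp>
#include <boost/interprocess/sync/scoped_lock.hpp>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

namespace bip = boost::interprocess;

// Stand-ins for the proposed configuration options (hypothetical values).
constexpr std::size_t shared_memory_size_in_bytes = 1 << 20;
constexpr std::int64_t flush_interval_in_seconds = 10;

struct audit_buffer
{
    bip::interprocess_mutex mtx;            // synchronizes agents on this host
    std::int64_t last_flush_epoch_secs = 0; // set by the flush
    std::size_t used = 0;                   // bytes currently occupied
    char data[shared_memory_size_in_bytes]; // the pre-allocated message area
};

std::int64_t now_epoch_secs()
{
    using namespace std::chrono;
    return duration_cast<seconds>(system_clock::now().time_since_epoch()).count();
}

// Opens ONE connection, publishes everything in [data, data + used), then
// closes it. Callers must hold the mutex, so only one agent flushes at a time.
void flush_messages(audit_buffer& buf)
{
    // ... AMQP publishing elided ...
    buf.used = 0;
    buf.last_flush_epoch_secs = now_epoch_secs();
}

// Steps 1-5 of the algorithm above, for a single audit message.
void handle_audit_message(audit_buffer& buf, const std::string& msg)
{
    bip::scoped_lock<bip::interprocess_mutex> lock{buf.mtx};

    if (buf.used + msg.size() > shared_memory_size_in_bytes) {
        flush_messages(buf); // buffer would overflow: flush first
    }

    std::memcpy(buf.data + buf.used, msg.data(), msg.size());
    buf.used += msg.size();

    if (now_epoch_secs() - buf.last_flush_epoch_secs >= flush_interval_in_seconds) {
        flush_messages(buf); // interval satisfied: flush
    }
}

int main()
{
    bip::managed_shared_memory segment{
        bip::open_or_create, "irods_audit_shm", sizeof(audit_buffer) + 65536};
    auto* buf = segment.find_or_construct<audit_buffer>("audit_buffer")();
    handle_audit_message(*buf, R"({"pep":"example"})");
}

Because the mutex lives in the shared memory segment itself, any agent on the machine can take on the flush duty, but only one does so at a time.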

alanking commented 1 year ago

Would the messages also flush when the plugin's stop operation is invoked? Or is this being managed outside of the agents? "The server" terminology is tripping me up, I think.

korydraughn commented 1 year ago

> Would the messages also flush when the plugin's stop operation is invoked?

That's a possibility. We'd need to investigate and see what the pros/cons are around the stop operation.

> Or is this being managed outside of the agents?

It is all happening inside the agent. The good thing about this design is that it's flexible. The audit plugin could simply write things down and let another tool handle sending the messages. The design presented just makes it so that no other tool is necessary.

trel commented 1 year ago

Have to make sure we write down the new messages in the full buffer case... so...

flush_messages {
    send_all_messages_in_single_connection()
    set_last_flush_time()
}

# check the flush_interval_in_seconds
if enough seconds have passed since the last flush:
    flush_messages()

# save the new message, flushing whenever the buffer is full
message_saved = false
while not message_saved:
    message_saved = save_message_to_buffer()
    if not message_saved:
        flush_messages()

alanking commented 1 year ago

Oh, I see. The flush state is tracked in shared memory too, alongside the messages to flush. Mental model was missing some screws, as per usual.

SwooshyCueb commented 1 year ago

I am personally not a fan of batching messages in this way. To me it feels like a janky, overcomplicated workaround for our ownership model. Instead, I propose we have a single connection per agent, opened in start(), reused in exec_rule(), and closed in stop().
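
A minimal sketch of that lifecycle, with a hypothetical amqp_connection RAII wrapper standing in for the plugin's real qpid-proton machinery (the function names mirror the plugin operations, but all signatures here are simplified and illustrative):

#include <memory>
#include <string>

// Hypothetical wrapper; the real plugin would manage its qpid-proton
// connection here.
struct amqp_connection
{
    explicit amqp_connection(std::string url) : url_{std::move(url)} { /* open */ }
    ~amqp_connection() { /* close */ }
    void publish(const std::string&) { /* reuse the already-open connection */ }
    std::string url_;
};

// One instance per agent process, alive from start() to stop().
std::unique_ptr<amqp_connection> g_connection;

void start(const std::string& broker_url)
{
    g_connection = std::make_unique<amqp_connection>(broker_url);
}

void exec_rule(const std::string& audit_json)
{
    // Every PEP that fires in this agent reuses the same connection.
    g_connection->publish(audit_json);
}

void stop()
{
    g_connection.reset(); // destructor closes the connection
}

int main()
{
    start("amqp://localhost:5672");
    exec_rule(R"({"pep":"example"})");
    stop();
}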

trel commented 1 year ago

That seems reasonable and good.

And we'll still see a performance gain since an Agent can send hundreds of messages if all PEPs are configured to fire.

Will be fun to measure and compare.

wijnanjo commented 1 week ago

We adapted the iRODS plugin to send messages to Kafka instead of over AMQP. Now we see a pretty high rate of new connections to Kafka (coming from the plugin), which decreases the performance of our Kafka broker.

Is it safe to create a singleton Kafka client in the plugin (so a single client per iRODS process instead of per agent)? And if yes, how exactly? We're lacking C++ experience, so any help is very welcome.

Using a long-lived Kafka client is a best practice, just as it is for AMQP connections, and the Kafka client already takes care of message housekeeping like buffering and flushing.

korydraughn commented 1 week ago

If I'm understanding correctly, you're asking whether there can be a single Kafka client per iRODS server rather than per agent. Does that sound correct?

If yes, then iRODS doesn't provide any mechanism for doing that yet. There are ways around that limitation, though. Without going into detail: you could run a separate proxy process that owns the single client and have the agents forward messages to it, or you could create the client in the plugin's start operation and destroy it in its stop operation.

With that said, both of those solutions could prove challenging depending on your C++ experience.
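
To make the second option concrete, here is a minimal sketch of a per-process singleton in standard C++, with a hypothetical kafka_client wrapper (e.g. around librdkafka) standing in for real client code. Keep in mind that each iRODS agent is its own process, so this yields one client per agent, not one per server:

#include <string>

// Hypothetical wrapper; a real implementation might hold an
// RdKafka::Producer from librdkafka here.
struct kafka_client
{
    explicit kafka_client(std::string brokers) : brokers_{std::move(brokers)} { /* connect once */ }
    void send(const std::string&, const std::string&) { /* buffered by the client */ }
    std::string brokers_;
};

kafka_client& client()
{
    // A function-local static is initialized exactly once, even with
    // concurrent callers (guaranteed since C++11), and lives until the
    // process exits, i.e. one client per agent process.
    static kafka_client instance{"broker1:9092"}; // illustrative broker list
    return instance;
}

void exec_rule(const std::string& audit_json)
{
    client().send("audit_topic", audit_json); // illustrative topic name
}

int main()
{
    exec_rule(R"({"pep":"example"})");
}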

wijnanjo commented 1 week ago

> If I'm understanding correctly, you're asking whether there can be a single Kafka client per iRODS server rather than per agent. Does that sound correct?

That's correct, and pretty important, since the iRODS server spawns a lot of agents (mainly due to Globus transfers creating a connection for each file to be transferred, if I'm not mistaken). We already implemented your second suggestion (that is, creating the Kafka client in the plugin's start operation and destroying it in stop). But that only partly solves our issue.

We'll go for a proxy solution.
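
For what it's worth, the agent side of such a proxy can be quite small. A minimal sketch, assuming a local Unix datagram socket at an illustrative path and a separate daemon (not shown) that owns the single long-lived broker connection; error handling is elided:

#include <sys/socket.h>
#include <sys/un.h>
#include <cstring>
#include <string>

// Agent side: hand the message to the local proxy daemon. Opening a local
// socket is cheap; the expensive broker connection lives in the daemon.
void send_to_proxy(const std::string& msg)
{
    static const int fd = [] {
        const int s = ::socket(AF_UNIX, SOCK_DGRAM, 0);
        sockaddr_un addr{};
        addr.sun_family = AF_UNIX;
        std::strncpy(addr.sun_path, "/var/run/irods_audit_proxy.sock",
                     sizeof(addr.sun_path) - 1);
        ::connect(s, reinterpret_cast<const sockaddr*>(&addr), sizeof(addr));
        return s; // one socket per agent process, reused for every message
    }();

    ::send(fd, msg.data(), msg.size(), 0);
}

int main()
{
    send_to_proxy(R"({"pep":"example"})");
}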

trel commented 1 week ago

Hi @wijnanjo, would love to know where you are doing this work. Please consider making contact at info@irods.org if mentioning here is not desired. Thanks.

trel commented 4 days ago

new 5.0 process model

or

a completely separate binary/service (k8s?)


Very interested in any performance / throughput numbers with today's various solutions.