auditing: construct metrics probe during service construction

michael-redpanda commented 3 days ago

Move the construction of the audit probe from service start to service construction. Previously, the audit probe was created as a unique pointer when the service was started and destroyed when the service stopped.

Since the audit service has the same lifespan as the application, creating the probe during service construction reduces the risk of dereferencing a null pointer if the probe is accessed before the service has started.

Backports Required

[ ] none - not a bug fix
[ ] none - this is a backport
[ ] none - issue does not exist in previous branches
[ ] none - papercut/not impactful enough to backport
[X] v24.2.x
[X] v24.1.x
[ ] v23.3.x

Release Notes

Bug Fixes

Construct audit metrics probe during service initialization to prevent null pointer access.

michael-redpanda commented 2 days ago

This looks safe, but I have concerns that the service is being used before it is started.

Yeah I was discussing exactly this with @oleiman yesterday. The problem is that I don't think we can kick off the sink until after the kafka API is available.

For example, it's not clear without deep inspection of the code that calling drain is safe before the sink has been started.

So the drain timer isn't armed until after the sink has been started so I believe this is safe.

It's also not clear that client_shard_id has been constructed before events can be received on other shards.

It should be constructed when the service is constructed, no?

I think it may make sense to set the drain timer in start(), and move stuff from start to the constructor where possible.

It is, in a somewhat round-about way based on the value of audit_enabled

Overall, it would be better to fix this by starting the service before it is used..

So again, I don't think we can start the sink before the kafka service is up. One idea I had was to start the service very early and then enable the sink via another method once the kafka API is up and running.

BenPope commented 2 days ago

This looks safe, but I have concerns that the service is being used before it is started.

Yeah I was discussing exactly this with @oleiman yesterday. The problem is that I don't think we can kick off the sink until after the kafka API is available.

Agreed.

For example, it's not clear without deep inspection of the code that calling drain is safe before the sink has been started.

So the drain timer isn't armed until after the sink has been started so I believe this is safe.

OK, I see it's ultimately called from audit_log_manager::start().

It's also not clear that client_shard_id has been constructed before events can be received on other shards.

It should be constructed when the service is constructed, no?

I think I fumbled that comment. I was talking about audit_log_manager::_sink on shard client_id_shard, the concern was that the shards are created without an ordering. But if drain cannot be called before audit_log_manager::start has been called (despite receiving events), I think it's safe.

I think it may make sense to set the drain timer in start(), and move stuff from start to the constructor where possible.

It is, in a somewhat round-about way based on the value of audit_enabled

Yeah, after digging around, I see that.

Overall, it would be better to fix this by starting the service before it is used..

So again, I don't think we can start the sink before the kafka service is up. One idea I had was to start the service very early and then enable the sink via another method once the kafka API is up and running.

I get it now.

vbotbuildovich commented 2 days ago

/backport v24.2.x

vbotbuildovich commented 2 days ago

/backport v24.1.x

redpanda-data / redpanda