strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

Default Kafka user quotas are applied also to internal users #10367

Open im-konge opened 3 months ago

im-konge commented 3 months ago

When the default Kafka quotas plugin is configured in the .spec.kafka.quotas section of the Kafka resource, the quotas are applied to all users as default quotas. That means they also apply to the internal users, which can then hit them. For example, when we set the controller mutation rate quota, the Topic Operator can hit it during some of its operations.

For the strimzi quotas plugin type, this is handled using the "excluded principals" option of the plugin, where we are adding the internal users together with those specified inside the .spec.kafka.quotas section of the Kafka resource - so they are all excluded from the quotas.

But for the default Kafka quotas plugin, there is no such option that we can use.
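The difference between the two plugin types can be sketched with a small model. This is only an illustration of the behavior described above, not Strimzi or Kafka code: the function names and the principal names are hypothetical, and the real plugins evaluate principals inside the broker.

```python
# Illustrative model only; this is not Strimzi or Kafka code.
# All function and principal names below are hypothetical placeholders.

def build_excluded_principals(internal_users, user_specified):
    """Merge internal and user-specified principals, keeping order and dropping duplicates."""
    merged = []
    seen = set()
    for principal in list(internal_users) + list(user_specified):
        if principal not in seen:
            seen.add(principal)
            merged.append(principal)
    return merged


def is_subject_to_default_quota(principal, excluded_principals=None):
    """Return True when the default quota applies to the principal.

    With no exclusion list (modelling the default Kafka plugin as described
    in this issue), every principal is subject to the default quota.
    """
    if excluded_principals is None:
        return True
    return principal not in excluded_principals


INTERNAL_USERS = ["internal-topic-operator", "internal-cruise-control"]  # hypothetical names
excluded = build_excluded_principals(
    INTERNAL_USERS, ["my-admin", "internal-topic-operator"]
)
```

In this model, calling `is_subject_to_default_quota("internal-topic-operator")` without an exclusion list returns `True`, which mirrors the problem reported here, while passing `excluded` returns `False` for the internal users and any principals the user listed.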

To solve this, we could set the quotas to null values for the internal users when the default quotas are configured in the Kafka resource. However, this is not that easy, as the quotas would be removed by the User Operator when they are created. Also, the information about the internal users would be accessible via the Kafka Admin API. This would not be trivial, and it would require a proposal covering the whole approach and all the components that would need changes (Cluster Operator, User Operator, ...).

Another option is to describe this behavior in our documentation, as limiting the internal users as well may be desired. This would be the simplest way, but on the other hand it can cause issues, for example when someone wants to limit all other users while keeping the TO and other components and their users without limitations.

We should discuss how to proceed with this or if there are other options that we should take into account.

scholzj commented 3 months ago

Triaged on the Community call on 8.8.2024: @im-konge will prepare a summary of what Strimzi parts might be affected by this and how.

im-konge commented 2 months ago

These parts are (in my opinion and to my knowledge) affected by this issue:

scholzj commented 2 months ago

  • CruiseControl -> IIRC CC sends messages to some internal topic to generate the model for rebalancing. When the user sets the produce and fetch quotas, CC can be affected as well.

So, what do we consider the minimal produce / fetch limit for Cruise Control to work?

  • I think that other components like MM2 or Connect/Connector can be affected as well, when we set the default quotas for produce and fetch.

I do not think we care. The user deploys them separately.

  • I'm not sure if the User Operator is affected, as the quotas should not be (but maybe I'm wrong) applied to the creation of users and the management of additional quotas.

It manages SCRAM-SHA users, ACLs and quotas. Does the mutation rate apply to that as well? Or is it only topics?

im-konge commented 2 months ago

It manages SCRAM-SHA users, ACLs and quotas. Does the mutation rate apply to that as well? Or is it only topics?

From what I read, it is only topics.

So, what do we consider the minimal produce / fetch limit for Cruise Control to work?

I don't know .. @kyguy do you have an idea?

scholzj commented 2 months ago

Discussed on the community call on 5.9.2024: This should be documented as a warning for the users. We should make it clear:

kyguy commented 1 month ago

That Cruise Control and Cruise Control metrics reporter need some minimal values to work properly (@kyguy will try to provide some values needed by Cruise Control)

Sorry I dropped the ball on this, let me do some calculations and provide an estimate for this tomorrow

kyguy commented 1 week ago

That Cruise Control and Cruise Control metrics reporter need some minimal values to work properly (@kyguy will try to provide some values needed by Cruise Control)

Apologies for the delay, here are some minimal produce/fetch limits (producer_byte_rate/consumer_byte_rate) for Cruise Control producer/consumers that should suffice for small clusters with default Cruise Control configurations