
rpk: cluster config edit incorrectly represents map object #10251

Open daisukebe opened 1 year ago

daisukebe commented 1 year ago

Version & Environment

Redpanda version: (use rpk version): e133de2f

When a config value is a map, rpk incorrectly represents the value, which causes the Redpanda server to fail to parse it. An example property is kafka_client_group_byte_rate_quota.

What went wrong?

When the property already has valid values, rpk cluster config edit can't change anything, because the server fails to parse the bad YAML that rpk produces. In the example below, I fail to change default_topic_partitions to 11 due to a validation error for kafka_client_group_byte_rate_quota, which is untouched.

$ rpk cluster config edit
PROPERTY                            PRIOR                                                                                                                 NEW
default_topic_partitions            10                                                                                                                    11
kafka_client_group_byte_rate_quota  [map[clients_prefix:redpanda group_name:group2 quota:10240] map[clients_prefix:kafka group_name:group1 quota:10240]]  [map[clients_prefix:redpanda group_name:group2 quota:10240] map[clients_prefix:kafka group_name:group1 quota:10240]]

unable to edit: error updating config: Validation errors:
 * kafka_client_group_byte_rate_quota: YAML::BadSubscript (yaml-cpp: error at line 1, column 2: operator[] call on a scalar (key: "group_name"))

In the editor, the property values look like this.

# Per-group target produce quota byte rate (bytes per second). Client is considered part of the group if client_id contains clients_prefix (e.g. '[{'group_name': 'first_group','clients_prefix': 'group_1','quota': 10240}]')
kafka_client_group_byte_rate_quota:
    - map[clients_prefix:redpanda group_name:group2 quota:10240]
    - map[clients_prefix:kafka group_name:group1 quota:10240]
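The map[...] lines look like Go's default fmt rendering of the decoded value rather than YAML, which suggests rpk stringifies the value instead of re-encoding it (my guess at the root cause, not confirmed from rpk's source). A minimal sketch of that formatting behavior:

package main

import "fmt"

func main() {
	// The decoded config value, held as a generic Go structure.
	quotas := []map[string]any{
		{"group_name": "group1", "clients_prefix": "kafka", "quota": 10240},
		{"group_name": "group2", "clients_prefix": "redpanda", "quota": 10240},
	}

	// fmt's default %v verb prints Go's internal map syntax (keys sorted),
	// which is exactly the form seen in the editor -- and is not valid YAML.
	fmt.Printf("%v\n", quotas)
	// [map[clients_prefix:kafka group_name:group1 quota:10240] map[clients_prefix:redpanda group_name:group2 quota:10240]]
}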

What should have happened instead?

It should be something like this, which Redpanda can parse:

kafka_client_group_byte_rate_quota:
   - {"group_name": "group1","clients_prefix": "kafka","quota": 10240}
   - {"group_name": "group2","clients_prefix": "redpanda","quota": 10240}

How to reproduce the issue?

By default, the property is empty:

$ rpk cluster config get kafka_client_group_byte_rate_quota
[]

Adding one rule (clients_prefix: kafka):

$ rpk cluster config set kafka_client_group_byte_rate_quota '[{"group_name": "group1","clients_prefix": "kafka","quota": 10240}]'
$ rpk cluster config get kafka_client_group_byte_rate_quota
- clients_prefix: kafka
  group_name: group1
  quota: 10240

Adding a second rule (clients_prefix: redpanda):

$ rpk cluster config set kafka_client_group_byte_rate_quota '[{"group_name": "group1","clients_prefix": "kafka","quota": 10240},{"group_name": "group2","clients_prefix": "redpanda","quota": 10240}]'
$ rpk cluster config get kafka_client_group_byte_rate_quota
- clients_prefix: redpanda
  group_name: group2
  quota: 10240
- clients_prefix: kafka
  group_name: group1
  quota: 10240

Additional information

Redpanda is running v23.1.7 and complains as below:

WARN  2023-04-21 15:50:01,220 [shard 0] admin_api_server - admin_server.cc:433 - [_anonymous] exception intercepted - url: [http://127.0.0.1:9644/v1/cluster_config] http_return_status[400] reason - seastar::httpd::base_exception ({"kafka_client_group_byte_rate_quota":"YAML::BadSubscript (yaml-cpp: error at line 1, column 2: operator[] call on a scalar (key: \"group_name\"))"})

JIRA Link: CORE-1286

jcsp commented 1 year ago

@ZeDRoman we should consider whether we really want these properties to be maps.

jcsp commented 1 year ago

@r-vasquez looks like what's happened here is that two map-type configuration properties were added in 23.1.x, without picking up that rpk cluster config import/export/edit don't handle maps.

It's a concern that the automated tests in cluster_config_test.py didn't pick this up, because they are meant to do a roundtrip of import/export to check for exactly this kind of thing -- perhaps this happened because those new properties were empty and therefore survived the roundtrip in that state.

r-vasquez commented 1 year ago

Thanks for the context @jcsp

I see that the schema returns two properties that follow the same pattern:

{
    "meta.name": "kafka_client_group_fetch_byte_rate_quota",
    "meta.type": "array",
    "meta.items.type": "config::client_group_quota"
},
{
    "meta.name": "kafka_client_group_byte_rate_quota",
    "meta.type": "array",
    "meta.items.type": "config::client_group_quota"
},

To identify those, is it safe to parse properties based on the presence of ::? Something like:

meta.type == array && strings.Contains(meta.items.type, "::") 
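Fleshed out, that check might look like the following (a rough sketch; the struct and field names are illustrative, mirroring the flattened schema entries above):

package main

import (
	"fmt"
	"strings"
)

// propertyMeta mirrors one flattened schema entry as returned above.
type propertyMeta struct {
	Name      string // meta.name
	Type      string // meta.type
	ItemsType string // meta.items.type
}

// needsMapHandling reports whether a property is an array whose element type
// is a structured C++ config type (named with "::") rather than a basic scalar.
func needsMapHandling(p propertyMeta) bool {
	return p.Type == "array" && strings.Contains(p.ItemsType, "::")
}

func main() {
	p := propertyMeta{
		Name:      "kafka_client_group_byte_rate_quota",
		Type:      "array",
		ItemsType: "config::client_group_quota",
	}
	fmt.Println(needsMapHandling(p)) // true
}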

W.r.t. the test: yes, both properties have null as their default, so they survived the roundtrip. I'll add those properties to the test.

jcsp commented 1 year ago

To identify those, is it safe to parse properties based on the presence of ::? Something like:

In general, no, but we might need to make this some kind of special hack for these properties. This is the first time someone has added something to cluster config that wasn't a basic type, so it is relying on whoever consumes the schema to know what "config::client_group_quota" means, which isn't great.

The type being reported as array is a bug. Basically these two properties have bogus schema information :-(

twmb commented 1 year ago

ISTM that we should avoid trying to hack around this in rpk at the moment, leave this undocumented, and fix this first in redpanda and then in rpk?

ZeDRoman commented 1 year ago

The "one or many" part of one_or_many_map_property seems unnecessary and can cause confusion because it gives two different ways to represent a single entry map. API clients doing reconciliation loops may be tripped up by this as well.

I used a similar approach to the simple one_or_many_property. Maybe @dotnwat can respond to this comment

dlex commented 1 year ago

FYI, I've made conversion_binding<> for a similar case where I needed the configuration data in a set, but @jcsp pointed out the downsides of having a set in a cluster property. So I've ended up with a vector in the property and a bitmap in the binding (it could be a set as well). Not in dev yet, but I hope to merge #10285 soon.

dotnwat commented 1 year ago

I used a similar approach to the simple one_or_many_property. Maybe @dotnwat can respond to this comment

What's the question, @ZeDRoman? Generally I agree with others that one_or_many_property should be discouraged or removed if possible. Historically it was added purely as a curiosity of the YAML API. I probably should have removed it before it ended up infecting things.

mattschumpert commented 1 year ago

Is there a consensus that this can be fixed? I hesitate to document e.g. the per-client/group quotas (which are a useful answer to Kafka's user-quotas feature, which we lack) when there's this weird asterisk that you have to use the Admin API to configure them :/

ZeDRoman commented 1 year ago

Is there a consensus that this can be fixed? I hesitate to document e.g. the per-client/group quotas (which are a useful answer to Kafka's user-quotas feature, which we lack) when there's this weird asterisk that you have to use the Admin API to configure them :/

I am thinking about the best solution. I will answer later.

mattschumpert commented 1 year ago

Thank you

twmb commented 11 months ago

Did this get done?

piyushredpanda commented 11 months ago

Nope, it did not.

mattschumpert commented 11 months ago

Can we find even a basic/short-term solution to this? It presents to the user as a significant bug.

piyushredpanda commented 11 months ago

We are chasing 5-7 such short-term things, unfortunately, and there's a risk that they won't land by v23.3.1 anyway. So yes, we can, but not for another couple of months or so.

mattschumpert commented 11 months ago

@piyushredpanda I wonder if we'll experience this with the config constraints. @BenPope see above, will that be a map type?

rl0nergan commented 3 months ago

Hey, have there been any updates on this, or does anyone have any known workarounds, especially when running in k8s with the redpanda helm chart?

My team is building a metrics pipeline using redpanda, and rate limiting is a pretty important feature for us. But once we set kafka_client_group_byte_rate_quota via the helm chart, we can no longer make any config changes until we exec into a redpanda pod and basically re-run the script from the redpanda-configuration job, except that after the export we have to manually remove the rate-limit values.
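One possible interim workaround (a sketch only; the endpoint appears in the log above, but the payload shape is my assumption -- verify it against the Admin API reference for your version) is to set other properties through the Admin API directly, leaving the quota property untouched:

$ curl -s -X PUT http://127.0.0.1:9644/v1/cluster_config \
    -H 'Content-Type: application/json' \
    -d '{"upsert": {"default_topic_partitions": 11}, "remove": []}'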

mattschumpert commented 3 months ago

Tagging @chrisseto to clarify the Helm behavior.

But @rl0nergan FYI: 24.2 will add support for setting client quotas via the Kafka AlterClientQuotas API, using per-node units instead of per-core ones, which may prove easier to manage. The configs will be deprecated but will remain for 1-2 releases.

chrisseto commented 3 months ago

That's more or less expected behavior given this bug. The chart uses export and import to set configs in bulk. If RPK itself isn't playing nicely with this field, there's not much the helm chart can do.
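For reference, the bulk flow is roughly this (a sketch of what the chart's configuration job does; exact flags may differ across rpk versions):

$ rpk cluster config export --filename /tmp/cluster-config.yaml
# (desired values get merged into the exported file here)
$ rpk cluster config import --filename /tmp/cluster-config.yaml

With this bug, the exported file already contains the unparseable map[...] lines, so the import step fails validation until those lines are removed by hand.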

rl0nergan commented 3 months ago

Figured that would be the case, thanks for confirming.