redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

admin api: add support for batch requests in place of current per-partition requests #14267

Open twmb opened 11 months ago

twmb commented 11 months ago

Who is this for and what problem do they have today?

Today, many admin APIs are per-partition based. If we want to query or act on a bunch of partitions at once (e.g., a whole topic, or multiple topics), we end up issuing dozens to thousands of admin API calls, depending on the number of partitions involved.

Two easy examples of this are `rpk topic describe-storage` and PR #13684, which introduces `rpk cluster partitions move`. In the former, we describe tiered storage for a whole topic, but the admin API is partition based: we first have to discover the number of partitions in the topic, and then issue a request per partition. In the PR, the main holdup in review right now is that we have to issue two to three API calls per partition to learn what is necessary and then actually execute each move.

When a broker is remote, API calls have a higher risk of failing. If we assume that 99% of HTTP requests succeed, then with only three API calls to issue, I'd have a 97% chance (0.99^3) of succeeding -- most of the time, my rpk command would work. If instead I have to issue three per-partition requests against a topic with 100 partitions, I have only a 5% chance of succeeding (0.99^300). We set ourselves up for partial failure whenever we execute bulk operations one item at a time.
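To make the arithmetic concrete, here is a minimal sketch (in Go, since rpk is written in Go) of the compound success probability; the 99% per-request figure is the assumption from above:

```go
package main

import (
	"fmt"
	"math"
)

// allSucceed returns the probability that n independent calls all
// succeed when each one succeeds with probability p.
func allSucceed(p float64, n int) float64 {
	return math.Pow(p, float64(n))
}

func main() {
	// A fixed three-call workflow: ~97% of commands fully succeed.
	fmt.Printf("3 calls:   %.1f%%\n", 100*allSucceed(0.99, 3))
	// Three calls per partition on a 100-partition topic: ~4.9%.
	fmt.Printf("300 calls: %.1f%%\n", 100*allSucceed(0.99, 300))
}
```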

A remote broker is also slow to talk to -- and if any single request has a 5% chance of being slow due to internet weather or other problems, then across hundreds of sequential requests it is all but guaranteed that every bulk command today is slower than it needs to be.

This problem also affects Console: some users have thousands of partitions per topic. If we want to show aggregated tiered storage data (i.e., how much is local and how much is remote), a single load of a Console page issues thousands of admin API calls. These time out, users see an error, and they have a worse experience.

Kafka has realized the importance of batching and is slowly converting many of its requests (particularly admin operation requests) to batched forms. We should likewise recognize its importance and support batching wherever possible.

What are the success criteria?

Every per-partition or per-topic admin API has a corresponding batch API that accepts topics and partitions in a request body.
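As a rough illustration only (the field names and shapes below are hypothetical, not a committed design), a batch request body could carry a list of topic/partition targets, sketched here in Go:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// batchTarget and batchRequest are hypothetical shapes; the real
// schema would be settled during API review.
type batchTarget struct {
	Namespace  string `json:"ns"`
	Topic      string `json:"topic"`
	Partitions []int  `json:"partitions,omitempty"` // empty means every partition
}

type batchRequest struct {
	Targets []batchTarget `json:"targets"`
}

func main() {
	req := batchRequest{Targets: []batchTarget{
		{Namespace: "kafka", Topic: "foo", Partitions: []int{0, 1, 2}},
		{Namespace: "kafka", Topic: "bar"}, // the whole topic
	}}
	body, _ := json.MarshalIndent(req, "", "  ")
	fmt.Println(string(body)) // would be POSTed to a hypothetical batch endpoint
}
```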

JIRA Link: CORE-1516

michael-redpanda commented 10 months ago

Is this effectively asking for an endpoint that accepts a JSON object or returns one?

piyushredpanda commented 10 months ago

I'd imagine we also need to paginate the response, given that a single request could span hundreds of topics/partitions?

twmb commented 10 months ago

Yes: endpoints that accept either JSON arrays or objects (likely arrays) and return them. Most endpoints today return JSON objects; we'd likely want that changed to an array.

Pagination may be important but is not critically required, for a few reasons. Most of the Kafka APIs should be paginated but aren't, and systems live with that. The way Kafka gets around this is with filters -- e.g., in the schema registry, the `/schemas` endpoint has a `subjectPrefix` query parameter. More importantly for us, the endpoints where batching matters most are ones where the client controls the scope of the request (i.e., I want data on these 30 partitions, or on this one topic [including all its partitions]) -- so the client already has some control over the response size to begin with. I'd argue pagination is simply not required to start; the main thing we're looking for is to collapse a few hundred API calls into one.

michael-redpanda commented 10 months ago

Is the ask that, on top of batching, there also be the ability to filter?

Let's take the `/v1/debug/partition/{namespace}/{topic}/{partition}` endpoint (I know it's a debug endpoint, but I think it illustrates the point).

We could create a batch endpoint called `/v1/debug/partition` that returns everything. That would be a lot. Are you also asking for the ability to add query parameters like `namespace=` and `topic=`? Does their absence mean "everything"?
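Purely as an illustration of that shape (the host/port is an assumption based on the default admin API port, and the query parameters are the ones proposed just above, not an existing API):

```go
package main

import (
	"fmt"
	"net/url"
)

func main() {
	// Hypothetical batch endpoint; filters narrow the response.
	u := url.URL{Scheme: "http", Host: "localhost:9644", Path: "/v1/debug/partition"}
	q := u.Query()
	q.Set("namespace", "kafka") // omit to mean every namespace
	q.Set("topic", "foo")       // omit to mean every topic
	u.RawQuery = q.Encode()
	// Prints: http://localhost:9644/v1/debug/partition?namespace=kafka&topic=foo
	fmt.Println(u.String())
}
```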

twmb commented 10 months ago

I think we can discuss it more to see what makes the most sense -- I don't want to prescribe the best way up front. For example, it'd be acceptable to me to reject empty input and only reply with everything if the user specifically requests `*` (as a glob or regex filter).

Or, if we agree that pagination is the way to go, that's good with me too.

michael-redpanda commented 10 months ago

I think filtering may be easier (less time) than implementing pagination. We will probably have to implement pagination at some point, but that may be a longer-term thing.