Multi-cluster Resource Management

prithvip commented 2 months ago

Motivation

Currently, each Presto cluster owns its own resource management and queue. The resource manager of each cluster decides which query in the cluster’s queue should be executed next, based on logic described in the cluster’s resource group configuration.

This status quo of cluster-level queueing works fine to manage one or a few clusters, but quickly becomes problematic when managing a deployment of tens or hundreds of Presto clusters. Some issues include:

Due to the queue being fragmented across multiple clusters, and the admission decision being made at the cluster level, resource group admission decisions are not correct globally, although they could be correct locally. For example, it is possible that one cluster might admit a query of lower priority than a query in another cluster’s queue.
Since query concurrency control is applied on a per-cluster basis, global concurrency control can only be achieved, if the concurrency limit is a multiple of the number of clusters. For example, it is currently impossible to enforce a resource group with a concurrency limit of 1 query, across multiple clusters.
Load-balancing queries effectively across multiple clusters requires the load balancer to understand the resource group and queue state of each cluster. This is difficult to achieve, and in practice, stale information leads to poor load balancing.
Since the query is queued on the cluster itself, restarting clusters for maintenance reasons requires either that the queued queries be killed, or a lengthy drain time.

High-level Proposal

We propose to add support in Presto for an optional queueing service that sits upstream of multiple Presto clusters. This queueing service will serve as the global entry-point for all queries from the client and will implement the QueuedStatement protocol. This queueing service is responsible for:

Authenticating the request
Adding the query to the global queue
Responding to clients polling for the queueing status of their query
Picking the next query to execute
Selecting the best cluster for execution of this query
Dispatching this query to the cluster
Redirecting the client to the picked cluster

The internal details of this queueing service are still under development, and this document focuses on the changes required to Presto for such a service to work.

Diagram

Presto Github Issue GRM

Design Constraints

There will be no client-server protocol changes. The queueing service should be able to be added without any changes to clients.
There should be minimal impact to the latency of end-to-end query execution.
There should be no impact users of Presto who do not implement or use this queueing service.

High-level Work Items

Because we want this to be transparent to Presto clients, the front-end of the queueing service will reuse the existing functionality of classes in the presto-dispatcher package such as QueuedStatementResource, DispatchManager, DispatchQuery, etc. The high-level work items are:

Allow for the queueing service to forward all required query metadata to the cluster, such as query ID, authenticated identity, query state timings, etc. This will involve changes to QueuedStatementResource, QueryStateMachine, and DispatchManager, so that coordinator can create a query with an already populated QueryId and AuthorizedIdentity.
De-couple the dispatching and execution phases of the query lifecycle. This will involve separating and removing dependencies that are required for execution of the query, but not for queueing, such as Metadata class. We will create a new module, presto-dispatcher, that has all classes required for the queueing phase, such as QueuedStatementResource, DispatchManager, ResourceGroupManager, etc. The presto-main module will have a dependency on this new module. Interfaces such as ResourceGroupManager and DispatchQuery will be in presto-dispatcher, with implementations such as InternalResourceGroupManager and LocalDispatchQuery in presto-main. In addition, it could be necessary to move some classes into a common module, if they are required by both presto-main and presto-dispatcher.

prithvip commented 2 months ago

cc @tdcmeehan @rschlussel @arhimondr

tdcmeehan commented 2 months ago

@prithvip thank you for writing this up. Do you mind creating an RFC for it? For large changes, we want to start writing detailed RFCs so we know 1) the reasons why we made certain decisions and 2) we have easily accessible offline documentation that explains the thinking that went into these decisions.

prithvip commented 2 months ago

@tdcmeehan No problem, I've created an RFC PR here: https://github.com/prestodb/rfcs/pull/23

prestodb / presto