prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0

[Design] Disaggregated Presto Coordinators #15453

Open tdcmeehan opened 3 years ago

tdcmeehan commented 3 years ago

Presto Disaggregated Coordinator

Motivation

In Facebook, we find that coordinators have become bottlenecks to cluster scalability in a number of ways.

  1. Once we scale to some number of nodes in the hundreds, we find that coordinators frequently risk being slowed down by an excessively large number of tasks.
  2. In certain high-QPS use cases, we have found that workers can become starved of splits because excessive coordinator CPU is spent on task updates. This bottleneck in the coordinator can be alleviated by reducing concurrency, but that leaves the cluster under-utilized.

Furthermore, because of the de-facto constraint that there must be exactly one coordinator per cluster, the size of the worker pool is limited to whatever a single coordinator can handle under the QPS from conditions 1 and 2. This makes it very difficult to deploy large, high-QPS clusters that can take on queries of moderate complexity (such as queries with high stage counts).

In short, tasks are observed to be the chief bottleneck in scaling the coordinator. Since tasks are themselves partitioned by query, a natural way to scale the coordinator is to spread queries across multiple coordinators.

Architecture

(Architecture diagram)

Query Execution

Notes

  1. A load balancer will be provided which can properly balance load across the several coordinators that make up a single cluster.
  2. A new process, the Resource Manager, will be added to Presto. It is responsible for aggregating information across the cluster, namely query information and node information. There can be multiple resource managers in order to tolerate resource manager failures; each resource manager maintains a complete view of cluster resources.

Flow of execution

  1. The Presto client submits a POST for query execution. This goes to the load balancer, which redirects it to a coordinator; heuristics may be used to pick a coordinator (see the sketch after this list).
  2. The coordinator parses and analyzes the query and submits it to a resource group, as before. The resource group is periodically refreshed with cluster-level information from the resource manager.
  3. Before and after dequeueing from the resource group, heartbeats are sent to the resource manager with basic information about the query, including CPU utilization, the resource group selector, and memory pool utilization.
  4. The coordinator has a view of the entire worker fleet. However, more detailed information is fetched from the resource manager, which helps the coordinator select the least utilized nodes for execution.
  5. As queries execute, node-level information is sent back to the resource manager at regular intervals.
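
To make step 1 concrete, here is a minimal sketch of a submission using Presto's standard REST protocol. The hostname presto-lb.example.net is a hypothetical name for the load balancer that fronts the coordinator pool; it is not from the design above.

    # Submit a query through the load balancer (hypothetical host).
    # The load balancer forwards the POST to one of the coordinators;
    # the client then follows the nextUri in the response to poll for results.
    curl -X POST http://presto-lb.example.net:8080/v1/statement \
      -H "X-Presto-User: test" \
      -d "SELECT 1"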

UI

  1. Each coordinator may take the responsibility of servicing UI requests.
  2. Cluster-level information, such as the query list and cluster stats, is redirected to the resource manager, which has sufficient information to reconstruct the data behind these endpoints.
  3. Query-level requests, such as killing a query or examining query details, are serviced locally if the local coordinator owns that query; otherwise they are redirected to the resource manager, which then proxies the request to the appropriate coordinator (see the example after this list).
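
As an illustration, the cluster-level endpoints of the coordinator REST API can be hit through any coordinator behind the load balancer, and the routing described above is transparent to the caller. The hostname below is the same hypothetical load balancer name as in the earlier sketch.

    # Cluster-wide stats and the query list, served via whichever coordinator
    # the load balancer picks (hypothetical load balancer host).
    curl http://presto-lb.example.net:8080/v1/cluster
    curl http://presto-lb.example.net:8080/v1/query
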
tdcmeehan commented 3 years ago

Related to #10174

BlueStalker commented 3 years ago

@tdcmeehan thanks for the update. From the description above, it looks like the Resource Manager would be a new JVM process running separately from the main coordinator process. Are you going to design the RM in master-slave mode, or in a distributed mode where coordinators go to a random RM and the RMs internally exchange data?

l1x commented 3 years ago

Would it be possible to have a setup where every worker is a coordinator as well and the whole cluster load management is solved by a distributed state machine?

GithubZhitao commented 2 years ago

Is this topic still alive? A milestone will come soon.

l1x commented 2 years ago

> Is this topic still alive? A milestone will come soon.

I don't think it is going to happen. The latest plans include moving Presto to Spark, which will probably supersede this plan.

tdcmeehan commented 2 years ago

Hello @l1x and @GithubZhitao! Actually, this feature is complete, and we're preparing to add the public documentation so it's easy for everyone to get started using disaggregated coordinators. We've been in production with this feature at Meta for several months, and it has increased reliability for our clusters because multiple coordinators are able to serve traffic.

@l1x on the contrary, Presto-on-Spark is a totally separate initiative. Presto-on-Spark focuses on batch reliability, the ability to scale shuffle, and leveraging Spark's built-in checkpointing capabilities. These needs typically align with very large batch workloads. The two overlap in the sense that a Spark driver is typically launched client side, which makes the Spark driver inherently highly available. Multi-coordinator setups, however, are primarily intended to be used by those who:

  1. Encounter a performance bottleneck in the coordinator: this allows you to horizontally scale the coordinator.
  2. Would like to reduce the blast radius of coordinator unavailability: multiple coordinators work in parallel, cooperatively executing queries, and unavailability results in only a subset of query failures (rather than the entire cluster's queries failing).
  3. Need to run their coordinators on smaller machines: by scaling out the coordinator pool, you can run a smaller set of queries on each coordinator without reducing the capacity available to execute the queries.

We'll be adding documentation shortly. Once it's added, we'll close out this ticket, since it's code complete and used in a really tough production environment. :). Let me know if you have any other questions, or feel free to ask about it in the PrestoDB Slack. Thanks!

tdcmeehan commented 2 years ago

> Would it be possible to have a setup where every worker is a coordinator as well and the whole cluster load management is solved by a distributed state machine?

I didn't see this earlier. While using something like Raft would solve the problem of distributed consensus (needed primarily for resource groups), the primary motivation for separating workers from the coordinator pool is efficiency. It's still possible to execute in this way, but some features won't work or make sense (such as resource groups and memory management).

Workers and coordinators have totally different CPU needs. Workers operate over large chunks of data ("blocks"), which are efficiently garbage collected. They tend to run tightly optimized code loops ("vectorized" over these blocks), and their concurrency is limited to improve CPU efficiency and reduce context-switching overhead. Threads are spun up and prioritized by an internal scheduler, which is tightly controlled. Coordinators, on the other hand, use lots of discontiguous memory, need frequent and inefficient GC, and have limited concurrency control. This is because, as the control plane, they need to accept and initiate lots of connections, and in this environment it's better to have higher concurrency (and the context-switching overhead is relatively tolerable).

When one mixes these execution patterns, the benefits of limited concurrency on the worker may be diminished. There's also the problem that Presto lacks genuine isolation: the coordinator code and the worker code would execute in separate threads within the same JVM. This lack of isolation means that coordinator GC issues or deadlocks would reduce compute availability in the cluster, and likewise worker GC issues could reduce client availability. There may be ways to fix all of this in the future, but for now, given the current architecture, it's simply preferable to segregate coordinators from workers, especially in performance-sensitive settings.

l1x commented 2 years ago

@tdcmeehan thanks for the detailed answers.

GithubZhitao commented 2 years ago

@tdcmeehan thanks for your patience! Hoping to see this exciting feature soon.

ShubhamChaurasia commented 2 years ago

Thanks for all the efforts, this is a great feature.

@tdcmeehan @swapsmagic We wanted to try this out. From your comment above I understand that the official docs are being worked on, but I wanted to check whether there are some raw steps in a commit or PR that we can use to deploy the ResourceManager and multiple coordinators? What would be the basic steps to deploy and test it? Thanks!

swapsmagic commented 2 years ago

Recommended release version: 0.267

Below is the minimal configuration to enable a disaggregated coordinator setup:

Minimal Resource Manager Configuration

    resource-manager=true
    resource-manager-enabled=true
    coordinator=false
    node-scheduler.include-coordinator=false
    http-server.http.port=8080
    thrift.server.port=8081
    query.max-memory=50GB
    query.max-memory-per-node=1GB
    query.max-total-memory-per-node=2GB
    discovery-server.enabled=true
    discovery.uri=http://example.net:8080 (Point to resource manager host/vip)
    thrift.server.ssl.enabled=true

Minimal Coordinator configuration

    coordinator=true
    node-scheduler.include-coordinator=false
    http-server.http.port=8080
    query.max-memory=50GB
    query.max-memory-per-node=1GB
    query.max-total-memory-per-node=2GB
    discovery.uri=http://example.net:8080 (Point to resource manager host/vip)
    resource-manager-enabled=true

Minimal Worker Configuration

    coordinator=false
    http-server.http.port=8080
    query.max-memory=50GB
    query.max-memory-per-node=1GB
    query.max-total-memory-per-node=2GB
    discovery.uri=http://example.net:8080 (Point to resource manager host/vip)
    resource-manager-enabled=true
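
Once the resource managers, coordinators, and workers are all up, one way to sanity-check the setup (a suggestion, not part of the official steps) is to connect to either coordinator and list the registered nodes; coordinator1.example.net below is a hypothetical coordinator hostname:

    # Using the Presto CLI against one of the coordinators (hypothetical host).
    presto --server coordinator1.example.net:8080 --execute "SELECT * FROM system.runtime.nodes"

All coordinators, resource managers, and workers should show up as active.
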
ShubhamChaurasia commented 2 years ago

thanks @swapsmagic

zihuanglini commented 2 years ago

@tdcmeehan @swapsmagic Thanks for all the efforts, this is a great feature. But I have a question: when a cluster has multiple resource managers, how do we configure multiple addresses for discovery.uri? Something like discovery.uri=http://example.net:8080;http://abc.net:8080?

evenksAojing commented 2 years ago

@tdcmeehan @swapsmagic I have the same question. Is there a SPOF for the discovery service? Or how do we configure multiple resource managers in a cluster?

ShubhamChaurasia commented 2 years ago

@swapsmagic While testing I am seeing a difference in the number of nodes every time I run select * from system.runtime.nodes.

  1. Here is one of the outputs of the above query, where it contains all the nodes:

     ["presto-coordinator-a6c072f1-60fe-a694-f411-39a2f3998aa3", "http://10.0.169.125:8080", "0.272", true, "active"]
     ["presto-coordinator-eec072f1-60f4-eaa5-d5de-2f8860eaa03b", "http://10.0.129.170:8080", "0.272", true, "active"]
     ["presto-resource-manager-64c072f1-60b3-c0b2-7fd8-9a6a910abb23", "http://10.0.132.124:8080", "0.272", false, "active"]
     ["presto-resource-manager-c6c072f1-60d4-7f3f-8484-99b663a48f87", "http://10.0.128.161:8080", "0.272", false, "active"]
     ["presto-worker-44c072f1-6120-21f8-45d5-fe1b63da89f0", "http://10.0.167.120:8080", "0.272", false, "active"]
     ["presto-worker-88c072f1-6118-735c-9c10-8ff93d18e44d", "http://10.0.163.108:8080", "0.272", false, "active"]
     ["presto-worker-e6c072f1-6128-99d4-45d0-dd5703d4ed08", "http://10.0.147.62:8080", "0.272", false, "active"]

  2. Here is another where one of the resource managers is missing:

     ["presto-coordinator-a6c072f1-60fe-a694-f411-39a2f3998aa3", "http://10.0.169.125:8080", "0.272", true, "active"]
     ["presto-coordinator-eec072f1-60f4-eaa5-d5de-2f8860eaa03b", "http://10.0.129.170:8080", "0.272", true, "active"]
     ["presto-resource-manager-64c072f1-60b3-c0b2-7fd8-9a6a910abb23", "http://10.0.132.124:8080", "0.272", false, "active"]
     ["presto-worker-44c072f1-6120-21f8-45d5-fe1b63da89f0", "http://10.0.167.120:8080", "0.272", false, "active"]
     ["presto-worker-88c072f1-6118-735c-9c10-8ff93d18e44d", "http://10.0.163.108:8080", "0.272", false, "active"]
     ["presto-worker-e6c072f1-6128-99d4-45d0-dd5703d4ed08", "http://10.0.147.62:8080", "0.272", false, "active"]

  3. Here is one more where one of the workers and one of the resource managers are missing:

     ["presto-coordinator-a6c072f1-60fe-a694-f411-39a2f3998aa3", "http://10.0.169.125:8080", "0.272", true, "active"]
     ["presto-coordinator-eec072f1-60f4-eaa5-d5de-2f8860eaa03b", "http://10.0.129.170:8080", "0.272", true, "active"]
     ["presto-resource-manager-c6c072f1-60d4-7f3f-8484-99b663a48f87", "http://10.0.128.161:8080", "0.272", false, "active"]
     ["presto-worker-88c072f1-6118-735c-9c10-8ff93d18e44d", "http://10.0.163.108:8080", "0.272", false, "active"]
     ["presto-worker-e6c072f1-6128-99d4-45d0-dd5703d4ed08", "http://10.0.147.62:8080", "0.272", false, "active"]

Setup:

  1. 2 coordinators
  2. 2 resource managers
  3. 3 workers

Relevant Logs

I can see the following kinds of logs in both the RMs

2022-05-21T09:35:34.422Z    INFO    node-state-poller-0 com.facebook.presto.metadata.DiscoveryNodeManager   Previously active node is missing: presto-resource-manager-64c072f1-60b3-c0b2-7fd8-9a6a910abb23 (last seen at 10.0.132.124)
2022-05-21T09:35:34.422Z    INFO    node-state-poller-0 com.facebook.presto.metadata.DiscoveryNodeManager   Previously active node is missing: presto-worker-44c072f1-6120-21f8-45d5-fe1b63da89f0 (last seen at 10.0.167.120)
2022-05-21T09:36:14.427Z    INFO    node-state-poller-0 com.facebook.presto.metadata.DiscoveryNodeManager   Previously active node is missing: presto-worker-44c072f1-6120-21f8-45d5-fe1b63da89f0 (last seen at 10.0.167.120)
2022-05-21T09:36:54.431Z    INFO    node-state-poller-0 com.facebook.presto.metadata.DiscoveryNodeManager   Previously active node is missing: presto-worker-44c072f1-6120-21f8-45d5-fe1b63da89f0 (last seen at 10.0.167.120)
2022-05-21T09:37:34.436Z    INFO    node-state-poller-0 com.facebook.presto.metadata.DiscoveryNodeManager   Previously active node is missing: presto-resource-manager-64c072f1-60b3-c0b2-7fd8-9a6a910abb23 (last seen at 10.0.132.124)
2022-05-21T09:37:34.436Z    INFO    node-state-poller-0 com.facebook.presto.metadata.DiscoveryNodeManager   Previously active node is missing: presto-worker-44c072f1-6120-21f8-45d5-fe1b63da89f0 (last seen at 10.0.167.120)
2022-05-21T09:38:04.439Z    INFO    node-state-poller-0 com.facebook.presto.metadata.DiscoveryNodeManager   Previously active node is missing: presto-worker-e6c072f1-6128-99d4-45d0-dd5703d4ed08 (last seen at 10.0.147.62)
2022-05-21T09:38:14.441Z    INFO    node-state-poller-0 com.facebook.presto.metadata.DiscoveryNodeManager   Previously active node is missing: presto-worker-44c072f1-6120-21f8-45d5-fe1b63da89f0 (last seen at 10.0.167.120)
2022-05-21T09:38:44.445Z    INFO    node-state-poller-0 com.facebook.presto.metadata.DiscoveryNodeManager   Previously active node is missing: presto-resource-manager-64c072f1-60b3-c0b2-7fd8-9a6a910abb23 (last seen at 10.0.132.124)

I can confirm that the above nodes are reachable all the time. Is there a configurable sync time for state to be shared among different discovery server instances? Or could there be some other reason for this?

Thanks

swapsmagic commented 2 years ago

> @tdcmeehan @swapsmagic I have the same question. Is there a SPOF for the discovery service? Or how do we configure multiple resource managers in a cluster?

In the disaggregated coordinator architecture, the discovery server runs on the resource manager. To avoid a SPOF, you need to run multiple resource managers. To set them up, put them behind a VIP and provide that in discovery.uri as mentioned in the doc: discovery.uri=http://example.net:8080 (point to the resource manager host/VIP)
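
As a purely illustrative sketch (not from the Presto docs), the "VIP" can be any HTTP load balancer or DNS name that resolves to the resource manager hosts. For example, with nginx fronting two hypothetical resource manager hosts, the example.net:8080 endpoint used in the configs above might be served like this:

    # Hypothetical nginx config for the discovery VIP (example.net:8080).
    # Any equivalent load balancer or DNS-based VIP works the same way.
    upstream presto_resource_managers {
        server rm1.example.net:8080;
        server rm2.example.net:8080;
    }
    server {
        listen 8080;
        location / {
            proxy_pass http://presto_resource_managers;
        }
    }

Every coordinator and worker then keeps discovery.uri pointed at the VIP, as shown in the configurations above.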

swapsmagic commented 2 years ago

@ShubhamChaurasia Can you share the configuration that you used for all the components, and I will take a look at it. I can't think of a reason why this is happening, but will check again along with the configuration used.

ShubhamChaurasia commented 2 years ago

@swapsmagic Sure, please find the configs below -

Resource Manager

coordinator=false
resource-manager=true
resource-manager-enabled=true
discovery-server.enabled=true
http-server.threads.max=500
sink.max-buffer-size=1GB
query.max-memory=14746MB
query.max-memory-per-node=6601364734B
query.max-total-memory-per-node=7921637681B
query.max-history=40
query.min-expire-age=30m
query.client.timeout=30m
query.stage-count-warning-threshold=100
query.max-stage-count=150
http-server.http.port=8080
discovery.uri=${DISCOVERY_URI}

Coordinator

coordinator=true
node-scheduler.include-coordinator=false
resource-manager-enabled=true
discovery-server.enabled=false
http-server.threads.max=500
sink.max-buffer-size=1GB
query.max-memory=14746MB
query.max-memory-per-node=6601364734B
query.max-total-memory-per-node=7921637681B
query.max-history=40
query.min-expire-age=30m
query.client.timeout=30m
query.stage-count-warning-threshold=100
query.max-stage-count=150
http-server.http.port=8080
discovery.uri=${DISCOVERY_URI}

Worker

coordinator=false
resource-manager-enabled=true
http-server.threads.max=500
sink.max-buffer-size=1GB
query.max-memory=14746MB
query.max-memory-per-node=6601364734B
query.max-total-memory-per-node=7921637681B
query.max-history=40
query.min-expire-age=30m
query.client.timeout=30m
query.stage-count-warning-threshold=100
query.max-stage-count=150
http-server.http.port=8080
discovery.uri=${DISCOVERY_URI}

DISCOVERY_URI is a load balancer URI behind which we have two instances of RM.

LiuGuH commented 1 year ago

I can't find any documentation about this feature on the official website (https://prestodb.io/docs/current/). Is this feature released?

ShubhamChaurasia commented 1 year ago

@swapsmagic @cliandy Any luck with the number-of-nodes issue mentioned above?

yl09099 commented 1 year ago

Hello, I would like to ask a question: with distributed resource managers, can only one resource manager work at a time? I now have two resource managers deployed, two coordinators, and multiple workers. I can't see all of the workers in each coordinator, but the number of workers across the two coordinators adds up to the total number of workers I deployed. Why is this? @tdcmeehan

ShubhamChaurasia commented 1 year ago

@tdcmeehan @swapsmagic Is there a top-level PR or issue where I can find all the RM-related PRs/commits? https://github.com/prestodb/presto/pull/15479/commits contains some of them, but I guess it's not the complete list.

So far I have been able to locate the following -

  1. https://github.com/prestodb/presto/pull/15673/commits
  2. https://github.com/prestodb/presto/pull/15892/commits
  3. https://github.com/prestodb/presto/pull/15114/commits
  4. https://github.com/prestodb/presto/pull/15983/commits
  5. https://github.com/prestodb/presto/pull/16117/commits
  6. https://github.com/prestodb/presto/pull/15479/commits
  7. https://github.com/prestodb/presto/pull/15984/commits

Are there any more of them ?

hackeryang commented 1 year ago

First of all, this splendid feature was built on top of Thrift communication: https://github.com/prestodb/presto/pull/13894
Here are recent PRs related to this issue, in which we can track the commits by Tim Meehan and Swapnil Tailor:

  1. https://github.com/prestodb/presto/pull/15034
  2. https://github.com/prestodb/presto/pull/15256
  3. https://github.com/prestodb/presto/pull/15889
  4. https://github.com/prestodb/presto/pull/15992
  5. https://github.com/prestodb/presto/pull/16163
  6. https://github.com/prestodb/presto/pull/16244
  7. https://github.com/prestodb/presto/pull/16298
  8. https://github.com/prestodb/presto/pull/16299
  9. https://github.com/prestodb/presto/pull/16322
  10. https://github.com/prestodb/presto/pull/16334
  11. https://github.com/prestodb/presto/pull/16565
  12. https://github.com/prestodb/presto/pull/16673
  13. https://github.com/prestodb/presto/pull/16738
  14. https://github.com/prestodb/presto/pull/16770
  15. https://github.com/prestodb/presto/pull/16785
  16. https://github.com/prestodb/presto/pull/16787
  17. https://github.com/prestodb/presto/pull/16821
  18. https://github.com/prestodb/presto/pull/16956
  19. https://github.com/prestodb/presto/pull/17005
  20. https://github.com/prestodb/presto/pull/17554
  21. https://github.com/prestodb/presto/pull/17597
  22. https://github.com/prestodb/presto/pull/17652
  23. https://github.com/prestodb/presto/pull/17853
  24. https://github.com/prestodb/presto/pull/18038
  25. https://github.com/prestodb/presto/pull/18086
  26. https://github.com/prestodb/presto/pull/18146
  27. https://github.com/prestodb/presto/pull/18473
  28. https://github.com/prestodb/presto/pull/18543
  29. https://github.com/prestodb/presto/pull/18595
  30. https://github.com/prestodb/presto/pull/18687
  31. https://github.com/prestodb/presto/pull/18732
  32. https://github.com/prestodb/presto/pull/18994
Related blog: https://prestodb.io/blog/2022/04/15/disggregated-coordinator
Related video: https://www.youtube.com/watch?v=slwPm-mROZ0
0ZhangJc0 commented 5 months ago

> @tdcmeehan @swapsmagic I have the same question. Is there a SPOF for the discovery service? Or how do we configure multiple resource managers in a cluster?

> In the disaggregated coordinator architecture, the discovery server runs on the resource manager. To avoid a SPOF, you need to run multiple resource managers. To set them up, put them behind a VIP and provide that in discovery.uri as mentioned in the doc: discovery.uri=http://example.net:8080 (point to the resource manager host/VIP)

Will node-state information be synchronized between the multiple resource managers? Will previously running tasks on the cluster be killed when the VIP switches?

anitroslin01 commented 4 weeks ago

I was trying to achieve high availability using multiple Presto coordinators (a hot/warm deployment) so that if one coordinator crashes, the other will serve the queries. To implement this, I was planning to externalize the discovery service so that coordinators reach out to the discovery server to get the active worker count. In case of a coordinator failure, queries would be redirected to the standby coordinator via a load balancer. I am looking for a solution to externalize the discovery server from the coordinators. Does anyone have any clue about this? Is it possible to bring up a Presto container and convert it into a dedicated discovery server?