tdcmeehan opened this issue 4 years ago
Related to #10174
@tdcmeehan thanks for the update. From the description above, it looks like the Resource Manager would be a new JVM process running separately from the main coordinator process. Are you going to design the RM in master-slave mode, or in a distributed mode where coordinators go to a random RM and the RMs internally exchange data?
Would it be possible to have a setup where every worker is a coordinator as well and the whole cluster load management is solved by a distributed state machine?
Is this topic still alive? The milestone will come soon.
I don't think it is going to happen. The latest plans include moving Presto to Spark, which will probably cancel out this plan.
Hello @l1x and @GithubZhitao! Actually, this feature is complete, and we're preparing to add the public documentation so it's easy for everyone to get started using disaggregated coordinators. We've been in production with this feature at Meta for several months, and it has increased reliability for our clusters because multiple coordinators can serve traffic.
@l1x on the contrary, Presto-on-Spark is a totally separate initiative. Presto-on-Spark focuses on batch reliability, the ability to scale shuffle, and leveraging Spark's built-in checkpointing capabilities. These needs typically align with very large batch workloads. The two overlap in the sense that a Spark driver is typically launched client-side, which makes the Spark driver inherently highly available. Multi-coordinators, however, are primarily intended to be used by those who:
1) Encounter a performance bottleneck in the coordinator: this allows you to horizontally scale the coordinator.
2) Would like to reduce the blast radius of coordinator unavailability: this allows multiple coordinators to work in parallel, cooperatively executing queries, and unavailability results in only a subset of query failures (rather than the entire cluster's queries failing).
3) Need to run their coordinators on smaller machines: by scaling out the coordinator pool, you can run a smaller set of queries on each coordinator without reducing the capacity available to execute the queries.
We'll be adding documentation shortly. Once it's added, we'll close out this ticket, since the feature is code complete and used in a really tough production environment. :) Let me know if you have any other questions, or feel free to ask about it in the PrestoDB Slack. Thanks!
Would it be possible to have a setup where every worker is a coordinator as well and the whole cluster load management is solved by a distributed state machine?
I didn't see this earlier. While using something like Raft would solve the problem of distributed consensus (needed primarily for resource groups), the primary motivation for separating your workers from the coordinator pool is more around efficiency. It's still possible to execute in this way, but some features won't work or make sense (such as resource groups and memory management).
Workers and coordinators have totally different CPU needs. Workers operate over large chunks of data ("blocks"), which are efficiently garbage collected. They tend to run tightly optimized code loops ("vectorized" over these blocks), and their concurrency is limited to improve CPU efficiency and reduce context-switching overhead. Threads are spun up and prioritized by an internal scheduler, which is tightly controlled. Coordinators, on the other hand, use lots of discontiguous memory, need frequent and inefficient GC, and have limited concurrency control. This is because, as the control plane, they need to accept and initiate lots of connections, and in this environment it's better to have higher concurrency (and the context-switching overhead is relatively tolerable).
When one mixes these execution patterns, the benefits of limited concurrency on the worker may be diminished. There's also the problem that Presto lacks genuine isolation: the coordinator code and the worker code would execute in separate threads within the same JVM. This lack of isolation means that coordinator GC issues or deadlocks would reduce compute availability in the cluster, and likewise worker GC issues could reduce client availability. There may be ways to fix all of this in the future, but for now, given the current architecture, it's simply preferable to segregate coordinators from workers, especially in performance-sensitive settings.
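To make that contrast concrete, here is a small illustrative sketch of the kind of knobs involved. These are standard Presto configuration properties, but the values are hypothetical examples rather than recommendations from this thread:

# Worker etc/config.properties (sketch): execution concurrency is deliberately bounded
# so the tight, vectorized block-processing loops stay CPU-efficient.
task.concurrency=16
task.max-worker-threads=64

# Coordinator etc/config.properties (sketch): the control plane favors many
# request-handling threads for the connections it must accept and initiate.
http-server.threads.max=500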
@tdcmeehan thanks for the detailed answers.
@tdcmeehan thanks for your patience! Hoping to see this exciting feature soon.
Thanks for all the efforts, this is a great feature.
@tdcmeehan @swapsmagic We wanted to try this out. From your comment above I understand that the official docs are being worked on; however, I wanted to check whether there are some raw steps in a commit or PR that we can use to deploy the ResourceManager and multiple coordinators. What would be the basic steps to deploy and test it? Thanks!
Recommended release version: 0.267
Below is the minimal configuration to enable disagg coordinator:
Resource Manager
resource-manager=true
resource-manager-enabled=true
coordinator=false
node-scheduler.include-coordinator=false
http-server.http.port=8080
thrift.server.port=8081
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://example.net:8080 (Point to resource manager host/vip)
thrift.server.ssl.enabled=true
Coordinator
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://example.net:8080 (Point to resource manager host/vip)
resource-manager-enabled=true
Worker
coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://example.net:8080 (Point to resource manager host/vip)
resource-manager-enabled=true
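Once all the nodes are up, a quick sanity check (a sketch; run through the Presto CLI against any coordinator) is to list the registered nodes, the same query used later in this thread. Coordinators report coordinator = true, while resource managers and workers report false:

-- List every node registered with the cluster, along with its role and state.
SELECT node_id, http_uri, node_version, coordinator, state
FROM system.runtime.nodes;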
thanks @swapsmagic
@tdcmeehan @swapsmagic Thanks for all the efforts, this is a great feature. But I have a question: when a cluster has multiple resource managers, how do we configure multiple addresses for discovery.uri? Something like discovery.uri=http://example.net:8080;http://abc.net:8080?
@tdcmeehan @swapsmagic I have the same question. Is there a SPOF for the discovery service? And how do we configure multiple resource managers in a cluster?
@swapsmagic
While testing, I am seeing a different number of nodes every time I run select * from system.runtime.nodes.
[
[
"presto-coordinator-a6c072f1-60fe-a694-f411-39a2f3998aa3",
"http://10.0.169.125:8080",
"0.272",
true,
"active"
],
[
"presto-coordinator-eec072f1-60f4-eaa5-d5de-2f8860eaa03b",
"http://10.0.129.170:8080",
"0.272",
true,
"active"
],
[
"presto-resource-manager-64c072f1-60b3-c0b2-7fd8-9a6a910abb23",
"http://10.0.132.124:8080",
"0.272",
false,
"active"
],
[
"presto-resource-manager-c6c072f1-60d4-7f3f-8484-99b663a48f87",
"http://10.0.128.161:8080",
"0.272",
false,
"active"
],
[
"presto-worker-44c072f1-6120-21f8-45d5-fe1b63da89f0",
"http://10.0.167.120:8080",
"0.272",
false,
"active"
],
[
"presto-worker-88c072f1-6118-735c-9c10-8ff93d18e44d",
"http://10.0.163.108:8080",
"0.272",
false,
"active"
],
[
"presto-worker-e6c072f1-6128-99d4-45d0-dd5703d4ed08",
"http://10.0.147.62:8080",
"0.272",
false,
"active"
]
]
[
[
"presto-coordinator-a6c072f1-60fe-a694-f411-39a2f3998aa3",
"http://10.0.169.125:8080",
"0.272",
true,
"active"
],
[
"presto-coordinator-eec072f1-60f4-eaa5-d5de-2f8860eaa03b",
"http://10.0.129.170:8080",
"0.272",
true,
"active"
],
[
"presto-resource-manager-64c072f1-60b3-c0b2-7fd8-9a6a910abb23",
"http://10.0.132.124:8080",
"0.272",
false,
"active"
],
[
"presto-worker-44c072f1-6120-21f8-45d5-fe1b63da89f0",
"http://10.0.167.120:8080",
"0.272",
false,
"active"
],
[
"presto-worker-88c072f1-6118-735c-9c10-8ff93d18e44d",
"http://10.0.163.108:8080",
"0.272",
false,
"active"
],
[
"presto-worker-e6c072f1-6128-99d4-45d0-dd5703d4ed08",
"http://10.0.147.62:8080",
"0.272",
false,
"active"
]
]
[
[
"presto-coordinator-a6c072f1-60fe-a694-f411-39a2f3998aa3",
"http://10.0.169.125:8080",
"0.272",
true,
"active"
],
[
"presto-coordinator-eec072f1-60f4-eaa5-d5de-2f8860eaa03b",
"http://10.0.129.170:8080",
"0.272",
true,
"active"
],
[
"presto-resource-manager-c6c072f1-60d4-7f3f-8484-99b663a48f87",
"http://10.0.128.161:8080",
"0.272",
false,
"active"
],
[
"presto-worker-88c072f1-6118-735c-9c10-8ff93d18e44d",
"http://10.0.163.108:8080",
"0.272",
false,
"active"
],
[
"presto-worker-e6c072f1-6128-99d4-45d0-dd5703d4ed08",
"http://10.0.147.62:8080",
"0.272",
false,
"active"
]
]
All the nodes point to the same discovery.uri. I can see the following kinds of logs in both the RMs:
2022-05-21T09:35:34.422Z INFO node-state-poller-0 com.facebook.presto.metadata.DiscoveryNodeManager Previously active node is missing: presto-resource-manager-64c072f1-60b3-c0b2-7fd8-9a6a910abb23 (last seen at 10.0.132.124)
2022-05-21T09:35:34.422Z INFO node-state-poller-0 com.facebook.presto.metadata.DiscoveryNodeManager Previously active node is missing: presto-worker-44c072f1-6120-21f8-45d5-fe1b63da89f0 (last seen at 10.0.167.120)
2022-05-21T09:36:14.427Z INFO node-state-poller-0 com.facebook.presto.metadata.DiscoveryNodeManager Previously active node is missing: presto-worker-44c072f1-6120-21f8-45d5-fe1b63da89f0 (last seen at 10.0.167.120)
2022-05-21T09:36:54.431Z INFO node-state-poller-0 com.facebook.presto.metadata.DiscoveryNodeManager Previously active node is missing: presto-worker-44c072f1-6120-21f8-45d5-fe1b63da89f0 (last seen at 10.0.167.120)
2022-05-21T09:37:34.436Z INFO node-state-poller-0 com.facebook.presto.metadata.DiscoveryNodeManager Previously active node is missing: presto-resource-manager-64c072f1-60b3-c0b2-7fd8-9a6a910abb23 (last seen at 10.0.132.124)
2022-05-21T09:37:34.436Z INFO node-state-poller-0 com.facebook.presto.metadata.DiscoveryNodeManager Previously active node is missing: presto-worker-44c072f1-6120-21f8-45d5-fe1b63da89f0 (last seen at 10.0.167.120)
2022-05-21T09:38:04.439Z INFO node-state-poller-0 com.facebook.presto.metadata.DiscoveryNodeManager Previously active node is missing: presto-worker-e6c072f1-6128-99d4-45d0-dd5703d4ed08 (last seen at 10.0.147.62)
2022-05-21T09:38:14.441Z INFO node-state-poller-0 com.facebook.presto.metadata.DiscoveryNodeManager Previously active node is missing: presto-worker-44c072f1-6120-21f8-45d5-fe1b63da89f0 (last seen at 10.0.167.120)
2022-05-21T09:38:44.445Z INFO node-state-poller-0 com.facebook.presto.metadata.DiscoveryNodeManager Previously active node is missing: presto-resource-manager-64c072f1-60b3-c0b2-7fd8-9a6a910abb23 (last seen at 10.0.132.124)
I can confirm that the above nodes are reachable all the time. Is there a configurable sync time for state to be shared among different discovery server instances? Or could there be some other reason for this?
Thanks
@tdcmeehan @swapsmagic I have the same question. Is there a SPOF for the discovery service? And how do we configure multiple resource managers in a cluster?
In the disagg coordinator architecture, the discovery server runs on the resource manager. To avoid a SPOF you need to run multiple resource managers. To set them up, put them behind a VIP and provide that in discovery.uri as mentioned in the doc: discovery.uri=http://example.net:8080 (point to the resource manager host/VIP)
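For illustration, here is a minimal sketch of the client-side configuration with two resource managers behind a VIP. The hostnames are hypothetical, not taken from this thread:

# Suppose rm1.example.net:8080 and rm2.example.net:8080 both run the resource manager
# (each with discovery-server.enabled=true), and a load balancer/VIP in front of them
# is reachable at rm-vip.example.net:8080.
# Every coordinator and worker then points discovery at the VIP, not at an individual RM:
resource-manager-enabled=true
discovery.uri=http://rm-vip.example.net:8080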
@ShubhamChaurasia Can you share the configuration you used for all the components, and I will take a look at it? I can't think of a reason why this is happening, but will check again along with the configuration used.
@swapsmagic Sure, please find the configs below -
Resource Manager
coordinator=false
resource-manager=true
resource-manager-enabled=true
discovery-server.enabled=true
http-server.threads.max=500
sink.max-buffer-size=1GB
query.max-memory=14746MB
query.max-memory-per-node=6601364734B
query.max-total-memory-per-node=7921637681B
query.max-history=40
query.min-expire-age=30m
query.client.timeout=30m
query.stage-count-warning-threshold=100
query.max-stage-count=150
http-server.http.port=8080
discovery.uri=${DISCOVERY_URI}
Coordinator
coordinator=true
node-scheduler.include-coordinator=false
resource-manager-enabled=true
discovery-server.enabled=false
http-server.threads.max=500
sink.max-buffer-size=1GB
query.max-memory=14746MB
query.max-memory-per-node=6601364734B
query.max-total-memory-per-node=7921637681B
query.max-history=40
query.min-expire-age=30m
query.client.timeout=30m
query.stage-count-warning-threshold=100
query.max-stage-count=150
http-server.http.port=8080
discovery.uri=${DISCOVERY_URI}
Worker
coordinator=false
resource-manager-enabled=true
http-server.threads.max=500
sink.max-buffer-size=1GB
query.max-memory=14746MB
query.max-memory-per-node=6601364734B
query.max-total-memory-per-node=7921637681B
query.max-history=40
query.min-expire-age=30m
query.client.timeout=30m
query.stage-count-warning-threshold=100
query.max-stage-count=150
http-server.http.port=8080
discovery.uri=${DISCOVERY_URI}
DISCOVERY_URI is a load balancer URI behind which we have two instances of RM.
I can't find any documentation about this feature on the official website (https://prestodb.io/docs/current/). Is this feature released?
@swapsmagic @cliandy Any luck with the number-of-nodes issue mentioned above?
Hello, I would like to ask a question: in the distributed resource manager setup, can only one resource manager work at a time? I have deployed two resource managers, two coordinators, and multiple workers. Each coordinator sees only some of the workers, but the worker counts across the two coordinators add up to the total number of workers I deployed. Why is this? @tdcmeehan
@tdcmeehan @swapsmagic Is there a top-level PR or issue where I can find all the RM-related PRs/commits? https://github.com/prestodb/presto/pull/15479/commits contains some of them, but I guess it's not the complete list.
So far I have been able to locate the following -
Are there any more of them?
First of all, this splendid feature was based on Thrift communication: https://github.com/prestodb/presto/pull/13894
Recent PRs about this issue can be tracked through the commits by Tim Meehan and Swapnil Tailor:
@tdcmeehan @swapsmagic I have the same question. Is there a SPOF for the discovery service? And how do we configure multiple resource managers in a cluster?
In the disagg coordinator architecture, the discovery server runs on the resource manager. To avoid a SPOF you need to run multiple resource managers. To set them up, put them behind a VIP and provide that in discovery.uri as mentioned in the doc: discovery.uri=http://example.net:8080 (point to the resource manager host/VIP)
Will node state info be synchronized between multiple resource managers? Will tasks already running on the cluster be killed when the VIP switches?
I was trying to achieve high availability using multiple Presto coordinators (hot/warm deployment) so that if one coordinator crashes, the other will serve the queries. To implement this, I was planning to externalize the discovery service so that coordinators will reach out to the discovery server to get the active worker counts. In case of a coordinator failure, queries will be redirected to the standby coordinator via a load balancer. I am looking for a solution to externalize the discovery server from the coordinators. Does anyone have any clue about this? Is it possible to bring up a Presto container and convert that to a dedicated discovery server?
Presto Disaggregated Coordinator
Motivation
At Facebook, we find that coordinators have become bottlenecks to cluster scalability in a number of ways.
1) Once we scale to some number of nodes in the hundreds, we find that coordinators frequently have a high potential of being slowed down by an excessively high number of tasks.
2) In certain high-QPS use cases, we have found that workers can become starved of splits by excessive CPU being spent on task updates. This bottleneck in the coordinator is alleviated by reducing the concurrency, but this leaves the cluster under-utilized.
Furthermore, because of the de-facto constraint that there must be one coordinator per cluster, this limits the size of the worker pool to whatever number can handle the QPS from conditions 1 and 2. This means it's very difficult to deploy large, high-QPS clusters that can take on queries of moderate complexity (such as high stage count queries).
In short, tasks are observed to be the chief bottleneck in scaling the coordinator. As tasks are themselves split up by query, a natural approach to scale the coordinator is to send queries to multiple coordinators.
Architecture
Query Execution
Notes
Flow of execution
UI