dain opened 5 years ago
We hope this patch will also:
- Simplify the collaboration with resource orchestration platforms (K8S / Mesos)
- Allow 0 downtime rollout upgrade
- Allow load balancing between coordinators (for high concurrency queries where coordinator becomes the bottleneck)
Path 1: Multiple coordinators with Shared Queue
This approach requires the "Multiple coordinators in a cluster" implementation, which I think would be a difficult task. The cooperative scheduling of workers between coordinators could be complicated. I'd prefer taking Path 3 first and then moving toward the outcome of Path 1.
Path 2: Highly Available Coordinators
I don't like this approach. The main issue is that the client has to decide which coordinator to connect to, or we have to add a load balancer in front of the coordinators. Also, this doesn't allow a dispatcher-only mode, and we must always have two or more coordinators online.
Path 3: Multiple Clusters with Shared Queue
The multiple-clusters model makes zero-downtime rolling upgrades easy. Its most important advantage, I think, is that it splits the big problem (highly available Presto) into many smaller problems. Once the dispatcher module is implemented, different people can work on different follow-up projects: the load-balancing algorithm, multiple dispatchers, multiple coordinators in a cluster, etc.
I'll vote for Path 3 for its extensibility and step-by-step development approach.
@oneonestar:
- Simplify the collaboration with resource orchestration platforms (K8S / Mesos)
This isn't one of my goals, but if this helps, great.
- Allow 0 downtime rollout upgrade
All three of the designs allow for this. For multi-cluster it is straightforward: shut down one cluster, upgrade it, and bring it back online. For multi-coordinator, it is similar. The two key things to know are that Presto coordinators and workers must run the same version, and that there is a config option for a minimum number of workers. So you can upgrade a single coordinator, and the coordinator will become visible to the dispatchers, but it won't accept queries until it has enough workers it can use. Then you simply upgrade workers one at a time. When the number of upgraded workers crosses the threshold, the coordinator starts accepting work. The full solution is a bit more complex.
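In current Trino, the "minimum number of workers" gate mentioned above is exposed through configuration. A sketch (property names per the Trino docs; the values here are illustrative, not recommendations):

```properties
# Coordinator will not start queries until 15 workers have registered,
# waiting at most 5 minutes before failing queued queries.
query-manager.required-workers=15
query-manager.required-workers-max-wait=5m
```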
- Allow load balancing between coordinators (for high concurrency queries where coordinator becomes the bottleneck)
Only solutions with "Multiple coordinators" would do this.
Path 1: Multiple coordinators with Shared Queue. This approach requires the "Multiple coordinators in a cluster" implementation, which I think would be a difficult task. The cooperative scheduling of workers between coordinators could be complicated. I'd prefer taking Path 3 first and then moving toward the outcome of Path 1.
From the first version of Presto, the system has always done "cooperative scheduling of workers between coordinators", so this isn't too difficult. Actually, there was a period at FB where we made all workers coordinators by accident, and everything just worked.
The complex part is really around decisions that must be globally made. Specifically, which query to promote to the reserved pool, or which query to kill in low memory, must be coordinated to prevent multiple coordinators making conflicting decisions at the same time.
Path 2: Highly Available Coordinators. I don't like this approach. The main issue is that the client has to decide which coordinator to connect to, or we have to add a load balancer in front of the coordinators. Also, this doesn't allow a dispatcher-only mode, and we must always have two or more coordinators online.
In this design, the client contacts a dispatcher service, which happens to be running in the same process as the coordinators. This is the "Multiple dispatchers" project mentioned above, and it requires the "Proxy for dumb load balancer" work item. Anything that offers HA for the front end will require some kind of dumb load balancer, but even DNS should work.
In this design, you could scale down all the way to a single coordinator (containing a dispatcher). The dispatcher service would accept and queue queries, but would not start them until enough workers showed up. If the dispatcher failed, another one would need to be started before the clients timed out; assuming it was, it would rebuild the queues and the clients would continue.
Path 3: Multiple Clusters with Shared Queue. The multiple-clusters model makes zero-downtime rolling upgrades easy. Its most important advantage, I think, is that it splits the big problem (highly available Presto) into many smaller problems. Once the dispatcher module is implemented, different people can work on different follow-up projects: the load-balancing algorithm, multiple dispatchers, multiple coordinators in a cluster, etc.
I think those points are true of all of the above designs.
My main concern with approach 3 is that it might only be good for really large users that have multiple clusters and are OK with an additional tier of dispatcher machines. Additionally, I don't think it actually makes Presto more highly available, which is what people have been asking for. Anyway, I am happy to work on whatever the community wants done first.
Something that could be considered for the first iteration of HA for the coordinators is a system similar to what HashiCorp's Vault uses, where only one coordinator (master) node is ever active. Leader election determines which coordinator is the leader; that node handles all of the coordinator processes, and its health-check endpoint returns 200 so that a VIP knows it should be active.
The other nodes return a different HTTP code (most likely 503 Service Unavailable makes sense here) so that those nodes are taken out of rotation in any VIPs they sit behind. Alternatively, they could trigger a 307 Temporary Redirect to the leader coordinator.
That way the overall code for how a coordinator works stays the same; it just gets wrapped in a leader election to determine who should actually service the request.
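The scheme above can be sketched in a few lines. This is a toy model, not Trino code: the election here is a deterministic placeholder (lowest node id wins), whereas a real deployment would use a lock in a service such as ZooKeeper or etcd. All names are hypothetical.

```python
class Coordinator:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.is_leader = False

    def health_status(self) -> int:
        # 200 keeps this node in the VIP; 503 takes it out of rotation.
        return 200 if self.is_leader else 503


def elect_leader(coordinators):
    """Toy election: lowest node id wins; a real system uses a lock service."""
    leader = min(coordinators, key=lambda c: c.node_id)
    for c in coordinators:
        c.is_leader = (c is leader)
    return leader
```

Failover is then just a re-election after the dead node disappears; the VIP's periodic health checks pick up the new 200/503 answers without any coordinator-side changes.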
Is this feature already planned? We expect to apply Presto to ETL, which requires a stable cluster and task queue.
@wpf375516041 Facebook uses Presto extensively for ETL without this feature. This is accomplished by running queries using an external scheduler (similar to Apache Airflow) that handles retry on failure. If the coordinator restarts (or is moved to a different machine), the queue is lost, which is bad, but the queries will still run because the scheduler retries them.
This feature helps, but you still want external retry anyway: queries fail for various reasons, the machine submitting the query crashes, etc.
@electrum Thank you very much for your reply. For long-running tasks, how can I better handle a queue loss? Is it possible to provide a shared task queue? I think this could be a very big improvement, since we wouldn't have to repeat the parts of the plan that have already been executed.
The “shared queue” is only for queuing queries before they start executing. Once they start executing, they run on a single coordinator, exactly as they do today.
For long running queries, it is important to have reliable hardware. Failures should be rare. Use enough machines and concurrency so that the wall time is no more than a few hours. If you have queries longer than this, try to break them up into shorter queries.
Is this feature already planned?
Has there been any more progress made on high availability?
Hi All
Is there any progress on this? We are looking to use Trino, but not having a highly available coordinator setup is really holding us back.
Do you have a solution? We have the same needs @lukasCoppens
We found a way to do it. But a native implementation would be far better. Here is how we did it:
On the coordinator nodes we set discovery.uri=http://localhost:8079. This way, all the coordinators only see themselves.
Then we set up a DNS record with health checks that load-balances over the coordinator nodes. We also run an HAProxy instance on every coordinator node:
backend trino
description "Trino"
balance first
server coordinator1 xxx.xxx.xxx.xxx:8079 check
server coordinator2 xxx.xxx.xxx.xxx:8079 check backup
server coordinator3 xxx.xxx.xxx.xxx:8079 check backup
For the worker nodes, we set discovery.uri=http://ha-dns.example:5555, where 5555 is the HAProxy port.
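For completeness, the backend shown above needs a matching frontend bound to the HAProxy port. A minimal sketch, assuming port 5555 from the worker config and Trino's /v1/info endpoint for HTTP health checks (both are assumptions on my part, not part of the original setup):

```haproxy
frontend trino_front
    bind *:5555
    mode http
    default_backend trino

# Optionally, in the backend, use an HTTP health check against the
# coordinator's info endpoint instead of a bare TCP check:
#    option httpchk GET /v1/info
```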
We use the DNS record method because we already had one configured for other services on these nodes, so it seemed logical to follow the same setup. If you do not have health-checked DNS available, you could also run HAProxy on the worker nodes with the same active-backup config there. Then you configure the worker nodes to connect to the coordinator using localhost and the HAProxy port.
This setup will cause queries to fail when the coordinator goes down, but at least you will be able to restart it on another coordinator. It's not true HA, but it's a start.
This is a good approach. However, if the coordinator fails, running queries lose their status. Moreover, it's not certain that the Web UI will keep working properly.
Hi All, Is this feature available in any version or planned in any upcoming versions?
Is there any progress on this issue?
Hi, is this feature planned for any upcoming version? It would be nice to have.
Hi @lukasCoppens, can you share the implementation details in a separate git repo so we can refer to the implementation?
Any plan to support multiple coordinators in a cluster, to reduce the impact of a crashed coordinator?
For now, the Starburst Enterprise version supports coordinator HA:
https://docs.starburst.io/360-e/aws/operation.html#coordinator-high-availability
PrestoDB uses Thrift communication and a new service called ResourceManager to support this: https://github.com/prestodb/presto/issues/15453
And HUAWEI openLooKeng uses Hazelcast (https://github.com/hazelcast/hazelcast) as an external state store (similar to the ResourceManager in PrestoDB) to support coordinator HA: https://github.com/openlookeng/hetu-core/blob/master/hetu-docs/en/admin/state-store.md
What about launching two coordinators in a Kubernetes cluster (using the Trino Helm chart)?
Are there any updates on launching two coordinators in a Kubernetes cluster using Trino Helm Chart as described above?
For anyone who has HA and/or graceful-shutdown issues while maintaining several Trino clusters, you can set up Trino Gateway, a load balancer/proxy/routing gateway that addresses these issues. Please check out the Trino blog post and the Trino Summit 2023 presentation (coming soon) to learn more about it.
Multiple coordinators in a cluster
I believe it's really important to have a better way of updating configurations. Currently in K8s this usually works via mounted configuration files, but updating those requires restarting coordinator pods, which currently kills all ongoing queries.
@PuszekSE adding/removing nodes in the cluster is already supported. Modifying configuration in flight is a totally different story, and it won't be supported anytime soon. We only plan to support dynamic reloading of catalog definitions (with support for pluggable secret providers).
Goals
At the highest level, our goal is to create highly available Trino setups. More specifically, we would like to accomplish the following:
Work Items
Deployment Architectures
There are multiple ways to combine the above work items into different setups to achieve different goals. The following sections describe the most common setups:
Multiple coordinators with Shared Queue
In this setup there is a single dispatcher containing the shared queue, and multiple coordinators in the cluster. If the dispatcher crashes, all queued queries fail, but executing queries will continue. If a coordinator crashes, all queries managed by that coordinator fail, but other queued queries and queries managed by other coordinators will continue.
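The failure semantics described above can be illustrated with a toy model: losing the dispatcher fails only queued queries, while losing one coordinator fails only the queries it manages. All names here are illustrative, not Trino APIs.

```python
class CoordinatorNode:
    def __init__(self, name: str):
        self.name = name
        self.running = []  # queries this coordinator manages

class Dispatcher:
    def __init__(self, coordinators):
        self.queue = []  # shared queue of queries waiting to start
        self.coordinators = coordinators

    def submit(self, query):
        self.queue.append(query)

    def dispatch_all(self):
        # round-robin queued queries onto coordinators
        for i, query in enumerate(self.queue):
            coord = self.coordinators[i % len(self.coordinators)]
            coord.running.append(query)
        self.queue.clear()

def crash_coordinator(coord):
    # only the queries managed by this coordinator fail
    failed = list(coord.running)
    coord.running.clear()
    return failed
```

A dispatcher crash would analogously drop only `queue`, leaving every coordinator's `running` list untouched.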
This setup requires the following:
Highly Available Coordinators
In this setup there are multiple coordinators and each contains a dispatcher. Queued queries are durable, but executing queries can fail.
This setup requires the following:
Multiple Clusters with Shared Queue
In this setup there is a dispatcher tier that manages a shared queue for multiple clusters. In the simplest form, there is a single dispatcher for all clusters and a single coordinator for each cluster. As above, if either fails, it only fails the queries being managed by that instance. This project has multiple follow-up projects: making the dispatcher HA, and allowing multiple coordinators in a cluster.
This setup requires the following: