dain opened 5 years ago
We hope this patch will also:
- Simplify the collaboration with resource orchestration platforms (K8S / Mesos)
- Allow 0 downtime rollout upgrade
- Allow load balancing between coordinators (for high concurrency queries where coordinator becomes the bottleneck)
Path 1: Multiple coordinators with Shared Queue
This approach requires the "Multiple coordinators in a cluster" implementation, which I think would be a difficult task. The cooperative scheduling of workers between coordinators could be complicated. I'd prefer taking Path 3 first and then moving toward the outcome of Path 1.
Path 2: Highly Available Coordinators
I don't like this approach. The main issue is that the client has to decide which coordinator to connect to, or we have to add a load balancer in front of the coordinators. Also, this doesn't allow a dispatcher-only mode, and we must always have two or more coordinators online.
Path 3: Multiple Clusters with Shared Queue
The multiple-clusters model makes zero-downtime rolling upgrades easy. Its most important advantage, I think, is that it splits the big problem (highly available Presto) into many smaller problems. Once the dispatcher module is implemented, different people can work on different follow-up projects: the load-balancing algorithm, multiple dispatchers, multiple coordinators in a cluster, etc.
I'll vote for Path 3 for its extensibility and step-by-step development approach.
@oneonestar:
- Simplify the collaboration with resource orchestration platforms (K8S / Mesos)
This isn't one of my goals, but if this helps, great.
- Allow 0 downtime rollout upgrade
All three of the designs allow for this. For multi-cluster it is straightforward: shut down one cluster, upgrade it, and bring it back online. For multi-coordinator, it is similar. The two key things to know are that Presto coordinators and workers must run the same version, and that there is a config option for a minimum number of workers. So you can upgrade a single coordinator, and the coordinator will become visible to the dispatchers, but it won't accept queries until it has enough workers it can use. Then you simply upgrade workers one at a time. When the number of upgraded workers crosses the threshold, the coordinator starts accepting work. The full solution is a bit more complex.
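In current Trino, the "minimum number of workers" gate mentioned above is exposed through configuration. A sketch (property names per the Trino docs; the values here are illustrative, not recommendations):

```properties
# Coordinator will not start queries until 15 workers have registered,
# waiting at most 5 minutes before failing queued queries.
query-manager.required-workers=15
query-manager.required-workers-max-wait=5m
```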
- Allow load balancing between coordinators (for high concurrency queries where coordinator becomes the bottleneck)
Only solutions with "Multiple coordinators" would do this.
Path 1: Multiple coordinators with Shared Queue. This approach requires the "Multiple coordinators in a cluster" implementation, which I think would be a difficult task. The cooperative scheduling of workers between coordinators could be complicated. I'd prefer taking Path 3 first and then moving toward the outcome of Path 1.
From the first version of Presto, the system has always done "cooperative scheduling of workers between coordinators", so this isn't too difficult. Actually, there was a period at FB where we made all workers coordinators by accident, and everything just worked.
The complex part is really around decisions that must be globally made. Specifically, which query to promote to the reserved pool, or which query to kill in low memory, must be coordinated to prevent multiple coordinators making conflicting decisions at the same time.
Path 2: Highly Available Coordinators. I don't like this approach. The main issue is that the client has to decide which coordinator to connect to, or we have to add a load balancer in front of the coordinators. Also, this doesn't allow a dispatcher-only mode, and we must always have two or more coordinators online.
In this design, the client contacts a dispatcher service, which happens to be running in the same process as the coordinators. This is the "Multiple dispatchers" project mentioned above, and it requires the "Proxy for dumb load balancer" work item. Anything that offers HA for the front end will require some kind of dumb load balancer, but even DNS should work.
In this design, you could scale down all the way to a single coordinator (containing a dispatcher). The dispatcher service would accept and queue queries, but would not start them until enough workers showed up. If the dispatcher failed, another one would need to be started before the clients timed out; assuming it was, it would rebuild the queues and the clients would continue.
Path 3: Multiple Clusters with Shared Queue. The multiple-clusters model makes zero-downtime rolling upgrades easy. Its most important advantage, I think, is that it splits the big problem (highly available Presto) into many smaller problems. Once the dispatcher module is implemented, different people can work on different follow-up projects: the load-balancing algorithm, multiple dispatchers, multiple coordinators in a cluster, etc.
I think those points are true of all of the above designs.
My main concern with approach 3 is that it might only be good for really large users that have multiple clusters and are OK with an additional tier of dispatcher machines. Additionally, I don't think it actually makes Presto more highly available, which is what people have been asking for. Anyway, I am happy to work on whatever the community wants done first.
Something that could be considered for the first iteration of HA for the coordinators is a system similar to what HashiCorp's Vault uses, where only one coordinator (master) node is ever active. Leader election determines which coordinator is the leader; that node handles all of the coordinator processes, and its health-check endpoint returns 200 so that a VIP knows it should be active.
The other nodes return a different HTTP code (most likely 503 Service Unavailable makes sense here) so that those nodes are taken out of rotation in any VIPs they sit behind. Alternatively, they could trigger a 307 Temporary Redirect to the leader coordinator.
That way the overall code for how a coordinator works stays the same; it just gets wrapped in a leader election to determine who should actually service the request.
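The scheme above can be sketched in a few lines. This is a toy model, not Trino code: the election here is a deterministic placeholder (lowest node id wins), whereas a real deployment would use a lock in a service such as ZooKeeper or etcd. All names are hypothetical.

```python
class Coordinator:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.is_leader = False

    def health_status(self) -> int:
        # 200 keeps this node in the VIP; 503 takes it out of rotation.
        return 200 if self.is_leader else 503


def elect_leader(coordinators):
    """Toy election: lowest node id wins; a real system uses a lock service."""
    leader = min(coordinators, key=lambda c: c.node_id)
    for c in coordinators:
        c.is_leader = (c is leader)
    return leader
```

Failover is then just a re-election after the dead node disappears; the VIP's periodic health checks pick up the new 200/503 answers without any coordinator-side changes.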
Is this feature already planned? We expect to apply Presto to ETL, which requires a stable cluster and task queue.
@wpf375516041 Facebook uses Presto extensively for ETL without this feature. This is accomplished by running queries using an external scheduler (similar to Apache Airflow) that handles retry on failure. If the coordinator restarts (or is moved to a different machine), the queue is lost, which is bad, but the queries will still run because the scheduler retries them.
This feature helps, but you still want external retry anyway: queries fail for various reasons, the machine submitting the query crashes, etc.
@electrum Thank you very much for your reply. For long-running tasks, how can I better handle a queue loss? Is it possible to provide a shared task queue? I think this could be a very big improvement, since we wouldn't have to repeat the parts of the plan that have already been executed.
The “shared queue” is only for queuing queries before they start executing. Once they start executing, they run on a single coordinator, exactly as they do today.
For long running queries, it is important to have reliable hardware. Failures should be rare. Use enough machines and concurrency so that the wall time is no more than a few hours. If you have queries longer than this, try to break them up into shorter queries.
Is this feature already planned?
Has there been any more progress made on high availability?
Hi All
Is there any progress on this? We are looking to use Trino, but not having a highly available coordinator setup is really holding us back.
Do you have a solution? We have the same needs @lukasCoppens
We found a way to do it. But a native implementation would be far better. Here is how we did it:
On the coordinator nodes we set discovery.uri=http://localhost:8079. This way, all the coordinators only see themselves.
Then we set up a DNS record with health checks that load-balances over the coordinator nodes. We also run an HAProxy instance on every coordinator node:
backend trino
description "Trino"
balance first
server coordinator1 xxx.xxx.xxx.xxx:8079 check
server coordinator2 xxx.xxx.xxx.xxx:8079 check backup
server coordinator3 xxx.xxx.xxx.xxx:8079 check backup
For the worker nodes, we set discovery.uri=http://ha-dns.example:5555, where 5555 is the HAProxy port.
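For completeness, the backend shown above needs a matching frontend bound to the HAProxy port. A minimal sketch, assuming port 5555 from the worker config and Trino's /v1/info endpoint for HTTP health checks (both are assumptions on my part, not part of the original setup):

```haproxy
frontend trino_front
    bind *:5555
    mode http
    default_backend trino

# Optionally, in the backend, use an HTTP health check against the
# coordinator's info endpoint instead of a bare TCP check:
#    option httpchk GET /v1/info
```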
We use the DNS record method because we already had one configured for other services on these nodes, so it seemed logical to follow the same setup. If you do not have health-checked DNS available, you could also run HAProxy on the worker nodes with the same active-backup config there. Then you configure the worker nodes to connect to the coordinator using localhost and the HAProxy port.
This setup will cause queries to fail when the coordinator goes down, but at least you will be able to restart it on another coordinator. It's not true HA, but it's a start.
This is a good approach. However, if the coordinator fails, running queries lose their status. Moreover, it's not certain that the Web UI will keep working properly.
Hi All, Is this feature available in any version or planned in any upcoming versions?
Is there any progress on this issue?
Hi, is this feature planned for any upcoming version? It would be nice to have.
Hi @lukasCoppens, can you share the implementation details in a separate git repo so we can refer to the implementation?
Any plan to support multiple coordinators in a cluster, to reduce the impact of a crashed coordinator?
For now, the Starburst Enterprise version supports coordinator HA:
https://docs.starburst.io/360-e/aws/operation.html#coordinator-high-availability
PrestoDB uses Thrift communication and a new service called ResourceManager to support this: https://github.com/prestodb/presto/issues/15453
And HUAWEI openLooKeng uses Hazelcast (https://github.com/hazelcast/hazelcast) as an external state store (similar to the ResourceManager in PrestoDB) to support coordinator HA: https://github.com/openlookeng/hetu-core/blob/master/hetu-docs/en/admin/state-store.md
What about launching two coordinators in a Kubernetes cluster (using the Trino Helm chart)?
Are there any updates on launching two coordinators in a Kubernetes cluster using Trino Helm Chart as described above?
For anyone who has HA and/or graceful-shutdown issues while maintaining several Trino clusters, you can set up Trino Gateway, a load balancer/proxy/routing gateway that addresses these issues. Please check out the Trino blog post and the Trino Summit 2023 presentation (coming soon) to learn more about it.
Multiple coordinators in a cluster
I believe it's really important to have a better way of updating configurations. Currently in K8s this usually works via mounted configuration files, but updating those requires restarting coordinator pods, which currently kills all ongoing queries.
@PuszekSE adding/removing nodes in the cluster is already supported. Modifying configuration in flight is a totally different story, and it won't be supported anytime soon. We only plan to support dynamic reloading of catalog definitions (with support for pluggable secret providers).
Goals
At the highest level, our goal is to create highly available Trino setups. More specifically, we would like to accomplish the following:
Work Items
Deployment Architectures
There are multiple ways to combine the above work items into different setups to achieve different goals. The following sections describe the most common setups:
Multiple coordinators with Shared Queue
In this setup there is a single dispatcher containing the shared queue, and multiple coordinators in the cluster. If the dispatcher crashes, all queued queries fail, but executing queries will continue. If a coordinator crashes, all queries managed by that coordinator fail, but other queued queries and queries managed by other coordinators will continue.
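The failure semantics described above can be illustrated with a toy model: losing the dispatcher fails only queued queries, while losing one coordinator fails only the queries it manages. All names here are illustrative, not Trino APIs.

```python
class CoordinatorNode:
    def __init__(self, name: str):
        self.name = name
        self.running = []  # queries this coordinator manages

class Dispatcher:
    def __init__(self, coordinators):
        self.queue = []  # shared queue of queries waiting to start
        self.coordinators = coordinators

    def submit(self, query):
        self.queue.append(query)

    def dispatch_all(self):
        # round-robin queued queries onto coordinators
        for i, query in enumerate(self.queue):
            coord = self.coordinators[i % len(self.coordinators)]
            coord.running.append(query)
        self.queue.clear()

def crash_coordinator(coord):
    # only the queries managed by this coordinator fail
    failed = list(coord.running)
    coord.running.clear()
    return failed
```

A dispatcher crash would analogously drop only `queue`, leaving every coordinator's `running` list untouched.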
This setup requires the following:
Highly Available Coordinators
In this setup there are multiple coordinators and each contains a dispatcher. Queued queries are durable, but executing queries can fail.
This setup requires the following:
Multiple Clusters with Shared Queue
In this setup there is a dispatcher tier that manages a shared queue for multiple clusters. In the simplest form, there is a single dispatcher for all clusters and a single coordinator for each cluster. As above, if either fails, it only fails the queries being managed by that instance. This project has multiple follow-up projects: making the dispatcher HA, and allowing multiple coordinators in a cluster.
This setup requires the following: