trinodb / trino-gateway

https://trinodb.github.io/trino-gateway/
Apache License 2.0

Sticky routing based on next-uri fields #446

Open shk3 opened 3 weeks ago

shk3 commented 3 weeks ago

Hi folks,

Have we considered rewriting next-uri and info-uri directly from the responses in order to achieve query-level sticky routing?

The idea is kinda similar to Trino Proxy, where Trino Gateway proxies all requests. Then we bind the URLs in the following ways:

With this approach, for query-level sticky routing, we don't need to track which backend each query id gets assigned to. Instead, such assignment is retained on the client side.
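A minimal sketch of what such a rewrite could look like, assuming the gateway tags each URL with a backend name in a path prefix. The `/backend/<name>` scheme, the gateway base URL, and the backend addresses are all hypothetical and are not taken from the proposal or from Trino Gateway itself:

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical values: gateway base URL, backend names, and internal addresses.
GATEWAY_BASE = "https://trino-gw.mydomain"
BACKENDS = {
    "trino-1": "http://trino-1.internal:8080",
    "trino-2": "http://trino-2.internal:8080",
}
PREFIX = "/backend/"  # hypothetical path prefix carrying the backend name


def rewrite_uri(uri, backend):
    """Point a coordinator URI at the gateway and tag it with the backend name."""
    parts = urlsplit(uri)
    gw = urlsplit(GATEWAY_BASE)
    tagged_path = f"{PREFIX}{backend}{parts.path}"
    return urlunsplit((gw.scheme, gw.netloc, tagged_path, parts.query, parts.fragment))


def rewrite_response(body, backend):
    """Return a copy of a statement response with nextUri and infoUri rewritten."""
    out = dict(body)
    for key in ("nextUri", "infoUri"):
        if key in out:
            out[key] = rewrite_uri(out[key], backend)
    return out


def resolve_request(path):
    """Map a follow-up request path back to (backend base URL, original path)."""
    if not path.startswith(PREFIX):
        raise ValueError("no backend tag in path")
    backend, _, rest = path[len(PREFIX):].partition("/")
    return BACKENDS[backend], "/" + rest
```

Because the backend name travels inside every URL the client follows, the gateway can route each follow-up request from the path alone, without looking up the query ID anywhere.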

The caveat is that for the Trino UI, we would need to develop a way for users to search queries across all backends and to see a summary of all backends' stats.

Has this approach been considered in the past? We could eliminate the dependency on the databases / caches. If cross-regional networking could be a concern, we could even change the URLs with different domains to avoid inter-regional proxying.

I know Trino Gateway's architecture is pretty much set, so it's not necessarily something we have to do now, but mostly a discussion just in case later on it's needed.

George

xkrogen commented 3 weeks ago

We talked a bit about making the GW more of a "full proxy" in one of the recent GW dev syncs. It potentially unlocks a lot of new capabilities.

I like the idea you've proposed here of embedding this state into the client instead of storing it on the GW side. Tracking when a query has finished, and thus when its state can be cleaned up, is an annoying process. Right now we just have a periodic task, every 2 hours, to clear out query records older than a configurable time window (but that query may actually still be running!): https://github.com/trinodb/trino-gateway/blob/f50b09d5f81ccf5c72efce345a4235727c879c06/gateway-ha/src/main/java/io/trino/gateway/ha/persistence/JdbcConnectionManager.java#L72-L82

Moving it to the client is in line with Trino philosophy in general, IMO, like how we implement session properties and prepared statements on the client-side.

For the UI, I think as you said, we could do a fan-out that pulls query results from each backend ... That also has the benefit of not having two copies of the same data (query IDs / query history stored on both GW and Coordinator).
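The fan-out could look roughly like the sketch below. It assumes a caller-supplied `fetch(backend)` function that returns a list of query summaries as dicts with a `queryId` key. Both the function and the record shape are hypothetical placeholders, not an existing Trino Gateway API:

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out_queries(backends, fetch):
    """Collect query summaries from every backend and tag each with its source.

    `fetch` is a hypothetical callable: fetch(backend_name) -> list of dicts,
    each containing at least a "queryId" key.
    """
    def fetch_one(name):
        try:
            return [dict(q, backend=name) for q in fetch(name)]
        except Exception:
            # One unreachable backend should not fail the combined view.
            return []

    with ThreadPoolExecutor(max_workers=max(1, len(backends))) as pool:
        batches = list(pool.map(fetch_one, backends))

    merged = [q for batch in batches for q in batch]
    # Query IDs begin with a timestamp, so a reverse lexicographic sort
    # puts the most recently created queries first.
    merged.sort(key=lambda q: q["queryId"], reverse=True)
    return merged
```

Nothing is stored on the gateway in this sketch. Each call reads the current state from the coordinators, which is what avoids the second copy of the query history.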

Curious to hear what others think, but personally at first pass I like the idea. One thing we should consider is whether this would make it harder to implement other new functionality in the future.

shk3 commented 2 weeks ago

> One thing we should consider is whether this would make it harder to implement other new functionality in the future.

Yes! This is the exact concern I have too.

A while ago, we evaluated three options: Trino Gateway, Envoy with a query ID cache, and a thin layer that rewrites headers for next-uri in combination with some cloud load balancers. It's great to see that Trino Gateway is now officially part of the Trino project and is collaborating with Trino!

We could actually achieve this next-uri design today with the current Trino Gateway, if we tweak the X-Forwarded-* header rewriting logic in some way and put the Trino coordinators on their own domains (e.g. trino-gw.mydomain, trino-1.mydomain, trino-2.mydomain). In this way, Trino Gateway effectively acts as a query dispatcher, and the subsequent calls won't go through Trino Gateway. However, I'm worried about creating yet another snowflake use case for Trino Gateway. So, let's see whether this idea could fit into Trino Gateway's bigger design in any way without breaking any functionality Trino Gateway wants to support.
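As a rough illustration of the dispatcher idea, the sketch below builds the headers for the initial statement request so that the chosen coordinator's own domain is advertised as the forwarded host. It assumes the coordinator is configured to honor forwarded headers when it generates next-uri. The function name and the header handling are hypothetical and do not reflect Trino Gateway's current rewriting logic:

```python
def forwarded_headers_for(target_host, client_headers, scheme="https"):
    """Build headers for the initial request to the coordinator at target_host.

    Any X-Forwarded-* headers sent by the client are dropped, then the
    coordinator's own public domain is set as the forwarded host, on the
    assumption that the coordinator uses it to build next-uri.
    """
    headers = {
        name: value
        for name, value in client_headers.items()
        if not name.lower().startswith("x-forwarded-")
    }
    headers["X-Forwarded-Proto"] = scheme
    headers["X-Forwarded-Host"] = target_host
    return headers


# Example with the domain names mentioned above.
headers = forwarded_headers_for(
    "trino-1.mydomain",
    {"X-Trino-User": "alice", "X-Forwarded-Host": "trino-gw.mydomain"},
)
```

With headers like these, follow-up URLs would point at trino-1.mydomain rather than trino-gw.mydomain, so the client would talk to the coordinator directly after the first request.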

oneonestar commented 2 weeks ago

I have been thinking about routing using the query ID. When a Trino coordinator starts, it generates a random coordinatorId and embeds it as the last part of the query ID. (ref)

If we can keep track of the coordinatorId for each cluster, we can route it to the corresponding cluster without any additional info.

For example, all query IDs from the same coordinator share the same suffix:

Cluster A (tr8tg):
20240801_040236_47295_tr8tg
20240801_040244_44562_tr8tg
20240801_040245_41234_tr8tg

Cluster B (fejs4):
20240801_040301_24461_fejs4
20240801_040302_21235_fejs4
20240801_040303_25678_fejs4
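A minimal sketch of suffix-based routing, using the two example suffixes above. The mapping table and the backend names `cluster-a` and `cluster-b` are hypothetical. Since the coordinatorId is regenerated when a coordinator starts, a real mapping would need to be refreshed after each restart:

```python
import re

# Matches the query ID shape shown above: date, time, counter, coordinator suffix.
QUERY_ID = re.compile(r"\d{8}_\d{6}_\d+_([a-z0-9]+)")

# Hypothetical mapping from coordinatorId suffix to backend name.
COORDINATOR_TO_BACKEND = {
    "tr8tg": "cluster-a",
    "fejs4": "cluster-b",
}


def backend_for(path_or_query_id):
    """Return the backend for a query ID, or for a URL path that contains one.

    Returns None when no query ID is found or the suffix is unknown.
    """
    match = QUERY_ID.search(path_or_query_id)
    if match is None:
        return None
    return COORDINATOR_TO_BACKEND.get(match.group(1))


print(backend_for("20240801_040236_47295_tr8tg"))
print(backend_for("/v1/statement/executing/20240801_040301_24461_fejs4/abc/3"))
```

The only state the gateway keeps in this sketch is one entry per coordinator rather than one entry per query.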