This is the relevant chunk of code. It looks to me like it only chooses a different address after one has failed, and even then the maximum backoff time is 60 seconds before it goes back to using the original address. Note that the Backoff class referenced is the inner class at the bottom of the file, not the standalone Backoff class in trino-main.
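For reference, here is a minimal sketch of the behavior described above. This is not the actual StaticMetastoreLocator code; the class and field names are made up, and the real inner Backoff class grows the delay rather than always using the 60-second cap.

```java
// Sketch only: stick to the first configured address, fall back after a failure,
// and skip a failed address for at most 60 seconds before using it again.
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class FallbackAddressSelector
{
    private static final Duration MAX_BACKOFF = Duration.ofSeconds(60);

    private final List<String> addresses;   // configured metastore URIs, in order
    private final Instant[] backoffUntil;   // per-address "do not use before" timestamps

    FallbackAddressSelector(List<String> addresses)
    {
        this.addresses = List.copyOf(addresses);
        this.backoffUntil = new Instant[addresses.size()];
        Arrays.fill(backoffUntil, Instant.MIN);
    }

    // Always returns the first address that is not currently backed off,
    // so the primary address is reused as soon as its backoff expires.
    synchronized Optional<String> next()
    {
        Instant now = Instant.now();
        for (int i = 0; i < addresses.size(); i++) {
            if (!now.isBefore(backoffUntil[i])) {
                return Optional.of(addresses.get(i));
            }
        }
        return Optional.empty();
    }

    // Called when a connection attempt to the given address fails;
    // the address is skipped until the backoff expires (capped here at 60s).
    synchronized void recordFailure(String address)
    {
        int i = addresses.indexOf(address);
        if (i >= 0) {
            backoffUntil[i] = Instant.now().plus(MAX_BACKOFF);
        }
    }
}
```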
cc @losipiuk
@mdschwarz You are right. It chooses the other address only when the first one has failed. As a workaround, we added a proxy layer in front of our metastore services and started pointing hive.metastore.uri at the proxy.
The logic to "stick" to a working metastore address is done on purpose. It is done this way to avoid failing a Trino query in a situation where we have multiple metastore addresses listed and many of them are not actually working. If we went round robin in such a case, it is very possible to exhaust the retry limit before getting to a working metastore instance; for example, with five addresses of which only the last is healthy and a retry limit of three, round robin would fail the query without ever reaching the healthy instance. We've seen that happen in practice.
If you want load balancing across multiple HMS servers, the better practice is probably to use only one URI in the metastore config in the catalog file, and have that URI point to an actual load balancer that distributes the load evenly across identical HMS instances set up as a high-availability HMS.
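For example, assuming a load balancer reachable at hms-lb.example.com (a hypothetical hostname) in front of the HMS instances, the Hive catalog file would list just that single URI:

```properties
# Hypothetical hostname; point this at your actual load balancer
connector.name=hive
hive.metastore.uri=thrift://hms-lb.example.com:9083
```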
@mosabua that's exactly the opposite of what I suggested on Slack recently. Yes -- it would work. But no -- users shouldn't need to put a load balancer in front of HMS if Trino supports load balancing internally.
(And if we want to go down the "it's recommended to use a load balancer" route, we should deprecate/remove support for multiple metastore URIs.)
Personally I think the load balancing aspect should be handled outside of Trino and we should remove support for multiple URIs. And we do know that using an external load balancer in front of an HMS works well and simplifies things for Trino.
But I don't know how Trino treats such external dependencies in general either. It seems inconsistent to some degree. E.g., for JDBC connections we typically connect to only one database and leave clustering and the like to the database. On the other hand, for Elasticsearch we do distribute the load, and we do it for reading from Hive. Either approach can make sense.
Another example is probably external systems like LDAP for auth. I don't think we have internal load balancing across multiple external servers there either.
@mdschwarz May I ask, what should the value of hive.metastore.service.principal be in this setup? And do I need to do anything between Hive and Kerberos?
I set it to hive/_HOST@REALM and hive/proxy_dns_host@REALM, and both will report an error: io.trino.spi.TrinoException: Failed connecting to Hive metastore: [proxy_dns_host:9083] at io.trino.plugin.hive.metastore.thrift.TokenFetchingMetastoreClientFactory.loadDelegationToken(TokenFetchingMetastoreClientFactory.java:113) ... Caused by: org.apache.thrift.transport.TTransportException: proxy_dns_host:9083: Peer indicated failure: GSS initiate failed
"It's not exactly round robin, as far as I could tell from the code and from empirical observation. The implementation in StaticMetastoreLocator looks like it only moves on to a secondary uri on errors, and only temporarily. In practice we've been seeing only two get used, even under stress with sporadic query failures due to metatore SocketTimeoutExceptions."