This is the relevant chunk of code. It looks to me like it only chooses a different address after one has failed, and even then the maximum backoff time is 60 seconds before it goes back to using the original address. Note that the Backoff class referenced is the inner class at the bottom of the file, not the standalone Backoff class in trino-main.
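For reference, here is a minimal sketch of the behavior described above. This is not the actual StaticMetastoreLocator code; the class and field names are made up, and the real inner Backoff class grows the delay rather than always using the 60-second cap.

```java
// Sketch only: stick to the first configured address, fall back after a failure,
// and skip a failed address for at most 60 seconds before using it again.
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class FallbackAddressSelector
{
    private static final Duration MAX_BACKOFF = Duration.ofSeconds(60);

    private final List<String> addresses;   // configured metastore URIs, in order
    private final Instant[] backoffUntil;   // per-address "do not use before" timestamps

    FallbackAddressSelector(List<String> addresses)
    {
        this.addresses = List.copyOf(addresses);
        this.backoffUntil = new Instant[addresses.size()];
        Arrays.fill(backoffUntil, Instant.MIN);
    }

    // Always returns the first address that is not currently backed off,
    // so the primary address is reused as soon as its backoff expires.
    synchronized Optional<String> next()
    {
        Instant now = Instant.now();
        for (int i = 0; i < addresses.size(); i++) {
            if (!now.isBefore(backoffUntil[i])) {
                return Optional.of(addresses.get(i));
            }
        }
        return Optional.empty();
    }

    // Called when a connection attempt to the given address fails;
    // the address is skipped until the backoff expires (capped here at 60s).
    synchronized void recordFailure(String address)
    {
        int i = addresses.indexOf(address);
        if (i >= 0) {
            backoffUntil[i] = Instant.now().plus(MAX_BACKOFF);
        }
    }
}
```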
cc @losipiuk
@mdschwarz You are right. It chooses the other address only when the first one has failed. As a workaround, we added a proxy layer in front of our metastore services and started pointing hive.metastore.uri at the proxy.
The logic to "stick" to a working metastore address is done on purpose. It is done this way to avoid failing a Trino query in a situation where we have multiple metastore addresses listed and many of them are not actually working. If we went round robin in such a case, it is very possible to exhaust the retry limit before getting to a working metastore instance; for example, with five addresses of which only the last is healthy and a retry limit of three, round robin would fail the query without ever reaching the healthy instance. We've seen that happen in practice.
If you want load balancing across multiple HMS servers, the better practice is probably to use only one URI in the metastore config in the catalog file, and have that URI point to an actual load balancer that distributes the load evenly across identical HMS instances set up as a high-availability HMS.
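For example, assuming a load balancer reachable at hms-lb.example.com (a hypothetical hostname) in front of the HMS instances, the Hive catalog file would list just that single URI:

```properties
# Hypothetical hostname; point this at your actual load balancer
connector.name=hive
hive.metastore.uri=thrift://hms-lb.example.com:9083
```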
@mosabua that's exactly the opposite of what I suggested on Slack recently. Yes -- it would work. But no -- users shouldn't need to put a load balancer in front of HMS if Trino supports load balancing internally.
(And if we want to go down the "it's recommended to use a load balancer" route, we should deprecate/remove support for multiple metastore URIs.)
Personally I think the load balancing aspect should be handled outside of Trino and we should remove support for multiple URIs. And we do know that using an external load balancer in front of an HMS works well and simplifies things for Trino.
But I don't know how Trino treats such external dependencies in general either. It seems inconsistent to some degree. E.g., for JDBC connections we typically connect to only one database and leave clustering and the like to the database. On the other hand, for Elasticsearch we do distribute the load, and we do it for reading from Hive. Either approach can make sense.
Another example is probably external systems like LDAP for auth. I don't think we have internal load balancing across multiple external servers there either.
@mdschwarz May I ask, what should the value of hive.metastore.service.principal be in this setup? And do I need to do anything between Hive and Kerberos?
I set it to hive/_HOST@REALM and hive/proxy_dns_host@REALM, and both will report an error: io.trino.spi.TrinoException: Failed connecting to Hive metastore: [proxy_dns_host:9083] at io.trino.plugin.hive.metastore.thrift.TokenFetchingMetastoreClientFactory.loadDelegationToken(TokenFetchingMetastoreClientFactory.java:113) ... Caused by: org.apache.thrift.transport.TTransportException: proxy_dns_host:9083: Peer indicated failure: GSS initiate failed
"It's not exactly round robin, as far as I could tell from the code and from empirical observation. The implementation in StaticMetastoreLocator looks like it only moves on to a secondary uri on errors, and only temporarily. In practice we've been seeing only two get used, even under stress with sporadic query failures due to metatore SocketTimeoutExceptions."