Open xiacongling opened 2 years ago
Thanks for detailed writeup 🙌
https://github.com/trinodb/trino/issues/14441 merged, but we believe there needs to be more work, see https://github.com/trinodb/trino/pull/15997 & https://github.com/trinodb/trino/pull/14373#issuecomment-1422394658
Hi, community. We recently encountered a problem: Trino (386) cannot list or query tables in a Kerberized Kudu cluster. The error message is:
Our Trino cluster works well when it is newly started, but a few days later some of the workers and/or the coordinator start to raise such exceptions. A server restart fixes the problem temporarily, but the failures recur a few days later. The Kerberos ticket lifetime is configured to be 1 day, which means the authentication mechanism fails after a few ticket refreshes.
I use the product test environment `multinode-kerberos-kudu` to reproduce the error, using the following steps:
1. `testing/bin/ptl env up --environment multinode-kerberos-kudu --debug`
2. `show tables from kudu.default;`

If you are lucky enough (it cannot be reproduced every time, as I will explain later), you will see similar error messages.
The exception stack is as follows:
The error occurs when `KuduClient` starts connecting to a Kudu tserver. The Kudu connection negotiator checks ticket expiration via the following code:

An error is then raised later from the negotiation. It can be inferred that the TGT has already expired when the connection is being made. However, we have a delegated `KuduClient` in `io.trino.plugin.kudu.KerberizedKuduClient`, and before we get the client instance, Trino checks and refreshes the ticket via:

After stepping into the code, I found that the `Subject`'s credentials are not well managed by Trino. The reauthentication code is:

The problem is that the first TGT from `subject.privateCredentials` is not always the newest one! Look at the implementation of `getTicketGrantingTicket()`:

The `HashSet`'s iteration order is unpredictable, which is why the reproduction above is unstable. This mechanism leads to the following problems:

- `nextRefreshTime`, calculated from the ticket's `startTime` and `endTime`, will be a point in time in the past, so Trino will reauthenticate again for the next query;
- stale tickets accumulate in the `Subject` (`subject.privateCredentials`), and the situation gets worse and worse: it becomes harder and harder for a new ticket to come first in the `HashSet`'s iteration order, and more and more reauthentication requests are sent to the KDC.
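To make the two problems above concrete, here is a minimal sketch, not Trino's actual code, of picking the TGT with the latest `endTime` instead of whichever ticket the set iterator yields first. The class `NewestTgtSketch`, the helpers `newestTgt`, `refreshTimeMillis` and `fakeTgt`, the `krbtgt/` prefix check, and the 0.8 refresh fraction are all assumptions made for illustration. The fake tickets carry dummy bytes and exist only to exercise the selection logic.

```java
import javax.security.auth.Subject;
import javax.security.auth.kerberos.KerberosPrincipal;
import javax.security.auth.kerberos.KerberosTicket;
import java.util.Comparator;
import java.util.Date;
import java.util.Optional;

final class NewestTgtSketch {
    // Assumption: refresh once 80% of the ticket lifetime has elapsed.
    static final double REFRESH_WINDOW = 0.8;

    // Simplified TGT check (hypothetical): the server principal starts with "krbtgt/".
    static boolean isTgt(KerberosTicket ticket) {
        KerberosPrincipal server = ticket.getServer();
        return server != null && server.getName().startsWith("krbtgt/");
    }

    // Choose the TGT with the latest end time, independent of set iteration order.
    static Optional<KerberosTicket> newestTgt(Subject subject) {
        return subject.getPrivateCredentials(KerberosTicket.class).stream()
                .filter(ticket -> !ticket.isDestroyed())
                .filter(NewestTgtSketch::isTgt)
                .max(Comparator.comparing(KerberosTicket::getEndTime));
    }

    // Refresh point derived from startTime and endTime of the given ticket.
    static long refreshTimeMillis(KerberosTicket ticket) {
        long start = ticket.getStartTime().getTime();
        long end = ticket.getEndTime().getTime();
        return start + (long) ((end - start) * REFRESH_WINDOW);
    }

    // Builds a fake TGT with dummy encoding and key bytes, for demonstration only.
    static KerberosTicket fakeTgt(long startMillis, long endMillis) {
        return new KerberosTicket(
                new byte[] {0},
                new KerberosPrincipal("trino@EXAMPLE.COM"),
                new KerberosPrincipal("krbtgt/EXAMPLE.COM@EXAMPLE.COM"),
                new byte[] {0},
                1,
                null,
                new Date(startMillis),
                new Date(startMillis),
                new Date(endMillis),
                null,
                null);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long day = 24L * 60 * 60 * 1000;

        Subject subject = new Subject();
        // A stale ticket left over from an earlier login, plus a fresh one.
        subject.getPrivateCredentials().add(fakeTgt(now - 3 * day, now - 2 * day));
        subject.getPrivateCredentials().add(fakeTgt(now, now + day));

        KerberosTicket newest = newestTgt(subject).orElseThrow();
        System.out.println("newest TGT ends at: " + newest.getEndTime());
        System.out.println("refresh is in the future: " + (refreshTimeMillis(newest) > now));
    }
}
```

If the stale ticket were picked instead, `refreshTimeMillis` would return a time in the past, which matches the repeated reauthentication described in the first bullet. Selecting by `endTime` only addresses which ticket is read. It does not remove old tickets from the `Subject`, so the accumulation in the second bullet would need separate cleanup.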