cdw9bf opened this issue 5 years ago
This sounds like a good feature. There are some implementation difficulties around how `SemiTransactionalHiveMetastore` works. I need to think about this more before writing a longer response.

Another implementation issue is how to make the credentials configurable per request in the AWS client without needing to create a new client (with its associated connection pools and executors) for each request. Looking at the code, it calls the credentials provider for each request, so (unfortunately) the best way might be to use a thread local to wrap the calls with the correct credentials.

My impression from reading through the code is that this wouldn't be a trivial change.
I'll admit I'm pretty new to Presto: would it be possible to pass in the configuration details of Glue in `HiveConnectorFactory`, where it initializes the connector (here)? Since it loads it in a thread context, does each user (or query?) have their own version of the object, where the specific credentials could be specified?
The backend implementation of the metastore is bound in `HiveMetastoreModule` based on configuration. For Glue, it will bind `GlueHiveMetastore` for the `HiveMetastore` interface. This backend instance gets injected into `HiveMetadataFactory`, which is used to create instances of `SemiTransactionalHiveMetastore` for each transaction, triggered by `HiveConnector.beginTransaction()`.
I'm thinking we can change to a model where the `HiveMetastore` instance is created once per transaction. Instead of binding the backend `HiveMetastore` instance, we create a new interface, `HiveMetastoreFactory`, and bind an implementation of that, e.g., `GlueHiveMetastoreFactory`. That factory would hold the global shared objects like the heavyweight AWS client.

One problem is that `Connector.beginTransaction()` does not take `ConnectorSession`, but that's simple to add; it was probably an oversight when we added transaction support. In order to ensure that the username does not change during the transaction (which seems strange but is completely allowed), we can make `HiveMetadata` implement the `ConnectorMetadata.beginQuery()` method, which would throw if the username changed.
To get the `ConnectorSession` object in the `beginTransaction` method, I assume that would require changing the interface of `Connector`?
I was also looking through where the `beginTransaction` method is called, and I'm not sure I see where we can pass the `ConnectorSession` object to it. The class `InMemoryTransactionManager` has the function `createConnectorTransactionMetadata(CatalogName catalogName, Catalog catalog)`, which calls the `beginTransaction` method. From there, it looks like one of the places it is called is `Session.beginTransactionId`, which appears to be the object you'd call to get the `ConnectorSession`?
So along your line of thinking, in the `beginTransaction` method we would call a method like `HiveMetastoreFactory.get(ConnectorSession connectorSession)`, and `HiveMetastoreFactory` would be an interface with each of the metastore types implementing their own version. Then we could modify the `HiveConnector` constructor to take the factory as an argument, so we can create the injector in the `HiveModule.configure` method and retrieve it in the `HiveConnectorFactory.create` method.
> Another implementation issue is how to make the credentials configurable per request in the AWS client without needing to create a new client (with its associated connection pools and executors) for each request. Looking at the code, it calls the credentials provider for each request, so (unfortunately) the best way might be to use a thread local to wrap the calls with the correct credentials.
That’s actually not quite true: the base AWS SDK request type supports setting a credentials provider per request. The client’s credentials provider is only called and used if the request doesn’t have one set directly by the time execution starts, so there shouldn’t be a strict need for a client-level `ThreadLocal` to make this possible.
@pettyjamesm You're right: the request object passed to each method extends `AmazonWebServiceRequest`, which has `setRequestCredentialsProvider()`. Thanks for pointing that out; it makes this much easier.

I think the better method would be `withRequestCredentialsProvider()`, as it would allow a small modification to the builder chain of each new `<method>Request()` (example here). Regardless, it definitely looks like we can set it on a per-request basis.
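To illustrate the resolution rule described above, here is a miniature sketch with stand-in types (these are not the real AWS SDK classes; only the precedence behavior is modeled): a per-request credentials provider, when set, takes precedence over the one configured on the client.

```java
// Stand-in for the SDK's credentials provider; the real one vends AWSCredentials.
interface AWSCredentialsProvider {
    String label();
}

// Miniature stand-in for the SDK v1 base request type.
abstract class AmazonWebServiceRequest {
    private AWSCredentialsProvider requestCredentialsProvider;

    // Mirrors the SDK's chainable withRequestCredentialsProvider().
    @SuppressWarnings("unchecked")
    <T extends AmazonWebServiceRequest> T withRequestCredentialsProvider(AWSCredentialsProvider provider) {
        this.requestCredentialsProvider = provider;
        return (T) this;
    }

    AWSCredentialsProvider getRequestCredentialsProvider() {
        return requestCredentialsProvider;
    }
}

class GetTableRequest extends AmazonWebServiceRequest {
    String databaseName;
    String name;

    GetTableRequest withDatabaseName(String databaseName) {
        this.databaseName = databaseName;
        return this;
    }

    GetTableRequest withName(String name) {
        this.name = name;
        return this;
    }
}

class GlueClientStandIn {
    private final AWSCredentialsProvider clientProvider;

    GlueClientStandIn(AWSCredentialsProvider clientProvider) {
        this.clientProvider = clientProvider;
    }

    // The resolution rule: the request-level provider wins,
    // and the client-level provider is only the fallback.
    AWSCredentialsProvider resolveCredentials(AmazonWebServiceRequest request) {
        AWSCredentialsProvider perRequest = request.getRequestCredentialsProvider();
        return perRequest != null ? perRequest : clientProvider;
    }
}
```

With this shape, each metastore method could append `.withRequestCredentialsProvider(userProvider)` to its existing builder chain without touching the shared client at all.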
Instead of setting the Glue metastore on a `ThreadLocal` object, we could then create a credentials factory and get the credentials provider by specifying a `PrincipalPrivileges` class with each method. However, I'm thinking that would require us to modify the `HiveMetastore` interface to pass the `PrincipalPrivileges`. The other idea is to create a thread-local `GlueHiveMetastore`, initialize it with the `AsyncGlueClient` and a set of custom credentials for that specific user/transaction, and then attach the credentials to each request.

Personally, I think I like the latter better, as it wouldn't require such extensive changes to the interface, unless you see another way that I'm missing?
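For reference, a minimal sketch of that latter idea, again with hypothetical stand-in types (the real `GlueHiveMetastore` and `AsyncGlueClient` are much more involved): a thread-local metastore built from the shared client plus the caller's credentials, cleared afterwards so credentials never leak across pooled threads.

```java
// Stand-in for the shared heavyweight Glue client, created once.
class AsyncGlueClientStandIn {}

class PerUserGlueMetastore {
    final AsyncGlueClientStandIn client;
    final String credentialsId; // stand-in for the per-user credentials provider

    PerUserGlueMetastore(AsyncGlueClientStandIn client, String credentialsId) {
        this.client = client;
        this.credentialsId = credentialsId;
    }
}

class MetastoreHolder {
    private static final AsyncGlueClientStandIn SHARED_CLIENT = new AsyncGlueClientStandIn();
    private static final ThreadLocal<PerUserGlueMetastore> CURRENT = new ThreadLocal<>();

    // Called at the start of a transaction with that user's credentials.
    static void begin(String credentialsId) {
        CURRENT.set(new PerUserGlueMetastore(SHARED_CLIENT, credentialsId));
    }

    static PerUserGlueMetastore current() {
        return CURRENT.get();
    }

    // Must be called when the transaction ends: worker threads are pooled,
    // so a stale value would leak one user's credentials to the next query.
    static void end() {
        CURRENT.remove();
    }
}
```

The `end()` cleanup is the main hazard of this approach, which is part of why the per-transaction factory discussed earlier may end up being the cleaner design.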
Agreed, the `with` version is better as it allows chaining.

Rather than changing the `HiveMetastore` interface, I think we should proceed with the `HiveMetastoreFactory` idea, where we create it once per transaction after making the required changes to `HiveConnector.beginTransaction()` to pass the session. I can look into the difficulty of changing the transaction manager to allow this.
There's a circular dependency in passing `ConnectorSession` to `Connector.beginTransaction()`: the session contains catalog session properties, which have to be validated using `ConnectorAccessControl.checkCanSetCatalogSessionProperty()` before the session is created.
When using AWS Glue as the metastore for Hive, it would be useful to be able to assume different IAM roles for different users. That would allow system administrators to create and delete tables while regular users are only allowed to query those tables. Another use is a `SHOW SCHEMAS IN hive` query, where one could set the IAM role so that only the tables the user has permission to read are listed. See this AWS document for some more in-depth examples. It would keep all of the permissions inside the AWS ecosystem instead of creating another authentication layer inside Presto. Right now, Presto assumes the role attached to the instance to make Glue API calls, giving everyone in the cluster the same permissions.

My initial idea is to modify the metastore connector to be similar to the `FileSystem` class, where a new client is created for every unique request (unique in this case would probably mean per user). The system would be similar to the S3 authentication, where one can specify `hive.use-instance-credentials` or a custom credentials provider for the class to use. Would this be a feature that would be possible to implement in Presto? I'm thinking it would have to happen where the metastore object is called, but I'm not sure what the implications are for other metastores, or how to handle the fact that others will be using static system credentials.