trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Custom Credential Provider for Glue Metastore #966

Open cdw9bf opened 5 years ago

cdw9bf commented 5 years ago

When using AWS Glue as the metastore for Hive, it would be useful to be able to assume different IAM roles for different users. It would give system administrators the ability to create and delete tables while regular users are only allowed to query those tables. Another use is a SHOW SCHEMAS IN hive query, where the IAM role can be set so that only the tables the user has permission to read are listed. See this AWS document for some more in-depth examples. This would keep all of the permissions inside the AWS ecosystem instead of creating another authentication layer inside of Presto.

Right now, Presto assumes the role attached to the instance to make Glue API calls, giving everyone in the cluster access to the same permissions.

My initial idea is to modify the metastore connector to be similar to the FileSystem class, where a new client is created for every unique request (unique in this case would probably mean per user). The system would be similar to the S3 authentication, where one can specify hive.use-instance-credentials or a custom credential provider class for the connector to use.

Would this be a feature that would be possible to implement in Presto? I'm thinking it would have to happen when the metastore object is called, but I'm not sure what the implications are for other metastores, or how to handle the fact that others will be using static system credentials.
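For reference, the S3-style extension point is just the AWS SDK's AWSCredentialsProvider interface, so a per-user provider for Glue could look something like this sketch (the class name, the role-naming convention, and how the user gets passed in are all made up for illustration):

```java
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.AWSCredentialsProvider;
import com.amazonaws.auth.STSAssumeRoleSessionCredentialsProvider;
import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClientBuilder;

// Hypothetical per-user provider: assumes an IAM role derived from the querying
// user and delegates to the STS provider for the actual keys.
public class PerUserGlueCredentialsProvider
        implements AWSCredentialsProvider
{
    private final AWSCredentialsProvider delegate;

    public PerUserGlueCredentialsProvider(String user)
    {
        // the role-naming convention here is purely illustrative
        String roleArn = "arn:aws:iam::123456789012:role/glue-" + user;
        this.delegate = new STSAssumeRoleSessionCredentialsProvider.Builder(roleArn, "presto-glue-" + user)
                .withStsClient(AWSSecurityTokenServiceClientBuilder.defaultClient())
                .build();
    }

    @Override
    public AWSCredentials getCredentials()
    {
        return delegate.getCredentials();
    }

    @Override
    public void refresh()
    {
        delegate.refresh();
    }
}
```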

electrum commented 5 years ago

This sounds like a good feature. There are some implementation difficulties around how SemiTransactionalHiveMetastore works. I need to think about this more before writing a longer response.

Another implementation issue is how to make the credentials configurable per request in the AWS client without needing to create a new client (with associated connection pools and executors) for each request. Looking at the code, it calls the credentials provider for each request, so (unfortunately) the best way might be to use a thread local to wrap the calls with the correct credentials.
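A rough sketch of that thread-local idea, assuming a delegating provider is installed once on the shared client (class and method names here are hypothetical):

```java
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.AWSCredentialsProvider;

// Hypothetical delegating provider: the shared Glue client would be built with this
// once, and callers would set the per-user provider on the thread around each call.
public class ThreadLocalCredentialsProvider
        implements AWSCredentialsProvider
{
    private static final ThreadLocal<AWSCredentialsProvider> CURRENT = new ThreadLocal<>();

    private final AWSCredentialsProvider defaultProvider;

    public ThreadLocalCredentialsProvider(AWSCredentialsProvider defaultProvider)
    {
        this.defaultProvider = defaultProvider;
    }

    public static void setCurrent(AWSCredentialsProvider provider)
    {
        CURRENT.set(provider);
    }

    private AWSCredentialsProvider current()
    {
        AWSCredentialsProvider provider = CURRENT.get();
        return (provider != null) ? provider : defaultProvider;
    }

    @Override
    public AWSCredentials getCredentials()
    {
        return current().getCredentials();
    }

    @Override
    public void refresh()
    {
        current().refresh();
    }
}
```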

cdw9bf commented 5 years ago

My impression from reading through the code is that this wouldn't be a trivial change.

I'll admit I'm pretty novice when it comes to Presto - would it be possible to pass in the Glue configuration details in HiveConnectorFactory where it initializes the connector? Here. Since it loads it in a thread context, does each user (or query?) have their own version of the object where the specific credentials could be specified?

electrum commented 5 years ago

The backend implementation of the metastore is bound in HiveMetastoreModule based on configuration. For Glue, it will bind GlueHiveMetastore for the HiveMetastore interface.

This backend instance gets injected into HiveMetadataFactory, which is used to create instances of SemiTransactionalHiveMetastore for each transaction, triggered by HiveConnector.beginTransaction().

I'm thinking we can change to a model where the HiveMetastore instance is created once per transaction. Instead of binding the backend HiveMetastore instance, we create a new interface HiveMetastoreFactory and bind an implementation of that, e.g., GlueHiveMetastoreFactory. That factory would hold the global shared objects like the heavyweight AWS client.
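A minimal sketch of that split might look like the following (the interface and class names just mirror the description above, and the GlueHiveMetastore constructor shown is an assumption):

```java
// Hypothetical shape of the split; HiveMetastore and ConnectorSession are the existing
// types, and how GlueHiveMetastore takes per-session credentials is an assumption.
public interface HiveMetastoreFactory
{
    HiveMetastore createMetastore(ConnectorSession session);
}

class GlueHiveMetastoreFactory
        implements HiveMetastoreFactory
{
    // shared, heavyweight state: the AWS client with its connection pool and executor
    private final AWSGlueAsync glueClient;

    public GlueHiveMetastoreFactory(AWSGlueAsync glueClient)
    {
        this.glueClient = glueClient;
    }

    @Override
    public HiveMetastore createMetastore(ConnectorSession session)
    {
        // one lightweight metastore per transaction, carrying credentials for the session's user
        return new GlueHiveMetastore(glueClient, credentialsFor(session.getUser()));
    }

    private AWSCredentialsProvider credentialsFor(String user)
    {
        // deriving per-user credentials is exactly the open question in this issue
        throw new UnsupportedOperationException("left unspecified in this sketch");
    }
}
```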

One problem is that Connector.beginTransaction() does not take ConnectorSession, but that's simple to add. That was probably an oversight when we added transaction support. In order to ensure that the username does not change during the transaction (which seems strange but is completely allowed), we can make HiveMetadata implement the ConnectorMetadata.beginQuery() method, which would throw if the username changed.
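The beginQuery() guard could be as small as something like this (sketch only; how the original user is captured and the exception used are placeholders):

```java
// Hypothetical guard in HiveMetadata; 'transactionUser' would be captured when the
// transaction-scoped metastore (and its credentials) is created.
@Override
public void beginQuery(ConnectorSession session)
{
    if (!session.getUser().equals(transactionUser)) {
        // the concrete exception type is a placeholder
        throw new IllegalStateException("User may not change during a transaction");
    }
}
```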

cdw9bf commented 5 years ago

To get the ConnectorSession object in the beginTransaction method, I assume that would require changing the interface of Connector?

I was also looking through where the beginTransaction method is called and I'm not sure I see where we can pass the ConnectorSession object to it. The class InMemoryTransactionManager has the function createConnectorTransactionMetadata(CatalogName catalogName, Catalog catalog) that calls the beginTransaction method. From there, it looks like one of the places it is called is Session.beginTransactionId, which appears to be the object you'd call to get the ConnectorSession?

So along your line of thinking, in the beginTransaction method we would call a method like HiveMetastoreFactory.get(ConnectorSession connectorSession), and HiveMetastoreFactory would be an interface with each of the metastore types implementing their own version. Then we could modify the HiveConnector constructor to take the factory as an argument, so we can create the injector in the HiveModule.configure method and retrieve it in the HiveConnectorFactory.create method.
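In Guice terms, I imagine the binding side of that would be on the order of this sketch (the module name is made up, and the factory types are the hypothetical ones discussed above):

```java
import com.google.inject.Binder;
import com.google.inject.Module;
import com.google.inject.Scopes;

// Hypothetical module: bind the factory as the singleton instead of a single
// shared HiveMetastore instance, so a metastore can be created per transaction.
public class GlueMetastoreFactoryModule
        implements Module
{
    @Override
    public void configure(Binder binder)
    {
        binder.bind(HiveMetastoreFactory.class)
                .to(GlueHiveMetastoreFactory.class)
                .in(Scopes.SINGLETON);
    }
}
```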

pettyjamesm commented 5 years ago

> Another implementation issue is how to make the credentials configurable per request in the AWS client without needing to create a new client (with associated connection pools and executors) for each request. Looking at the code, it calls the credentials provider for each request, so (unfortunately) the best way might be to use a thread local to wrap the calls with the correct credentials.

That's actually not quite true: the base AWS SDK request type supports setting a credentials provider per request. The client's credentials provider is only called if the request doesn't have one set directly by the time execution starts, so there shouldn't be a strict need for a client-level ThreadLocal to make this possible.

electrum commented 5 years ago

@pettyjamesm You're right, the request object passed to each method extends AmazonWebServiceRequest which has setRequestCredentialsProvider(). Thanks for pointing that out, it makes this much easier.

cdw9bf commented 5 years ago

I think the better method would be withRequestCredentialsProvider(), as it would only require a small modification of the builder chain of each new <method>Request() (ex here). Regardless, it definitely looks like we can set it on a per-request basis.
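For example, the per-call-site change would be roughly this (sketch; perUserProvider stands in for whatever per-user/per-transaction provider we end up with):

```java
import com.amazonaws.auth.AWSCredentialsProvider;
import com.amazonaws.services.glue.AWSGlueAsync;
import com.amazonaws.services.glue.model.GetTableRequest;
import com.amazonaws.services.glue.model.Table;

public class PerRequestCredentialsExample
{
    static Table getTable(AWSGlueAsync glue, AWSCredentialsProvider perUserProvider, String databaseName, String tableName)
    {
        // the only change to the existing builder chain is the final with... call
        return glue.getTable(new GetTableRequest()
                .withDatabaseName(databaseName)
                .withName(tableName)
                .withRequestCredentialsProvider(perUserProvider))
                .getTable();
    }
}
```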

Instead of setting the Glue metastore on a ThreadLocal, we could create a credentials factory and get the credential provider by passing a PrincipalPrivileges class to each method. However, I'm thinking that would require us to modify the HiveMetastore interface to pass the PrincipalPrivileges. The other idea is to create a thread-local GlueHiveMetastore, initialize it with the AsyncGlueClient and a set of custom credentials for that specific user/transaction, and then attach the credentials to each request.

Personally, I like the latter better, as it wouldn't require such extensive changes to the interface. Unless you see another way that I'm missing?
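To make the latter concrete, a fragment of what a per-transaction metastore might look like (class, constructor, and helper names are hypothetical; the point is just stamping each outgoing request with the transaction's credentials):

```java
import com.amazonaws.AmazonWebServiceRequest;
import com.amazonaws.auth.AWSCredentialsProvider;
import com.amazonaws.services.glue.AWSGlueAsync;
import com.amazonaws.services.glue.model.GetTableRequest;
import com.amazonaws.services.glue.model.Table;

// Hypothetical per-transaction metastore fragment: the shared async client is reused,
// and every outgoing request is stamped with this transaction's credentials.
public class TransactionScopedGlueMetastore
{
    private final AWSGlueAsync glueClient;
    private final AWSCredentialsProvider credentials;

    public TransactionScopedGlueMetastore(AWSGlueAsync glueClient, AWSCredentialsProvider credentials)
    {
        this.glueClient = glueClient;
        this.credentials = credentials;
    }

    public Table getTable(String databaseName, String tableName)
    {
        return glueClient.getTable(stamp(new GetTableRequest()
                .withDatabaseName(databaseName)
                .withName(tableName)))
                .getTable();
    }

    private <T extends AmazonWebServiceRequest> T stamp(T request)
    {
        request.setRequestCredentialsProvider(credentials);
        return request;
    }
}
```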

electrum commented 5 years ago

Agreed, the with version is better as it allows chaining.

Rather than changing the HiveMetastore interface, I think we should proceed with the HiveMetastoreFactory idea where we create it once per transaction, after making the required changes to HiveConnector.beginTransaction() to pass the session. I can look into the difficulty of changing the transaction manager to allow this.

electrum commented 5 years ago

There's a circular dependency on passing ConnectorSession to Connector.beginTransaction(). The session contains catalog session properties, which have to be validated using ConnectorAccessControl.checkCanSetCatalogSessionProperty() before the session is created.