prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0

Decryption problem running presto queries with AWS client side encryption using KMS #2945

Closed pandeyp closed 7 years ago

pandeyp commented 9 years ago

I used your latest script, which successfully installs Presto server (version 0.99) and Java 8 on an Amazon EMR instance (3.7.0). My data files are located in an S3 bucket and were encrypted client-side with a customer-managed key. When I create a Hive table that references those encrypted data files in S3, Hive can successfully decrypt the records and display them in the console. However, when I view the same external table from the Presto command line interface, the data is displayed in its encrypted form. I looked at the release notes at https://prestodb.io/docs/current/release/release-0.57.html and added those properties to my hive.properties file, which now looks like this:

hive.s3.connect-timeout=2m
hive.s3.max-backoff-time=10m
hive.s3.max-error-retries=50
hive.metastore-refresh-interval=1m
hive.s3.max-connections=500
hive.s3.max-client-retries=50
connector.name=hive-hadoop2
hive.s3.socket-timeout=2m
hive.s3.aws-access-key=***
hive.s3.aws-secret-key=***
hive.metastore.uri=thrift://localhost:9083
hive.metastore-cache-ttl=20m
hive.s3.staging-directory=/mnt/tmp/
hive.s3.use-instance-credentials=true

Any help on how to decrypt the files using the Presto CLI would be much appreciated.

electrum commented 9 years ago

When you say that Hive can read them, is that using Amazon's custom S3 FileSystem for Hadoop? Do you know where the client side key is stored?

pandeyp commented 9 years ago

Thanks for your prompt reply. Yes, the files for our Hive tables are located in Amazon's S3 filesystem. We have encrypted files in S3 using S3 client side encryption with AWS Key Management Service (KMS) using AWS SDK for Java as outlined in http://docs.aws.amazon.com/AmazonS3/latest/dev/client-side-using-kms-java.html.

The SDK encrypts files using an AWS KMS customer master key ID and places them in the given S3 bucket. In AWS KMS you can attach several users to this master key ID to enable them to encrypt and decrypt the data. The data in the S3 bucket can be accessed by specifying the aws-access-key and aws-secret-key given to each user.

To copy these encrypted files to EMR's HDFS, I used s3distcp and added it as a step, as described in http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html. After I ran aws configure to set the AWS access key ID and AWS secret access key, using add-step successfully decrypted the data files and copied them to EMR's HDFS.

With EMR Hive, we initially couldn't access or decrypt the data in S3. The error trace was: java.io.IOException: com.amazonaws.AmazonServiceException: The ciphertext references a key that either does not exist or you do not have access to. (Service: AWSKMS; Status Code: 400; Error Code: AccessDeniedException; Request ID: 76c9ed32-e8ef-11e4-b06e-1d1eec85421f)

We had to attach two EC2 roles, namely EMR_DefaultRole and EMR_EC2_DefaultRole, to our master key ID, and then the files were decrypted.

pandeyp commented 9 years ago

Any updates on this?

mombergm commented 9 years ago

My understanding of the Presto S3 library at the moment is that only S3 server side encryption is supported and not S3 client side encryption. We will need to expand on the PrestoS3FS in order to support client side / KMS encryption. @electrum, what do you think?

electrum commented 9 years ago

Do you know what changes are required to make this work?

mombergm commented 9 years ago

I haven't done this in Java myself yet, but my understanding is that you would instantiate an AmazonS3EncryptionClient instead of the normal AmazonS3Client. doc ref

It might be as simple as adding a few new config params and then, based on those, creating the appropriate client in createAmazonS3Client() in PrestoS3FileSystem.

I doubt it would be that simple though. I'll try to find out how EMRFS does it.

swaranga commented 7 years ago

Any updates? Is this planned for any future release? The way EMRFS does it is to allow a custom EncryptionMaterialsProvider to be specified at cluster creation time, which it then uses to create the AmazonS3EncryptionClient. This allows clients to use any encryption solution, not just KMS-based encryption.

electrum commented 7 years ago

This is supported now by setting hive.s3.encryption-materials-provider: https://prestodb.io/docs/current/connector/hive.html#s3-data-encryption
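
For readers landing here, a minimal sketch of what the catalog file might look like with that property set. The class name com.example.MyEncryptionMaterialsProvider is hypothetical (an assumption for illustration only); the other two lines are taken from the hive.properties posted at the top of this issue. See the linked docs section for the exact requirements on the provider class.

```properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
# Hypothetical class name, for illustration only. Per the linked docs, the value
# should be the fully qualified name of a class implementing the AWS SDK's
# EncryptionMaterialsProvider interface, available to the Hive connector.
hive.s3.encryption-materials-provider=com.example.MyEncryptionMaterialsProvider
```

Comments are kept on their own lines because Java properties files do not treat a trailing `#` as a comment; it would become part of the value.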