prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0

Reading large Parquet file with Parquet modular encryption fails #22703

Open dieu-nguyen opened 1 month ago

dieu-nguyen commented 1 month ago

I use Presto to read Parquet files in HDFS. The Parquet files have Parquet modular encryption (PME) enabled. Reading small files works fine, but reading a large file fails in the decrypt function. Presto shows the error: Query 20240509_030132_00001_r659k failed: GCM tag check failed
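For reference, "GCM tag check failed" is the message parquet-mr surfaces when the AES-GCM authentication tag on an encrypted module fails to verify, which is exactly what happens if a reader decrypts bytes starting at the wrong offset. Here is a minimal standalone sketch (plain `javax.crypto`, not Presto or parquet-mr code) of how a misaligned read surfaces as a tag failure rather than garbled plaintext:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class GcmTagDemo {
    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();
        byte[] nonce = new byte[12]; // fixed nonce, fine for a one-shot demo

        Cipher enc = Cipher.getInstance("AES/GCM/NoPadding");
        enc.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, nonce));
        byte[] ciphertext = enc.doFinal("parquet page bytes".getBytes(StandardCharsets.UTF_8));

        // Drop the first byte, as if the reader started one byte past the
        // real module boundary.
        byte[] misaligned = Arrays.copyOfRange(ciphertext, 1, ciphertext.length);

        Cipher dec = Cipher.getInstance("AES/GCM/NoPadding");
        dec.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, nonce));
        dec.doFinal(misaligned); // throws javax.crypto.AEADBadTagException
    }
}
```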

Your Environment

Expected Behavior

The data is returned to the client.

Current Behavior

The query fails in the decrypt function with "GCM tag check failed".

Possible Solution

TBD

Steps to Reproduce

  1. Prepare data: two Parquet files with PME enabled, one larger than the HDFS block size and one smaller.
  2. Put the 2 files into HDFS and create Hive external tables:
    • [screenshot: the files in HDFS]
    • Create table query:
      
      CREATE EXTERNAL TABLE `test_schema`.`customers`(
      `Index` string,
      `Customer Id` string,
      `First Name` string,
      `Last Name` string,
      `Company` string,
      `City` string,
      `Country` string,
      `Phone 1` string,
      `Phone 2` string,
      `Email` string,
      `Subscription Date` string,
      `Website` string,
      `dict_col_1` string,
      `dict_col_2` string)
      ROW FORMAT SERDE
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
      STORED AS INPUTFORMAT
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
      OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
      LOCATION
      'hdfs://hdfshost/tmp/test/customers';

CREATE EXTERNAL TABLE `test_schema`.`customers_light`(
`Index` string,
`Customer Id` string,
`First Name` string,
`Last Name` string,
`Company` string,
`City` string,
`Country` string,
`Phone 1` string,
`Phone 2` string,
`Email` string,
`Subscription Date` string,
`Website` string,
`dict_col_1` string,
`dict_col_2` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'hdfs://hdfshost/tmp/test/customers_light';

3. Start Presto with the following hive.properties and core-site.xml config files:
- hive.properties

connector.name=hive-hadoop2
hive.metastore.uri=thrift://hdfshost:9083
hive.config.resources=/Users/lap15954-local/Data/dp-presto/presto-main/etc/hadoop/core-site.xml,/Users/lap15954-local/Data/dp-presto/presto-main/etc/hadoop/hdfs-site.xml
hive.hdfs.impersonation.enabled=false
hive.hdfs.authentication.type=NONE
hive.parquet.use-column-names=true

- core-site.xml
<property>
    <name>parquet.encryption.kms.client.class</name>
    <value>org.apache.parquet.crypto.aws.InMemoryKMS</value>
</property>
<property>
    <name>parquet.encryption.key.list</name>
    <value>keyA:AAECAwQFBgcICQoLDA0ODw== , keyB:AAECAAECAAECAAECAAECAA==</value>
</property>
<property>
    <name>parquet.crypto.factory.class</name>
    <value>org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory</value>
</property>
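For reference, each entry in `parquet.encryption.key.list` has the form `<keyId>:<base64-encoded master key>`. A minimal sketch for generating such an entry (standard JDK only; the key id `keyA` just mirrors the config above):

```java
import java.security.SecureRandom;
import java.util.Base64;

public class KeyListEntry {
    public static void main(String[] args) {
        byte[] key = new byte[16]; // 16 bytes = AES-128 master key
        new SecureRandom().nextBytes(key);
        // Prints "keyA:<base64>" in the format parquet.encryption.key.list expects
        System.out.println("keyA:" + Base64.getEncoder().encodeToString(key));
    }
}
```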

4. Query data using presto-cli

./target/presto-cli-0.283-executable.jar

use hive.test_schema;
select * from test_schema.customers;
select * from test_schema.customers_light;


- Querying a column that is not encrypted works fine: [screenshot]
- Querying an encrypted column fails: [screenshot]
- But the small-file table is completely fine: [screenshots]

- Related stack trace:  [presto_error_log](https://pastebin.com/HjQt0qcF)

dieu-nguyen commented 1 month ago

I tried varying the file size, and the threshold seems to be 128MB (my HDFS block size): files <= 128MB are fine, files > 128MB fail.
When a file is larger than 128MB, it is stored across multiple HDFS blocks; could that make the dictionary page unavailable to the reader of the second block?
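For illustration, Hive-style Parquet readers conventionally assign each row group to whichever split contains the row group's midpoint, so every row group (and its dictionary page) should be read and decrypted by exactly one reader. A minimal sketch of that rule (an illustration of the convention, not Presto's actual split code):

```java
public class SplitAssignment {
    // A row group belongs to the split containing its midpoint, so no row
    // group is ever torn across two readers.
    static boolean splitOwnsRowGroup(long splitStart, long splitLength,
                                     long rowGroupStart, long rowGroupLength) {
        long mid = rowGroupStart + rowGroupLength / 2;
        return mid >= splitStart && mid < splitStart + splitLength;
    }

    public static void main(String[] args) {
        long mb = 1L << 20;
        // A row group straddling the 128MB block boundary is owned by the
        // first split only, never by both.
        System.out.println(splitOwnsRowGroup(0, 128 * mb, 110 * mb, 16 * mb));        // true
        System.out.println(splitOwnsRowGroup(128 * mb, 128 * mb, 110 * mb, 16 * mb)); // false
    }
}
```

If a reader instead clipped its reads at the split boundary, the encrypted module straddling that boundary would be truncated, and the GCM tag check would fail exactly as observed here.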

dieu-nguyen commented 1 month ago

@shangxinli, please take a look.

dieu-nguyen commented 1 month ago

Hi there, any update on this @tdcmeehan? Meanwhile, I have done a lot of testing and found a config that works around the problem. I set this config to true:

hive.order-based-execution-enabled=true

Presto reads a file in HDFS by creating multiple splits, and this process divides the Parquet file into multiple parts. With PME enabled, each page becomes an indivisible unit, because the whole byte range is needed to decrypt the data. So I think there is something wrong with the split creation process.

This config makes the Hive files non-splittable, which bypasses the splitting process and makes everything work fine.