Open dieu-nguyen opened 1 month ago
I tested different file sizes, and the threshold seems to be 128 MB (my HDFS block size): files <= 128 MB read fine, while files > 128 MB fail.
A file larger than 128 MB is stored in multiple HDFS blocks. Could that make the dictionary page unavailable to the second data block?
@shangxinli, please take a look.
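To make the threshold concrete, here is a minimal sketch of how many HDFS blocks a file occupies for a given block size. The function name `hdfs_block_count` is hypothetical and used only for illustration; the 128 MiB default is just the block size from my setup.

```python
MIB = 1024 * 1024


def hdfs_block_count(file_size_bytes: int, block_size_bytes: int = 128 * MIB) -> int:
    """Return how many HDFS blocks a file of this size occupies (illustrative helper)."""
    if file_size_bytes <= 0:
        return 0
    # Integer ceiling division: a partial last block still counts as one block.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


if __name__ == "__main__":
    print(hdfs_block_count(128 * MIB))      # file exactly at the block size
    print(hdfs_block_count(128 * MIB + 1))  # one byte over the block size
```

A file of exactly 128 MiB fits in one block, and a single extra byte pushes it into a second block, which matches the point where the reads start to fail.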
Hi there, any update on this, @tdcmeehan? Meanwhile, I ran a lot of tests and found a config that works around this problem. I set the following config to true:
hive.order-based-execution-enabled=true
Presto reads a file in HDFS by creating multiple splits, which divides the Parquet file into multiple parts. With Parquet modular encryption (PME) enabled in the file, each page becomes indivisible, because decryption needs all of the page's bytes. So I think something is wrong in the split creation process.
This config makes "hive files become non-splittable", so it bypasses the splitting process and everything works fine.
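The difference between splittable and non-splittable files can be sketched as byte ranges. This is only an illustration of the idea, not Presto's actual split logic: the function `split_ranges` and its parameters are hypothetical, and the 128 MiB split size is an assumption taken from the threshold I observed (Presto's real split sizing is controlled by its own settings).

```python
from typing import List, Tuple

MIB = 1024 * 1024


def split_ranges(
    file_size: int,
    max_split_size: int = 128 * MIB,  # assumed value, not read from Presto
    splittable: bool = True,
) -> List[Tuple[int, int]]:
    """Return (offset, length) byte ranges, one per split (illustrative only)."""
    if file_size <= 0:
        return []
    if not splittable:
        # A non-splittable file is read as one range covering every byte.
        return [(0, file_size)]
    ranges = []
    offset = 0
    while offset < file_size:
        length = min(max_split_size, file_size - offset)
        ranges.append((offset, length))
        offset += length
    return ranges


if __name__ == "__main__":
    print(len(split_ranges(100 * MIB)))                    # small file
    print(len(split_ranges(200 * MIB)))                    # large file, splittable
    print(len(split_ranges(200 * MIB, splittable=False)))  # large file, workaround
```

Under these assumptions, a 200 MiB file yields two ranges when splittable and a single range when not, so with the workaround one reader sees the whole file.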
I use Presto to read Parquet files in HDFS. The Parquet files have Parquet modular encryption enabled. Reading a small file works, but reading a large file fails in the decrypt function. Presto shows this error:
Query 20240509_030132_00001_r659k failed: GCM tag check failed
Your Environment
Expected Behavior
Data should be returned to the client.
Current Behavior
The query fails in the decrypt function.
Possible Solution
TBD
Steps to Reproduce
CREATE EXTERNAL TABLE `test_schema`.`customers_light`(
  `Index` string,
  `Customer Id` string,
  `First Name` string,
  `Last Name` string,
  `Company` string,
  `City` string,
  `Country` string,
  `Phone 1` string,
  `Phone 2` string,
  `Email` string,
  `Subscription Date` string,
  `Website` string,
  `dict_col_1` string,
  `dict_col_2` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs://hdfshost/tmp/test/customers_light';

connector.name=hive-hadoop2
hive.metastore.uri=thrift://hdfshost:9083
hive.config.resources=/Users/lap15954-local/Data/dp-presto/presto-main/etc/hadoop/core-site.xml,/Users/lap15954-local/Data/dp-presto/presto-main/etc/hadoop/hdfs-site.xml
hive.hdfs.impersonation.enabled=false
hive.hdfs.authentication.type=NONE
hive.parquet.use-column-names=true
./target/presto-cli-0.283-executable.jar
use hive.test_schema;
select * from test_schema.customers;
select * from test_schema.customers_light;