prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0

[native] Native workers not writing Parquet data files for WriterVersion v1 (PARQUET_1_0) #22595

Open agrawalreetika opened 2 months ago

agrawalreetika commented 2 months ago

Native worker not writing Parquet data files for WriterVersion v1 (PARQUET_1_0)

Your Environment

Expected Behavior

After running set session hive.parquet_writer_version='PARQUET_1_0'; Parquet data files should be written with format_version 1.

Current Behavior

Even after running set session hive.parquet_writer_version='PARQUET_1_0'; Parquet data files are still written with format_version 2.6.

Possible Solution

Steps to Reproduce

presto:reetika_testdb> set session hive.parquet_writer_version='PARQUET_1_0';
SET SESSION

presto:reetika_testdb> create table hive.reetika_testdb.test_insert (id int) with (format = 'Parquet');
CREATE TABLE

presto:reetika_testdb> insert into hive.reetika_testdb.test_insert values(1);
INSERT: 1 row

Sample metadata output for the written Parquet file:

############ file meta data ############
created_by: parquet-cpp-velox
num_columns: 1
num_rows: 1
num_row_groups: 1
format_version: 2.6
serialized_size: 146

############ Columns ############
id

############ Column(id) ############
name: id
path: id
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: -56%)
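
For anyone who wants to double-check the format_version line above without an Arrow-based tool, here is a minimal sketch in standard-library Python that reads the raw version integer from the file footer. It assumes an unencrypted file and assumes the writer emits the version field first in the Thrift compact-encoded footer, which is the usual field order but is not verified here. The names read_varint, footer_version and fake_parquet are hypothetical helpers made up for this sketch. The footer stores only 1 or 2; to my understanding, Arrow-based readers report a stored 2 as a 2.x version such as the 2.6 shown above.

```python
import struct

MAGIC = b"PAR1"  # magic bytes at the start and end of an unencrypted Parquet file


def read_varint(buf, pos):
    """Decode an unsigned LEB128 varint at buf[pos]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def footer_version(data):
    """Return the raw version integer (1 or 2) stored in a Parquet footer.

    `data` is the full content of the file as bytes.
    """
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not an unencrypted Parquet file")
    # The last 8 bytes are: little-endian footer length (4 bytes) + magic.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    footer_start = len(data) - 8 - footer_len
    if footer_start < 4:
        raise ValueError("corrupt footer length")
    footer = data[footer_start:len(data) - 8]
    # Thrift compact protocol: a short field header is (id delta << 4) | type.
    # Field id 1 with type i32 (compact type 5) is therefore the byte 0x15.
    # Assumption: the writer emits the version field first.
    if len(footer) < 2 or footer[0] != 0x15:
        raise ValueError("unexpected first field in footer")
    zigzag, _ = read_varint(footer, 1)
    # i32 values are zigzag-encoded; undo that to get the signed integer.
    return (zigzag >> 1) ^ -(zigzag & 1)


def fake_parquet(version):
    """Build a tiny synthetic byte string with only a version field in the footer.

    Hypothetical helper for trying out footer_version; the result is NOT a
    valid Parquet file because the other required footer fields are missing.
    """
    footer = bytes([0x15, version << 1, 0x00])  # field 1, zigzag value, STOP
    return MAGIC + footer + struct.pack("<I", len(footer)) + MAGIC
```

To use it on a real data file, read the file in binary mode and pass the bytes to footer_version. For large files it would be better to seek to the end and read only the footer, which this sketch skips for brevity.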

Context

It looks like the parquet_writer_version session property is not honored in Prestissimo. The same setting works fine with the Java Parquet writer.

majetideepak commented 2 months ago

Velox uses the Arrow Parquet writer. I see that there is an option to specify V1 (https://github.com/apache/arrow/blob/main/cpp/src/parquet/properties.h). Let's add it to Velox. Can you point me to a test for V1 vs V2?

svm1 commented 1 month ago

Fix in progress - https://github.com/facebookincubator/velox/pull/9700