Closed jcgomes90 closed 5 years ago
Wow! Thanks for reporting the error. I am not sure what system you use to read Parquet files, but it apparently uses parquet-mr.
It looks like the problem is the use of dictionary encoding for the Boolean type, which parquet-mr does not seem to support. See below:
When I write a file containing a single boolean column with dictionary encoding enabled, I get the following:
$ RUST_BACKTRACE=1 cargo run --bin parquet-write
Compiling parquet v0.4.2 (/parquet-rs)
Finished dev [unoptimized + debuginfo] target(s) in 2.56s
Running `target/debug/parquet-write`
File is written!
Reading the file using parquet-read:
$ RUST_BACKTRACE=1 cargo run --bin parquet-read ./sample.parquet
Compiling parquet v0.4.2 (/parquet-rs)
Finished dev [unoptimized + debuginfo] target(s) in 5.60s
Running `target/debug/parquet-read ./sample.parquet`
{a: true}
But Spark gives the error:
org.apache.parquet.io.ParquetDecodingException: Dictionary encoding not supported for type: BOOLEAN
at org.apache.parquet.column.Encoding$1.initDictionary(Encoding.java:104)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.<init>(VectorizedColumnReader.java:103)
When I disable dictionary encoding, Spark reads the file and I get the following output:
18/11/28 09:09:22 INFO DAGScheduler: Job 3 finished: show at <console>:24, took 0.176370 s
18/11/28 09:09:22 INFO CodeGenerator: Code generated in 26.681134 ms
+----+
| a|
+----+
|true|
+----+
So the idea is to disable dictionary encoding, either for the entire schema or for a single column. To do that, use either:
use parquet::file::properties::WriterProperties;
let props = WriterProperties::builder()
    .set_dictionary_enabled(false)
    .build();
or
use parquet::schema::types::ColumnPath;
let props = WriterProperties::builder()
    .set_column_dictionary_enabled(ColumnPath::from("a"), false)
    .build();
Full API: https://sunchao.github.io/parquet-rs/master/parquet/file/properties/index.html
Try that and see if it works for you. Meanwhile, I will try to fix it in the codebase.
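To make the difference between the two settings concrete, here is a minimal stand-alone sketch of how a file-wide default and a per-column override could be resolved. `DictionarySettings`, `with_column`, and `dictionary_enabled` are hypothetical names invented for illustration; this is a simplified model using only the standard library, not the actual parquet-rs `WriterProperties` implementation.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified model of dictionary settings.
/// This is NOT the parquet-rs `WriterProperties` type.
#[derive(Debug)]
pub struct DictionarySettings {
    /// Applied to every column that has no explicit override.
    default_enabled: bool,
    /// Per-column overrides, keyed by column path such as "a".
    per_column: HashMap<String, bool>,
}

impl DictionarySettings {
    /// Create settings with a file-wide default and no overrides.
    pub fn new(default_enabled: bool) -> Self {
        Self {
            default_enabled,
            per_column: HashMap::new(),
        }
    }

    /// Builder-style setter that records an override for one column.
    pub fn with_column(mut self, path: &str, enabled: bool) -> Self {
        self.per_column.insert(path.to_string(), enabled);
        self
    }

    /// The per-column override wins; otherwise fall back to the default.
    pub fn dictionary_enabled(&self, path: &str) -> bool {
        self.per_column
            .get(path)
            .copied()
            .unwrap_or(self.default_enabled)
    }
}
```

In this model, `DictionarySettings::new(true).with_column("a", false)` turns dictionary encoding off for column `a` only, while `DictionarySettings::new(false)` turns it off for every column.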
@sunchao It looks like parquet-mr does not apply dictionary encoding for boolean fields at all. Here is what it does:
Should we do the same thing? I can patch it quickly.
Yes, please go ahead. Thanks. I guess it doesn't make much sense to use dictionary encoding for boolean types.
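The behaviour being proposed here (never applying dictionary encoding to boolean columns, even when it is requested) can be sketched roughly as follows. This is an illustration of the idea only, not the actual patch: `PhysicalType`, `ChosenEncoding`, and `choose_encoding` are hypothetical names, and falling back to plain encoding is an assumption made for the example.

```rust
/// Parquet physical types, redeclared here so the sketch is self-contained.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PhysicalType {
    Boolean,
    Int32,
    Int64,
    Int96,
    Float,
    Double,
    ByteArray,
    FixedLenByteArray,
}

/// Hypothetical two-way choice; real writers support more encodings.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChosenEncoding {
    Plain,
    Dictionary,
}

/// Decide the encoding for a column, ignoring the dictionary request
/// for booleans so that readers such as parquet-mr can decode the file.
pub fn choose_encoding(
    physical_type: PhysicalType,
    dictionary_requested: bool,
) -> ChosenEncoding {
    match physical_type {
        // Booleans never use a dictionary (assumed fallback: plain).
        PhysicalType::Boolean => ChosenEncoding::Plain,
        // Every other type honours the caller's request.
        _ if dictionary_requested => ChosenEncoding::Dictionary,
        _ => ChosenEncoding::Plain,
    }
}
```

With a rule like this in the writer, users would not need to disable dictionary encoding by hand for boolean columns.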
I have a schema with a field of type boolean defined
I am using the BoolColumnWriter to write the boolean data:
When I try to read the generated Parquet file, I get the following exception: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file
I can't seem to get past this issue. Any idea why this is happening?