sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0

ParquetDecodingException When Writing Boolean #195

Closed jcgomes90 closed 5 years ago

jcgomes90 commented 5 years ago

I have a schema with a single field of type boolean:

let message = "message schema {
  REQUIRED BOOLEAN a;
}";

I am using the BoolColumnWriter to write the boolean data:

match col_writer {
    ColumnWriter::BoolColumnWriter(ref mut typed_writer) => {
        typed_writer.write_batch(&[true], None, None).unwrap();
    }
    _ => {}
}

When I try to read the generated Parquet file, I get the following exception: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file

I can't seem to get past this issue. Any idea why this is happening?

sadikovi commented 5 years ago

Wow! Thanks for reporting the error. I am not sure what system you use to read Parquet files, but it apparently uses parquet-mr.

It looks like the problem is that dictionary encoding is being used for the Boolean type, which parquet-mr does not seem to support. See below:

When I write a file containing a single boolean with dictionary encoding enabled, I get the following:

$ RUST_BACKTRACE=1 cargo run --bin parquet-write
   Compiling parquet v0.4.2 (/parquet-rs)
    Finished dev [unoptimized + debuginfo] target(s) in 2.56s
     Running `target/debug/parquet-write`
File is written!

Reading the file using parquet-read works:

$ RUST_BACKTRACE=1 cargo run --bin parquet-read ./sample.parquet 
   Compiling parquet v0.4.2 (/parquet-rs)
    Finished dev [unoptimized + debuginfo] target(s) in 5.60s
     Running `target/debug/parquet-read ./sample.parquet`
{a: true}

But Spark gives the error:

org.apache.parquet.io.ParquetDecodingException: Dictionary encoding not supported for type: BOOLEAN
    at org.apache.parquet.column.Encoding$1.initDictionary(Encoding.java:104)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.<init>(VectorizedColumnReader.java:103)

When I disable dictionary encoding, Spark reads the file and I get the following output:

18/11/28 09:09:22 INFO DAGScheduler: Job 3 finished: show at <console>:24, took 0.176370 s
18/11/28 09:09:22 INFO CodeGenerator: Code generated in 26.681134 ms
+----+
|   a|
+----+
|true|
+----+

sadikovi commented 5 years ago

So the workaround is to disable dictionary encoding, either for the entire schema or for a single column. To do that, use either:

use parquet::file::properties::WriterProperties;
let props = WriterProperties::builder()
  .set_dictionary_enabled(false)
  .build();

or

use parquet::schema::types::ColumnPath;
let props = WriterProperties::builder()
  .set_column_dictionary_enabled(ColumnPath::from("a"), false)
  .build();

Full API: https://sunchao.github.io/parquet-rs/master/parquet/file/properties/index.html

Try that and see if it works for you. Meanwhile, I will try to fix it in the codebase.

sadikovi commented 5 years ago

@sunchao It looks like parquet-mr does not apply dictionary encoding for boolean fields at all. Here is what it does:

V1 - PLAIN: https://github.com/apache/parquet-mr/blob/dc61e510126aaa1a95a46fe39bf1529f394147e9/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV1ValuesWriterFactory.java#L74

V2 - RLE: https://github.com/apache/parquet-mr/blob/dc61e510126aaa1a95a46fe39bf1529f394147e9/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L78

Should we do the same thing? I can patch it quickly.
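For illustration, the rule described above (never pick dictionary encoding for booleans; use PLAIN for V1 and RLE for V2) could be sketched roughly as follows. The enums and the `choose_encoding` function are hypothetical, simplified stand-ins invented for this example. They are not the actual parquet-rs or parquet-mr types, and the fallback for other types is deliberately simplified.

```rust
// Hypothetical, simplified types for illustration only; these are NOT the
// real parquet-rs or parquet-mr definitions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PhysicalType {
    Boolean,
    Int32,
    ByteArray,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WriterVersion {
    V1,
    V2,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Encoding {
    Plain,
    Rle,
    Dictionary,
}

/// Picks an encoding for a column. Booleans ignore the dictionary setting:
/// PLAIN for V1 and RLE for V2, as in the parquet-mr factories linked above.
/// For every other type this sketch just honors `dictionary_enabled` and
/// otherwise falls back to PLAIN (an assumption; real writers may choose
/// other fallback encodings).
fn choose_encoding(
    physical_type: PhysicalType,
    version: WriterVersion,
    dictionary_enabled: bool,
) -> Encoding {
    match (physical_type, version) {
        (PhysicalType::Boolean, WriterVersion::V1) => Encoding::Plain,
        (PhysicalType::Boolean, WriterVersion::V2) => Encoding::Rle,
        _ if dictionary_enabled => Encoding::Dictionary,
        _ => Encoding::Plain,
    }
}

fn main() {
    let cases = [
        (PhysicalType::Boolean, WriterVersion::V1, true),
        (PhysicalType::Boolean, WriterVersion::V2, true),
        (PhysicalType::Int32, WriterVersion::V1, true),
        (PhysicalType::ByteArray, WriterVersion::V2, false),
    ];
    for (ty, version, dict) in cases.iter() {
        let encoding = choose_encoding(*ty, *version, *dict);
        println!("{:?} / {:?} / dictionary={} -> {:?}", ty, version, dict, encoding);
    }
}
```

With a rule like this in the writer, a boolean column would never get a dictionary page, even when the user leaves dictionary encoding enabled globally.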

sunchao commented 5 years ago

Yes, please go ahead. Thanks. I guess it doesn't make much sense to use dictionary encoding for boolean types anyway.
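A rough back-of-the-envelope sketch of why that is: PLAIN-encoded booleans are already bit-packed at 1 bit per value, and a dictionary with two entries (true and false) also needs a 1-bit index per value, plus the dictionary itself. The helper `index_bit_width` and the size estimate below are hypothetical and ignore page headers and run-length compression of the indices, so treat the numbers as an approximation only.

```rust
/// Number of bits needed to index a dictionary with `entries` distinct
/// values, i.e. ceil(log2(entries)). Hypothetical helper for illustration.
fn index_bit_width(entries: u64) -> u32 {
    if entries <= 1 {
        0
    } else {
        64 - (entries - 1).leading_zeros()
    }
}

fn main() {
    let num_values: u64 = 1_000;
    let distinct_values: u64 = 2; // a column that contains both true and false

    // PLAIN: booleans are bit-packed, 1 bit per value.
    let plain_bits = num_values;

    // Dictionary (simplified): one index per value, plus the dictionary
    // entries themselves stored as 1-bit booleans.
    let dictionary_bits =
        num_values * u64::from(index_bit_width(distinct_values)) + distinct_values;

    println!("PLAIN:      ~{} bits", plain_bits);
    println!("Dictionary: ~{} bits", dictionary_bits);
}
```

Under these assumptions the dictionary version is never smaller for a column that contains both values, which matches the intuition that a dictionary only pays off when the values are wider than the indices that replace them.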