sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0
149 stars 20 forks source link

Issue when reading required INT32 column with PLAIN encoding #11

Closed sadikovi closed 7 years ago

sadikovi commented 7 years ago

For required fields with PLAIN encoding, current code read_batch in ColumnReaderImpl goes into infinite loop. It happens when field is required, it does not have repetition/definition levels, therefore values_to_read keeps being set to 0, which results in buffer being &[values_read..values_read + values_to_read] -> empty. This goes into infinite loop because decoding for empty buffer returns 0, and loop never progresses.

Fix for the issue is setting values_to_read explicitly, when field is required (assertion is optional):

diff --git a/src/column/reader.rs b/src/column/reader.rs
index 9d36918..2a230e8 100644
--- a/src/column/reader.rs
+++ b/src/column/reader.rs
@@ -154,6 +154,8 @@ impl<'a, T: DataType> ColumnReaderImpl<'a, T> where T: 'static {
             }
           }
         }
+      } else {
+        values_to_read = batch_size;
       }

       if self.descr.max_rep_level() > 0 && rep_levels.is_some() {
diff --git a/src/encodings/decoding.rs b/src/encodings/decoding.rs
index cf98276..a76ace8 100644
--- a/src/encodings/decoding.rs
+++ b/src/encodings/decoding.rs
@@ -127,6 +127,7 @@ default impl<T: DataType> Decoder<T> for PlainDecoder<T> {
   #[inline]
   fn get(&mut self, buffer: &mut [T::T]) -> Result<usize> {
     assert!(self.data.is_some());
+    assert!(buffer.len() != 0);

     let data = self.data.as_mut().unwrap();
     let num_values = cmp::min(buffer.len(), self.num_values);

Here is schema information for which it fails.

version: 1
num of rows: 4
created by: parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
message spark_schema {
  REQUIRED INT32 a;
  REQUIRED INT64 b;
  OPTIONAL BYTE_ARRAY c;
  REQUIRED group d {
    REQUIRED INT32 a;
    REQUIRED INT64 b;
    OPTIONAL BYTE_ARRAY c;
  }
  REQUIRED group e (LIST) {
    REPEATED group list {
      REQUIRED INT32 element;
    }
  }
}
sadikovi commented 7 years ago

@sunchao please review and advice on next steps:)

sadikovi commented 7 years ago

Attached Parquet file that I am using to test this bug, and running my example.rs code.

spark.snappy.parquet.zip

sunchao commented 7 years ago

@sadikovi Yes you're right. Please file a PR with a test case for this. Thanks :)