sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0

Panic when reading parquet files #178

Closed gnieto closed 5 years ago

gnieto commented 5 years ago

I downloaded some sample files from https://github.com/gitential/datasets/tree/master/antirez-redis and I'm not able to load the schema or read the file.

Example file: https://s3.amazonaws.com/gitential-datasets/antirez-redis/tags.parquet
Branch: master
Command: RUST_BACKTRACE=1 cargo run --bin parquet-schema -- tags.parquet
Output:

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `6`', src/file/metadata.rs:234:5
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::print
             at libstd/sys_common/backtrace.rs:71
             at libstd/sys_common/backtrace.rs:59
   2: std::panicking::default_hook::{{closure}}
             at libstd/panicking.rs:211
   3: std::panicking::default_hook
             at libstd/panicking.rs:227
   4: std::panicking::rust_panic_with_hook
             at libstd/panicking.rs:476
   5: std::panicking::continue_panic_fmt
             at libstd/panicking.rs:390
   6: std::panicking::begin_panic_fmt
             at libstd/panicking.rs:345
   7: parquet::file::metadata::RowGroupMetaData::from_thrift
             at src/file/metadata.rs:234
   8: <parquet::file::reader::SerializedFileReader<R>>::parse_metadata
             at src/file/reader.rs:197
   9: <parquet::file::reader::SerializedFileReader<R>>::new
             at src/file/reader.rs:150
  10: parquet_schema::main
             at src/bin/parquet-schema.rs:85
  11: std::rt::lang_start::{{closure}}
             at libstd/rt.rs:74
  12: std::panicking::try::do_call
             at libstd/rt.rs:59
             at libstd/panicking.rs:310
  13: __rust_maybe_catch_panic
             at libpanic_unwind/lib.rs:102
  14: std::rt::lang_start_internal
             at libstd/panicking.rs:289
             at libstd/panic.rs:392
             at libstd/rt.rs:58
  15: std::rt::lang_start
             at libstd/rt.rs:74
  16: main
  17: __libc_start_main
  18: _start
sadikovi commented 5 years ago

Interesting. Let me check the file.

sadikovi commented 5 years ago

Well, this is a funny problem. Our reader asserts that the number of columns in the schema matches the number of columns in the row group - they should obviously match, except here they don't. The reason is that the Thrift schema elements set the number of children for primitive types, see below:

SchemaElement { type_: None, type_length: None, repetition_type: Some(REQUIRED), name: "schema", num_children: Some(6), converted_type: None, scale: None, precision: None, field_id: None, logical_type: None }, 
SchemaElement { type_: Some(BYTE_ARRAY), type_length: None, repetition_type: Some(OPTIONAL), name: "id", num_children: Some(0), converted_type: Some(UTF8), scale: None, precision: None, field_id: None, logical_type: None }, 
SchemaElement { type_: Some(BYTE_ARRAY), type_length: None, repetition_type: Some(OPTIONAL), name: "name", num_children: Some(0), converted_type: Some(UTF8), scale: None, precision: None, field_id: None, logical_type: None }, 
SchemaElement { type_: Some(BYTE_ARRAY), type_length: None, repetition_type: Some(OPTIONAL), name: "message", num_children: Some(0), converted_type: Some(UTF8), scale: None, precision: None, field_id: None, logical_type: None }, 
SchemaElement { type_: Some(INT32), type_length: None, repetition_type: Some(OPTIONAL), name: "type", num_children: Some(0), converted_type: Some(UINT_8), scale: None, precision: None, field_id: None, logical_type: None }, 
SchemaElement { type_: Some(INT64), type_length: None, repetition_type: Some(OPTIONAL), name: "author_time", num_children: Some(0), converted_type: Some(TIMESTAMP_MILLIS), scale: None, precision: None, field_id: None, logical_type: None }, 
SchemaElement { type_: Some(INT64), type_length: None, repetition_type: Some(OPTIONAL), name: "__index_level_0__", num_children: Some(0), converted_type: None, scale: None, precision: None, field_id: None, logical_type: None }

This results in the following schema; notice that there are no primitive types:

GroupType { 
  basic_info: BasicTypeInfo { name: "schema", repetition: Some(REQUIRED), logical_type: NONE, id: None }, 
  fields: [
    GroupType { 
      basic_info: BasicTypeInfo { name: "id", repetition: Some(OPTIONAL), logical_type: UTF8, id: None }, 
      fields: [] 
    }, 
    GroupType { 
      basic_info: BasicTypeInfo { name: "name", repetition: Some(OPTIONAL), logical_type: UTF8, id: None }, 
      fields: [] 
    }, 
    GroupType { 
      basic_info: BasicTypeInfo { name: "message", repetition: Some(OPTIONAL), logical_type: UTF8, id: None }, 
      fields: [] 
    }, 
    GroupType { 
      basic_info: BasicTypeInfo { name: "type", repetition: Some(OPTIONAL), logical_type: UINT_8, id: None }, 
      fields: []
    }, 
    GroupType { 
      basic_info: BasicTypeInfo { name: "author_time", repetition: Some(OPTIONAL), logical_type: TIMESTAMP_MILLIS, id: None }, 
      fields: [] 
    }, 
    GroupType { 
      basic_info: BasicTypeInfo { name: "__index_level_0__", repetition: Some(OPTIONAL), logical_type: NONE, id: None }, 
      fields: [] 
    }
  ] 
}

The reason this is happening is the num_children: Some(0) part in SchemaElement: our code conforms to the Thrift definition and therefore treats such an element as a group type. Strictly speaking, our code does the right thing; see the definition of the field below:

/** Nested fields.  Since thrift does not support nested fields,
   * the nesting is flattened to a single list by a depth-first traversal.
   * The children count is used to construct the nested relationship.
   * This field is not set when the element is a primitive type
   */
  5: optional i32 num_children;

But in this file, it is set to 0! Again, it is a very simple fix:

diff --git a/src/schema/types.rs b/src/schema/types.rs
index 8bc6d64..e3f7fa3 100644
--- a/src/schema/types.rs
+++ b/src/schema/types.rs
@@ -828,7 +828,7 @@ fn from_thrift_helper(
   let logical_type = LogicalType::from(elements[index].converted_type);
   let field_id = elements[index].field_id;
   match elements[index].num_children {
-    None => {
+    None | Some(0) => {
       // primitive type
       if elements[index].repetition_type.is_none() {
         return Err(general_err!(
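
For illustration, here is a minimal, self-contained sketch of the decision this patch changes. It is a hypothetical stand-in, not the actual from_thrift_helper code: an element is treated as a group only when num_children is present and greater than zero, so writers that emit num_children: Some(0) for primitives are still handled.

// Hypothetical stand-in for the group/primitive decision; the function
// name is illustrative and not taken from src/schema/types.rs.
fn is_group_type(num_children: Option<i32>) -> bool {
    match num_children {
        // A real group always has at least one child.
        Some(n) if n > 0 => true,
        // None (per the Thrift spec) or Some(0) (as written by
        // parquet-cpp 1.3.2) both indicate a primitive type.
        _ => false,
    }
}

fn main() {
    assert!(!is_group_type(None));    // primitive, per the spec
    assert!(!is_group_type(Some(0))); // primitive, as in this file
    assert!(is_group_type(Some(6)));  // the root "schema" group
}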

It looks like parquet-cpp 1.3.2, which was used to write the file, actually violates the Thrift definition! Here is the actual metadata of the file:

version: 1
num of rows: 172
created by: parquet-cpp version 1.3.2-SNAPSHOT
REQUIRED group schema {
  OPTIONAL BYTE_ARRAY id (UTF8);
  OPTIONAL BYTE_ARRAY name (UTF8);
  OPTIONAL BYTE_ARRAY message (UTF8);
  OPTIONAL INT32 type (UINT_8);
  OPTIONAL INT64 author_time (TIMESTAMP_MILLIS);
  OPTIONAL INT64 __index_level_0__;
}

By the way, we would not be able to read the file with parquet-read because we do not have display support for INT32 (UINT_8) fields yet, so we would have to add that as well.
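
For reference, here is a minimal sketch of what such display support could look like; the helper name is hypothetical and this is not the parquet-read code. UINT_8 is only a converted type, so the physical value is an INT32 and printing it is mostly a matter of interpreting the i32 as an unsigned 8-bit value.

// Hypothetical helper: format a physical INT32 value whose converted
// type is UINT_8. The cast assumes the writer kept values in 0..=255.
fn format_uint8(physical: i32) -> String {
    (physical as u8).to_string()
}

fn main() {
    assert_eq!(format_uint8(2), "2"); // e.g. a value of the "type" column
    assert_eq!(format_uint8(255), "255");
}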

ping @sunchao

sadikovi commented 5 years ago

Thanks for reporting, @gnieto! Let me know if you would like to fix these problems; otherwise, I will open a PR.

sunchao commented 5 years ago

Hmm, this is an interesting finding. Thanks for identifying the issue so quickly, @sadikovi! Yes, we should fix it as well as support the extra types. We can also use the files in antirez-redis for testing purposes.

sadikovi commented 5 years ago

Thanks! Yes, I will open PRs today or tomorrow. Do you mean we should add those files to the repository? I was thinking we could maintain a separate repository with all of the test files we plan to use, and run parquet-schema and parquet-read on them to make sure we did not break anything. What do you think?

sunchao commented 5 years ago

No, I meant to test them manually. Yes, we can have a separate repo just for the test files, as long as it is convenient to pull them from parquet-rs and test. Do you want to move all the files under data/ there too?

sadikovi commented 5 years ago

No, I do not want to move those files into a separate repo - that was just an idea; maybe we can do it in the future. Yes, I will check the other files and see if we can read those.

sadikovi commented 5 years ago

It looks like there is more than one problem with the file. The first is the one with num_children; the second is that the root message type has a repetition of Some(REQUIRED).

From Thrift definition:

/** repetition of the field. The root of the schema does not have a repetition_type.
   * All other nodes must have one */
  3: optional FieldRepetitionType repetition_type;

But it does have one in this file. @sunchao, I am happy to patch that as well. I am not sure why parquet-cpp is different. I will open a PR in a couple of days.
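
A minimal sketch of the more tolerant behaviour, assuming we simply ignore whatever repetition the writer set on the root; this is hypothetical validation logic, not the parquet-rs implementation.

// Hypothetical validation sketch: only non-root elements must carry a
// repetition_type; anything set on the root is ignored rather than
// treated as an error.
#[allow(dead_code)]
#[derive(Debug)]
enum Repetition { Required, Optional, Repeated }

fn check_repetition(is_root: bool, repetition: Option<Repetition>) -> Result<(), String> {
    if !is_root && repetition.is_none() {
        return Err("non-root schema element must have a repetition_type".to_string());
    }
    // Root element: tolerate Some(REQUIRED) written by parquet-cpp 1.3.2.
    Ok(())
}

fn main() {
    assert!(check_repetition(true, Some(Repetition::Required)).is_ok()); // tolerated
    assert!(check_repetition(true, None).is_ok());                       // per the spec
    assert!(check_repetition(false, None).is_err());                     // invalid
}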

sadikovi commented 5 years ago

It looks like our schema Thrift deserialisation code is not as robust as parquet-cpp's. I will work on that.

sunchao commented 5 years ago

This is interesting. It seems the above definition has existed since parquet-format 1.0.0 (https://github.com/apache/parquet-format/blob/parquet-format-1.0.0/src/thrift/parquet.thrift), which was 5 years ago... I am not quite sure why parquet-cpp is different and whether parquet-mr also does the same thing.

Thanks for working on this, @sadikovi!

sadikovi commented 5 years ago

It is all good. I sent an email to the dev list with these questions; I will try patching it in the meantime.

sadikovi commented 5 years ago

I tried reading files from the s3 bucket. One file can't be read due to an issue with Int96; it looks like our code thinks the value is invalid, but it could be our conversion. I will have a closer look.

Relates to #148 in the sense that there are issues with Int96.
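
For context, here is a hedged sketch of the usual Int96 timestamp arithmetic. The assumed layout (two little-endian words of nanoseconds within the day plus a Julian day number) is general Parquet knowledge, and the code below is illustrative, not the parquet-rs conversion. It shows why a value like Int96(0, 0, 0), i.e. Julian day 0, becomes a hugely negative pre-1970 timestamp that a conversion expecting non-negative values would reject.

// Illustrative Int96 -> Unix milliseconds conversion (not parquet-rs code).
// Assumed layout: words w0 and w1 form the nanoseconds within the day
// (little-endian), w2 is the Julian day number.
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588; // Julian day of 1970-01-01
const MILLIS_PER_DAY: i64 = 86_400_000;

fn int96_to_unix_millis(w0: u32, w1: u32, w2: u32) -> i64 {
    let nanos_of_day = (((w1 as u64) << 32) | w0 as u64) as i64;
    let julian_day = w2 as i64;
    (julian_day - JULIAN_DAY_OF_EPOCH) * MILLIS_PER_DAY + nanos_of_day / 1_000_000
}

fn main() {
    // Int96(0, 0, 0): Julian day 0 is thousands of years before 1970,
    // so the result is a large negative number of milliseconds.
    println!("{}", int96_to_unix_millis(0, 0, 0)); // -210866803200000
}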

sunchao commented 5 years ago

@sadikovi, @gnieto: let me know if this can be closed now. :)

sadikovi commented 5 years ago

Well, we can close it; it will work with the example file. But there is another file, timestamps.parquet, which has Int96 values that we can't convert to dates to print.

I suggest we close it, and open another issue if needed.

gnieto commented 5 years ago

It seems solved in the latest version.

sunchao commented 5 years ago

But there is another file timestamps.parquet which has Int96 values that we can’t convert to dates to print.

Is that related to the dates before 1970? We can open another issue for that.

sadikovi commented 5 years ago

Yes, it is. It is the same problem as Int96(0, 0, 0).