Closed gnieto closed 5 years ago
Interesting. Let me check the file.
Well, this is a funny problem. Our reader fails to compare number of column fields with the same value in row group - they should match obviously, except they don't. The reason is Thrift schema element sets number of children for the primitive type, see below:
SchemaElement { type_: None, type_length: None, repetition_type: Some(REQUIRED), name: "schema", num_children: Some(6), converted_type: None, scale: None, precision: None, field_id: None, logical_type: None },
SchemaElement { type_: Some(BYTE_ARRAY), type_length: None, repetition_type: Some(OPTIONAL), name: "id", num_children: Some(0), converted_type: Some(UTF8), scale: None, precision: None, field_id: None, logical_type: None },
SchemaElement { type_: Some(BYTE_ARRAY), type_length: None, repetition_type: Some(OPTIONAL), name: "name", num_children: Some(0), converted_type: Some(UTF8), scale: None, precision: None, field_id: None, logical_type: None },
SchemaElement { type_: Some(BYTE_ARRAY), type_length: None, repetition_type: Some(OPTIONAL), name: "message", num_children: Some(0), converted_type: Some(UTF8), scale: None, precision: None, field_id: None, logical_type: None },
SchemaElement { type_: Some(INT32), type_length: None, repetition_type: Some(OPTIONAL), name: "type", num_children: Some(0), converted_type: Some(UINT_8), scale: None, precision: None, field_id: None, logical_type: None },
SchemaElement { type_: Some(INT64), type_length: None, repetition_type: Some(OPTIONAL), name: "author_time", num_children: Some(0), converted_type: Some(TIMESTAMP_MILLIS), scale: None, precision: None, field_id: None, logical_type: None },
SchemaElement { type_: Some(INT64), type_length: None, repetition_type: Some(OPTIONAL), name: "__index_level_0__", num_children: Some(0), converted_type: None, scale: None, precision: None, field_id: None, logical_type: None }
Which results in the following schema, notice that there are no primitive types:
GroupType {
basic_info: BasicTypeInfo { name: "schema", repetition: Some(REQUIRED), logical_type: NONE, id: None },
fields: [
GroupType {
basic_info: BasicTypeInfo { name: "id", repetition: Some(OPTIONAL), logical_type: UTF8, id: None },
fields: []
},
GroupType {
basic_info: BasicTypeInfo { name: "name", repetition: Some(OPTIONAL), logical_type: UTF8, id: None },
fields: []
},
GroupType {
basic_info: BasicTypeInfo { name: "message", repetition: Some(OPTIONAL), logical_type: UTF8, id: None },
fields: []
},
GroupType {
basic_info: BasicTypeInfo { name: "type", repetition: Some(OPTIONAL), logical_type: UINT_8, id: None },
fields: []
},
GroupType {
basic_info: BasicTypeInfo { name: "author_time", repetition: Some(OPTIONAL), logical_type: TIMESTAMP_MILLIS, id: None },
fields: []
},
GroupType {
basic_info: BasicTypeInfo { name: "__index_level_0__", repetition: Some(OPTIONAL), logical_type: NONE, id: None },
fields: []
}
]
}
The reason this is happening is num_children: Some(0)
part in SchemaElement
, our code confirms to the Thrift definition and thinks that this is group type. Strictly speaking, our code does the right thing, see the definition of the field below:
/** Nested fields. Since thrift does not support nested fields,
* the nesting is flattened to a single list by a depth-first traversal.
* The children count is used to construct the nested relationship.
* This field is not set when the element is a primitive type
*/
5: optional i32 num_children;
But in this file, it is set to 0! Again, it is a very simple fix:
diff --git a/src/schema/types.rs b/src/schema/types.rs
index 8bc6d64..e3f7fa3 100644
--- a/src/schema/types.rs
+++ b/src/schema/types.rs
@@ -828,7 +828,7 @@ fn from_thrift_helper(
let logical_type = LogicalType::from(elements[index].converted_type);
let field_id = elements[index].field_id;
match elements[index].num_children {
- None => {
+ None | Some(0) => {
// primitive type
if elements[index].repetition_type.is_none() {
return Err(general_err!(
It looks like parquet-cpp 1.3.2 that was used to write the file actually violates the Thrift definition! Here is the actual metadata of the file:
version: 1
num of rows: 172
created by: parquet-cpp version 1.3.2-SNAPSHOT
REQUIRED group schema {
OPTIONAL BYTE_ARRAY id (UTF8);
OPTIONAL BYTE_ARRAY name (UTF8);
OPTIONAL BYTE_ARRAY message (UTF8);
OPTIONAL INT32 type (UINT_8);
OPTIONAL INT64 author_time (TIMESTAMP_MILLIS);
OPTIONAL INT64 __index_level_0__;
}
By the way, we would not be able to read the file with parquet-read
because we do not have display of INT32(UINT_8)
fields yet, so we would have to add that as well.
ping @sunchao
Thanks for reporting @gnieto! Let me know if you would like to fix this problem(s), otherwise, I will open a PR.
Hmm this is interesting finding. Thanks for identifying the issue so quickly @sadikovi ! Yes we should fix it as well as support the extra types. We can also use the files in antirez-redis for testing purpose.
Thanks! Yes, I will open PRs today or tomorrow. Do you mean we should add those files to the repository? I was thinking if we could maintain a separate repository with all of the test files we plan to use and run parquet-schema and parquet-read on them to make sure we did not break anything. What do you think?
No I meant to test them manually. Yes we can have a separate repo just for the test files, as long as it is convenient to pull from parquet-rs
and test. Do you want to move all the files under data/
to there too?
No, I do not want to move those files into a separate repo - that was just an idea, maybe do it in the future. Yes, I will check other files, see if we can read those.
It looks like there is more than one problem with the file. First is the one with num_children, the second is the root message type has a repetition of Some(REQUIRED)
.
From Thrift definition:
/** repetition of the field. The root of the schema does not have a repetition_type.
* All other nodes must have one */
3: optional FieldRepetitionType repetition_type;
But it does have in this file. @sunchao I am happy to patch that as well? I am not sure why parquet-cpp is different. I will open PR in a couple of days.
It looks like our schema Thrift deserialisation code is not as robust as parquet-cpp. I will work on that.
This is interesting. Seems the above definition already exist since parquet-format 1.0.0, which is 5 years ago.. not quite sure why parquet-cpp is different and whether parquet-mr also does the same thing.
Thanks for working on this @sadikovi !
It is all good. I sent an email on dev list with these questions, I will try patching it meanwhile. On Thu, 1 Nov 2018 at 6:05 PM, Chao Sun notifications@github.com wrote:
This is interesting. Seems the above definition already exist since parquet-format 1.0.0 https://github.com/apache/parquet-format/blob/parquet-format-1.0.0/src/thrift/parquet.thrift, which is 5 years ago.. not quite sure why parquet-cpp is different and whether parquet-mr also does the same thing.
Thanks for working on this @sadikovi https://github.com/sadikovi !
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/sunchao/parquet-rs/issues/178#issuecomment-435107444, or mute the thread https://github.com/notifications/unsubscribe-auth/AHbY3gVRol0GK0wLDODhHPwpknvXckInks5uqynSgaJpZM4YDBXb .
I tried reading files from the s3 bucket. One file can't be read due to the issue with Int96, looks like our code things the value is invalid, but it could be our conversion. I will have a closer look.
Relates to #148 in a sense there are issues with Int96.
@sadikovi , @gnieto : let me know if this can be closed now. :)
Well, we can close it, it will work with the example file. But there is another file timestamps.parquet which has Int96 values that we can’t convert to dates to print.
I suggest we close it, and open another issue if needed. On Wed, 7 Nov 2018 at 7:54 PM, Chao Sun notifications@github.com wrote:
@sadikovi https://github.com/sadikovi , @gnieto https://github.com/gnieto : let me know if this can be closed now. :)
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/sunchao/parquet-rs/issues/178#issuecomment-436736943, or mute the thread https://github.com/notifications/unsubscribe-auth/AHbY3igxPoe-qosdvvoZVwKUenPJGaJVks5usyxxgaJpZM4YDBXb .
It seems solved on the last version
But there is another file timestamps.parquet which has Int96 values that we can’t convert to dates to print.
Is that related to the dates before 1970? we can open another issue for that.
Yes, it is. It is the same problem as Int96(0, 0, 0). On Wed, 7 Nov 2018 at 8:02 PM, Chao Sun notifications@github.com wrote:
But there is another file timestamps.parquet which has Int96 values that we can’t convert to dates to print.
Is that related to the dates before 1970? we can open another issue for that.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/sunchao/parquet-rs/issues/178#issuecomment-436739609, or mute the thread https://github.com/notifications/unsubscribe-auth/AHbY3gPGY9tasdW1ImqIJtLHjvht0jTAks5usy5RgaJpZM4YDBXb .
I downloaded some sample files from https://github.com/gitential/datasets/tree/master/antirez-redis and I'm not able to load the schema or read the file.
Example file: https://s3.amazonaws.com/gitential-datasets/antirez-redis/tags.parquet Branch: master Command: RUST_BACKTRACE=1 cargo run --bin parquet-schema -- tags.parquet Output: