sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0

Message Type parser #15

Closed sadikovi closed 7 years ago

sadikovi commented 7 years ago

This PR introduces a schema parser that converts the string representation of a message type into an instance of parquet::schema::types::Type.

API:

Usage is below:

extern crate parquet;

use parquet::schema::parser::parse_message_type;

fn main() {
  let schema = "
  message schema {
    required int32 a;
    required int64 b;
    optional binary c (UTF8);
    required group d {
      required int32 a;
      required int64 b;
      optional binary c (UTF8);
    }
    required group e (LIST) {
      repeated group list {
        required int32 element;
      }
    }
  }
  ";
  println!("{}", schema);
  println!("{:?}", parse_message_type(schema).unwrap());
}

The parser is intended as the counterpart to the Printer.

I also had to implement the FromStr trait for physical types, logical types, and repetition to reduce the amount of code needed for parsing.
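As an illustration of why FromStr helps here, below is a minimal sketch of such an implementation for repetition. The `Repetition` enum in this block is a hypothetical stand-in defined locally, not the crate's actual type, and the case-insensitive matching and the `String` error type are assumptions of the sketch.

```rust
use std::str::FromStr;

/// Hypothetical stand-in for the crate's repetition enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Repetition {
    Required,
    Optional,
    Repeated,
}

impl FromStr for Repetition {
    // Assumption: a plain String error keeps the sketch self-contained.
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Assumption: keywords are matched case-insensitively.
        match s.to_ascii_uppercase().as_str() {
            "REQUIRED" => Ok(Repetition::Required),
            "OPTIONAL" => Ok(Repetition::Optional),
            "REPEATED" => Ok(Repetition::Repeated),
            other => Err(format!("invalid repetition: {}", other)),
        }
    }
}
```

With this in place, a parser can turn a token into a typed value with `token.parse::<Repetition>()` and propagate the error, instead of repeating the match at every call site.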

Added unit tests and also tested manually on some nested and plain schemas.

sadikovi commented 7 years ago

@sunchao this is something I thought would be useful to have. Could you review this pull request? Thanks!

coveralls commented 7 years ago

Coverage increased (+1.3%) to 85.528% when pulling fb78438b49259f4013b5b536f116673bfd968159 on sadikovi:schema-parser into 0d09371de45d958838fe41380af11c7eca49bb85 on sunchao:master.

sunchao commented 7 years ago

Thanks @sadikovi. This PR looks great! At a high level, I wonder whether there is a standard format for the Parquet message string that this parser expects?

sadikovi commented 7 years ago

@sunchao thank you for taking a look at this PR.

I do not know if there is an explicit standard. I think it is okay as long as the string is in a JSON-like format that this code can parse. It is modelled after MessageTypeParser in parquet-mr.

The reason I added it is that I was planning to take a schema as a string, parse it, and then check whether the parsed schema is a subtype of the full Parquet schema.
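The subtype check mentioned above could look roughly like the sketch below. It is not code from this PR: `Node`, `primitive`, `group`, and `is_subtype` are hypothetical names, the schema model is deliberately simplified (no repetition or logical types), and "subtype" is assumed to mean that every requested field exists in the full schema under the same name with the same physical type.

```rust
/// Hypothetical, simplified schema node; the crate's real `Type` carries more information.
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Primitive { name: String, physical: String },
    Group { name: String, fields: Vec<Node> },
}

impl Node {
    /// Field name, regardless of the node kind.
    fn name(&self) -> &str {
        match self {
            Node::Primitive { name, .. } => name.as_str(),
            Node::Group { name, .. } => name.as_str(),
        }
    }
}

/// Convenience constructor for a primitive field (hypothetical helper).
pub fn primitive(name: &str, physical: &str) -> Node {
    Node::Primitive { name: name.to_string(), physical: physical.to_string() }
}

/// Convenience constructor for a group field (hypothetical helper).
pub fn group(name: &str, fields: Vec<Node>) -> Node {
    Node::Group { name: name.to_string(), fields }
}

/// Returns true if `sub` selects only fields that exist in `full`.
/// Assumption: names must match exactly, including the root message name.
pub fn is_subtype(sub: &Node, full: &Node) -> bool {
    match (sub, full) {
        (
            Node::Primitive { name: a, physical: pa },
            Node::Primitive { name: b, physical: pb },
        ) => a == b && pa == pb,
        (
            Node::Group { name: a, fields: sub_fields },
            Node::Group { name: b, fields: full_fields },
        ) => {
            // Every requested child must match some child of the full group.
            a == b
                && sub_fields.iter().all(|sf| {
                    full_fields
                        .iter()
                        .any(|ff| ff.name() == sf.name() && is_subtype(sf, ff))
                })
        }
        // A primitive never matches a group, and vice versa.
        _ => false,
    }
}
```

In this sketch, a requested schema that keeps only a few columns of the full schema is accepted, while one that changes a column's physical type or asks for an unknown column is rejected.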

Are you aware of any standard for the schema string?

sunchao commented 7 years ago

@sadikovi I'm not aware of any standard for this, but it seems parquet-cpp and parquet-mr use the same format. I also noticed that the printer in parquet-rs has some differences. For example, it doesn't print the length for fixed_len_byte_array, which we may need to fix.
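For reference, the difference being discussed can be sketched as follows. This is not the parquet-rs printer: `PhysicalType`, `print_physical`, and `print_field` are hypothetical names for a simplified model, and the output format assumes the parquet-mr convention of writing the length in parentheses after the type name.

```rust
/// Hypothetical, simplified physical type; only the fixed-length variant carries a length.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PhysicalType {
    Int32,
    Int64,
    ByteArray,
    FixedLenByteArray(u32),
}

/// Renders the physical type as it would appear in a schema string.
pub fn print_physical(t: PhysicalType) -> String {
    match t {
        PhysicalType::Int32 => "int32".to_string(),
        PhysicalType::Int64 => "int64".to_string(),
        PhysicalType::ByteArray => "binary".to_string(),
        // The length must be printed, otherwise the output cannot be parsed back.
        PhysicalType::FixedLenByteArray(len) => format!("fixed_len_byte_array({})", len),
    }
}

/// Renders one primitive field line, e.g. "required int32 a;".
pub fn print_field(repetition: &str, t: PhysicalType, name: &str) -> String {
    format!("{} {} {};", repetition, print_physical(t), name)
}
```

Printing the length keeps the printer and the parser symmetric: a schema that is printed and then parsed again should describe the same type.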

sunchao commented 7 years ago

Patch looks good. Merged. Thanks @sadikovi!

sadikovi commented 7 years ago

Thanks for merging. I will have a look at the printer issue.