sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0

Writing Dates and Timestamps #188

Open nevi-me opened 5 years ago

nevi-me commented 5 years ago

I'm continuing my adventures in writing CSV to Parquet, but I'm stuck on how to write times and dates. Specifically, how do I declare the schema (assuming I'm using the text format, message schema {})?

I read up on the logical types and their mapping to/from data types, so I tried using i64 for my schema, but I think I'm missing something because I don't know how to map the type to a TIMESTAMP.

I also searched Google for the format of the schema, but had no luck for timestamps. Is this documented somewhere?

sadikovi commented 5 years ago

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#datetime-types

sadikovi commented 5 years ago

I would use TIMESTAMP_MILLIS for now. It is just INT64 with the corresponding logical type, and is probably the easiest to write.
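To make the INT64 mapping concrete: a TIMESTAMP_MILLIS value is the number of milliseconds since 1970-01-01T00:00:00 UTC, and a DATE value is the number of days since the same epoch, stored as INT32. Below is a minimal std-only sketch of that conversion for fields parsed out of a CSV row. The function names are hypothetical and not part of parquet-rs, and the sketch assumes the input is already in UTC.

```rust
/// Days from 1970-01-01 to the given proleptic Gregorian date.
/// Uses the well-known "days from civil" arithmetic (era/year-of-era split).
fn days_from_civil(year: i64, month: i64, day: i64) -> i64 {
    // Treat March as the first month so the leap day falls at year end.
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400); // year of era, 0..=399
    let mp = (month + 9) % 12; // March = 0, ..., February = 11
    let doy = (153 * mp + 2) / 5 + day - 1; // day of (shifted) year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    era * 146_097 + doe - 719_468
}

/// Hypothetical helper: value to write into an INT32 column annotated DATE.
fn date_days(year: i64, month: i64, day: i64) -> i32 {
    days_from_civil(year, month, day) as i32
}

/// Hypothetical helper: value to write into an INT64 column annotated
/// TIMESTAMP_MILLIS. Assumes the date-time fields are already in UTC.
fn timestamp_millis(
    year: i64,
    month: i64,
    day: i64,
    hour: i64,
    minute: i64,
    second: i64,
    millis: i64,
) -> i64 {
    days_from_civil(year, month, day) * 86_400_000
        + hour * 3_600_000
        + minute * 60_000
        + second * 1_000
        + millis
}
```

For example, timestamp_millis(2000, 1, 1, 0, 0, 0, 0) gives 946684800000 and date_days(2000, 1, 1) gives 10957. The resulting i64 values are what would go into the INT64 column writer.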

nevi-me commented 5 years ago

Thanks @sadikovi, I was confused by the UTC stuff on the timestamp logical type.

Writing a timestamp now works with message schema {REQUIRED INT64 MyField (TIMESTAMP_MILLIS)}, but I'm unable to read the Parquet file back in pandas or PySpark.
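For readers looking for the schema format itself, here is a rough sketch of how the one-field schema above might extend to several columns in the text format. The Column struct and both functions are hypothetical helpers, not parquet-rs types. The column names follow the schema printed by Spark in this comment, while the repetition levels and the UTF8 annotation are assumptions.

```rust
/// Hypothetical description of one column; not a parquet-rs type.
struct Column<'a> {
    repetition: &'a str,      // REQUIRED, OPTIONAL or REPEATED
    physical: &'a str,        // e.g. INT64, INT32, BOOLEAN, BYTE_ARRAY
    name: &'a str,
    logical: Option<&'a str>, // e.g. TIMESTAMP_MILLIS, DATE, UTF8
}

/// Renders the columns as a text-format message schema,
/// one semicolon-terminated field per line.
fn message_schema(columns: &[Column<'_>]) -> String {
    let mut out = String::from("message schema {\n");
    for c in columns {
        out.push_str(&format!("  {} {} {}", c.repetition, c.physical, c.name));
        if let Some(logical) = c.logical {
            out.push_str(&format!(" ({})", logical));
        }
        out.push_str(";\n");
    }
    out.push('}');
    out
}

/// Columns loosely modelled on the file in this thread.
/// Repetition levels and the UTF8 annotation are assumptions.
fn example_columns() -> Vec<Column<'static>> {
    vec![
        Column { repetition: "OPTIONAL", physical: "BYTE_ARRAY", name: "Id", logical: Some("UTF8") },
        Column { repetition: "OPTIONAL", physical: "BYTE_ARRAY", name: "Name", logical: Some("UTF8") },
        Column { repetition: "OPTIONAL", physical: "BOOLEAN", name: "Indicator", logical: None },
        Column { repetition: "REQUIRED", physical: "INT64", name: "Timestamp", logical: Some("TIMESTAMP_MILLIS") },
    ]
}
```

Calling message_schema(&example_columns()) produces a string that could then be handed to whatever schema parser is in use. A date-only column would be declared the same way, as INT32 with the DATE annotation.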

PySpark:

spark.read.parquet("file1.parquet").printSchema()
# This correctly prints the schema shown below, but .show() throws an error.
# printing schema
root
 |-- Id: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Indicator: boolean (nullable = true)
 |-- Timestamp: timestamp (nullable = true)
# trying to show records

Py4JJavaError: An error occurred while calling o62.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 1 times, most recent failure: Lost task 0.0 in stage 6.0 (TID 16, localhost, executor driver): org.apache.parquet.io.ParquetDecodingException: Dictionary encoding not supported for type: BOOLEAN 

Pandas:

pd.read_parquet("file1.parquet")

ArrowIOError: Not yet implemented: Dictionary encoding is not implemented for boolean values.