Open nevi-me opened 5 years ago
I would use TIMESTAMP_MILLIS for now, which is just INT64 with the corresponding logical type; it is probably the easiest to write.
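To illustrate what ends up in that INT64 column: a TIMESTAMP_MILLIS value is the number of milliseconds since the Unix epoch. Below is a minimal Python sketch of converting a CSV date string to that integer. The helper name `to_timestamp_millis` is hypothetical, and it assumes the input is ISO-8601 and that values without an offset are already in UTC; adjust if your CSV uses another format or time zone.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_timestamp_millis(text: str) -> int:
    """Return milliseconds since the Unix epoch for an ISO-8601 string.

    Hypothetical helper for illustration only.
    """
    dt = datetime.fromisoformat(text)
    if dt.tzinfo is None:
        # Assumption: values without an offset are already in UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    delta = dt - EPOCH
    # Integer arithmetic avoids the float rounding of dt.timestamp().
    return (delta.days * 86_400 + delta.seconds) * 1000 + delta.microseconds // 1000


print(to_timestamp_millis("2019-01-01T00:00:00"))  # 1546300800000
```

The resulting integer is what you would hand to the writer for an INT64 column annotated with TIMESTAMP_MILLIS.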
Thanks @sadikovi, I was confused by the UTC stuff on the timestamp logical type.
Writing a timestamp now works with message schema {REQUIRED INT64 MyField (TIMESTAMP_MILLIS)}, but I'm unable to read the parquet file back in Pandas or PySpark.
PySpark:
spark.read.parquet("file1.parquet").printSchema()
// this correctly shows the schema as below, but .show() throws an error
// printing schema
root
|-- Id: string (nullable = true)
|-- Name: string (nullable = true)
|-- Indicator: boolean (nullable = true)
|-- Timestamp: timestamp (nullable = true)
# trying to show records
Py4JJavaError: An error occurred while calling o62.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 1 times, most recent failure: Lost task 0.0 in stage 6.0 (TID 16, localhost, executor driver): org.apache.parquet.io.ParquetDecodingException: Dictionary encoding not supported for type: BOOLEAN
Pandas:
pd.read_parquet("file1.parquet")
ArrowIOError: Not yet implemented: Dictionary encoding is not implemented for boolean values.
I'm continuing with my adventures of writing csv to parquet, but I got stuck on how to write times/dates to parquet. Specifically, how do I declare the schema (assuming I'm using the text format message schema {})?
I read up on the logical types and their mapping to/from data types, so I tried using i64 for my schema, but I think I'm missing something because I don't know how to map the type to a TIMESTAMP.
I also tried Google to look for the format of the schema, but with no luck (for timestamps). Is there some place that documents this?