mukunku / ParquetViewer

Simple Windows desktop application for viewing & querying Apache Parquet files
GNU General Public License v3.0

[BUG] very low/high dates/timestamps (0001-01-01 and 9999-12-31 23:59:59.9999) cause problems #55

Closed keen85 closed 1 year ago

keen85 commented 1 year ago

First of all: I'm not sure whether this is a problem in ParquetViewer or in parquet-dotnet. Please let me know if I'm wrong here...

Parquet Viewer Version: v2.3.6

Where was the parquet file created? Apache Spark 3.1.2

Sample File: PySpark code for creating the file

import datetime

df = sc.parallelize([
    [
        1,
        datetime.date(1985, 12, 31),
        datetime.date(   1,  1,  2),
        datetime.date(9999, 12, 31),
        datetime.date.max,
        datetime.date.min,
        datetime.datetime(1985,  4, 13, 13,  5),
        datetime.datetime(   1,  1,  2,  0,  0),
        datetime.datetime(9999, 12, 31, 23, 59, 59),
        datetime.datetime.max,
        datetime.datetime.min
    ]
]).toDF((
    "ID",
    "Date_Normal",
    "Date_Low",
    "Date_High",
    "Date_Max",
    "Date_Min",
    "Timestamp_Normal",
    "Timestamp_Low",
    "Timestamp_High",
    "Timestamp_Max",
    "Timestamp_Min"
))

display(df)  # notebook helper (e.g. Databricks); use df.show() elsewhere

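# Write dates/timestamps as-is (proleptic Gregorian), without rebasing to the legacy hybrid calendar
spark.conf.set('spark.sql.legacy.parquet.int96RebaseModeInWrite', 'CORRECTED')
spark.conf.set('spark.sql.legacy.parquet.datetimeRebaseModeInWrite', 'CORRECTED')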

(df.coalesce(1)
  .write
  .mode('overwrite')
  .format('parquet')
  .save('tmp/spark_datetime/')
)

part-00000-f85e122f-806f-4375-91da-04de38bc0c9c-c000.snappy.parquet.zip

Describe the bug: When a Parquet file contains very low or very high date or timestamp values, ParquetViewer has trouble with them:

Screenshots: image
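For context, Parquet's DATE logical type stores a value as a signed 32-bit count of days since 1970-01-01. The sketch below shows what the extreme dates from the sample code look like in that encoding. It is only an illustration using Python's standard library: `parquet_date_days` is a made-up helper name, and nothing here reflects how ParquetViewer or parquet-dotnet actually decode the column.

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def parquet_date_days(d: datetime.date) -> int:
    """Return days since 1970-01-01 (hypothetical helper, for illustration)."""
    return (d - EPOCH).days


samples = {
    "Date_Normal": datetime.date(1985, 12, 31),
    "Date_Low": datetime.date(1, 1, 2),
    "Date_High": datetime.date(9999, 12, 31),
    "Date_Max": datetime.date.max,
    "Date_Min": datetime.date.min,
}

for name, value in samples.items():
    days = parquet_date_days(value)
    # Every value fits comfortably in an int32, so the storage side is not the limit.
    assert INT32_MIN <= days <= INT32_MAX
    print(f"{name:12} {value.isoformat()} -> {days} days since epoch")
```

The stored integers are well within int32 range, so any trouble with these values most likely arises when a reader converts them back into its own date type. Python's `date.min` and `date.max` (0001-01-01 and 9999-12-31) are also the day-level bounds of .NET's `DateTime`, which may be relevant here, although the thread does not confirm the root cause.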

Additional context: The problem might be related to https://issues.apache.org/jira/browse/SPARK-31404. Spark switched from a hybrid Julian/Gregorian calendar to the proleptic Gregorian calendar between Spark 2.4 and 3.0.
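To make the calendar difference concrete, here is a small sketch (not Spark's implementation) that uses the standard Julian Day Number formulas to measure how far the Julian and proleptic Gregorian calendars disagree about the same year-month-day label. All function names are made up for illustration.

```python
def _shift(year: int, month: int):
    """Move Jan/Feb to the end of the previous year, as the JDN formulas expect."""
    a = (14 - month) // 12
    return year + 4800 - a, month + 12 * a - 3


def gregorian_to_jdn(year: int, month: int, day: int) -> int:
    """Julian Day Number of a proleptic Gregorian calendar date."""
    y, m = _shift(year, month)
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - y // 100 + y // 400 - 32045


def julian_to_jdn(year: int, month: int, day: int) -> int:
    """Julian Day Number of a Julian calendar date."""
    y, m = _shift(year, month)
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def calendar_gap_days(year: int, month: int, day: int) -> int:
    """Days by which the Julian reading of a label is later than the Gregorian one."""
    return julian_to_jdn(year, month, day) - gregorian_to_jdn(year, month, day)


for label in [(1, 1, 1), (1582, 10, 5), (1985, 12, 31), (9999, 12, 31)]:
    print(label, calendar_gap_days(*label))
```

Near year 1 the two calendars disagree by two days, so a file written under one calendar and read under the other can show dates shifted by that amount. The sample code in this report sets both rebase modes to `CORRECTED`, which, as far as I understand the Spark documentation, writes values in the proleptic Gregorian calendar without rebasing.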

Note: This tool relies on the parquet-dotnet library for all the actual Parquet processing. So any issues where that library cannot process a parquet file will not be addressed by us. Please open a ticket on that library's repo to address such issues.

mukunku commented 1 year ago

Thanks for the detailed report and sample file. I made changes so that you won't see a blank value anymore for 0001-01-01. But I wasn't able to re-create the error you experienced with the 9999-12-31 dates. Can you try the latest beta release and see how it looks? https://github.com/mukunku/ParquetViewer/releases/tag/v2.3.7

keen85 commented 1 year ago

Hi @mukunku,

Awesome! I retested with the v2.3.7 pre-release and it all looks good to me 👍

image

Thanks a lot!