sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0
149 stars 20 forks

Fails to read INT96[0, 0, 0] #148

Closed sadikovi closed 5 years ago

sadikovi commented 6 years ago

Following the discussion on the PR, we found that the code fails to read Int96[0, 0, 0]. It is quite an edge case, because 1 January 1970 would correspond to something like Int96[0, 0, 2440588].

This is the result from Spark when reading Int96[0, 0, 0], Int96[0, 0, 1], and Int96[1, 0, 0]:

+-------------------+
|                  a|
+-------------------+
|4713-01-01 01:00:00|
|4713-01-02 01:00:00|
|4713-01-01 01:00:00|
+-------------------+

Milliseconds:

Array(-210866803200000, -210866716800000, -210866803200000)

I tried patching the code; the following works and returns exactly the same milliseconds as Spark:

pub fn convert_int96(_descr: &ColumnDescPtr, value: Int96) -> Self {
  const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
  const SECONDS_PER_DAY: i64 = 86_400;
  const MILLIS_PER_SECOND: i64 = 1_000;

  // Int96 layout: words 0 and 1 hold the nanoseconds of the day, word 2 the Julian day.
  let day = value.data()[2] as i64;
  let nanoseconds = ((value.data()[1] as i64) << 32) + value.data()[0] as i64;
  let seconds = (day - JULIAN_DAY_OF_EPOCH) * SECONDS_PER_DAY;
  // 1 millisecond = 1_000_000 nanoseconds.
  let millis = seconds * MILLIS_PER_SECOND + nanoseconds / 1_000_000;

  Field::Timestamp(millis)
}
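
To check the arithmetic outside of parquet-rs, here is a minimal standalone sketch. It assumes the Int96 layout implied by the Spark results above (words 0 and 1 hold the nanoseconds of the day, low word first; word 2 holds the Julian day) and takes a plain `[u32; 3]` instead of the crate's `Int96` type. The function name `int96_to_millis` is made up for illustration and is not part of parquet-rs.

```rust
/// Julian day number of 1 January 1970 (the Unix epoch).
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
const SECONDS_PER_DAY: i64 = 86_400;
const MILLIS_PER_SECOND: i64 = 1_000;
const NANOS_PER_MILLI: i64 = 1_000_000;

/// Hypothetical helper: converts raw Int96 words to milliseconds
/// since the Unix epoch. Not the parquet-rs API.
fn int96_to_millis(data: [u32; 3]) -> i64 {
    // Word 2 is the Julian day; words 1 (high) and 0 (low) are nanoseconds of day.
    let day = i64::from(data[2]);
    let nanoseconds = (i64::from(data[1]) << 32) + i64::from(data[0]);
    let seconds = (day - JULIAN_DAY_OF_EPOCH) * SECONDS_PER_DAY;
    seconds * MILLIS_PER_SECOND + nanoseconds / NANOS_PER_MILLI
}

fn main() {
    for data in [[0, 0, 0], [0, 0, 1], [1, 0, 0], [0, 0, 2_440_588]] {
        println!("{:?} -> {}", data, int96_to_millis(data));
    }
}
```

The first three inputs reproduce the millisecond values reported by Spark above, and the last one (the Julian day of the epoch) gives 0.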

But when converting to a human-readable date, I get the following:

{a: 1970-01-01 01:00:00 +01:00}
{a: 1970-01-01 01:00:00 +01:00}
{a: 1970-01-01 01:00:00 +01:00}

It looks like the chrono library only supports dates after 1 January 1970. I attached a sample file (in an archive) with a single Int96 column.
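
Independently of what the date library supports, pre-1970 timestamps are negative, so the millisecond value needs to be split into whole seconds and a sub-second remainder with floor semantics before it is handed to any date API. Below is a minimal sketch of that split using only the standard library. The helper name `split_millis` is hypothetical, and this is not how parquet-rs or chrono do it internally.

```rust
/// Hypothetical helper: splits milliseconds since the Unix epoch into
/// (whole seconds, nanoseconds within that second). Euclidean division
/// rounds toward negative infinity for a positive divisor, so the
/// nanosecond part is never negative, even for pre-1970 values.
fn split_millis(millis: i64) -> (i64, u32) {
    let seconds = millis.div_euclid(1_000);
    let nanos = millis.rem_euclid(1_000) as u32 * 1_000_000;
    (seconds, nanos)
}

fn main() {
    println!("{:?}", split_millis(-210_866_803_200_000));
    println!("{:?}", split_millis(-1));
    println!("{:?}", split_millis(1_500));
}
```

With plain `/` and `%`, a value such as -1 ms would split into 0 seconds and a negative remainder, which most date constructors reject or misinterpret.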

sample.parquet.zip

sadikovi commented 6 years ago

@sunchao would you like to comment? We can fix it, but the code would return a different result compared to Spark.

sadikovi commented 6 years ago

By the way, the file was written using the WIP parquet-rs write support!

sunchao commented 5 years ago

@sadikovi : can we close this issue? I believe this is largely resolved by #184?

sadikovi commented 5 years ago

Kind of. Yes, we can close it.