sunchao / parquet-rs

Apache Parquet implementation in Rust
Apache License 2.0
149 stars 20 forks source link

Convert FIXED_LEN_BYTE_ARRAY type with no logical type in record iterator #146

Closed sadikovi closed 6 years ago

sadikovi commented 6 years ago

This PR adds conversion for FIXED_LEN_BYTE_ARRAY when no logical type is specified. We convert it into Field::Bytes(byte_array).

Added test to verify conversion.

sadikovi commented 6 years ago

@sunchao can you review this PR?

I also found that INT96 values in 10k-v2.parquet file are [0, 0, 0]. I am not if this is a valid INT96, and our parsing currently does not handle those values and fails. Can you confirm that these values are valid/invalid?

coveralls commented 6 years ago

Pull Request Test Coverage Report for Build 585


Files with Coverage Reduction New Missed Lines %
record/api.rs 11 96.83%
<!-- Total: 11 -->
Totals Coverage Status
Change from base Build 580: 0.002%
Covered Lines: 11395
Relevant Lines: 11938

💛 - Coveralls
sunchao commented 6 years ago

I also found that INT96 values in 10k-v2.parquet file are [0, 0, 0]. I am not if this is a valid INT96, and our parsing currently does not handle those values and fails. Can you confirm that these values are valid/invalid?

I think it is valid since [0, 0, 0] represents int96 value 0, right? Is the failure relate to this PR?

sadikovi commented 6 years ago

It is not really related, but I found these limitations while trying to read 10k-v2.parquet file. It looks like [0,0,0] Int96 is a date long in past, for the default date of 1 Jan 1970 Int96 needs to have 2440588 as julian days. But it looks like [0, 0, 0] are allowed, since it is possible to write them. I will have a look.

sadikovi commented 6 years ago

I found that it overflows and wraps several times, and eventual time is int96_field: 1970-01-01 00:59:59 +01:00, with milliseconds being -7952618389194. The actual date of 1970-01-01-00:00:00 is Julian days = 2440588.

I can make it work with the following changes, let me know what we should do in this case. It would be interesting to know what Impala returns for INT96[0, 0, 0]:

diff --git a/src/record/api.rs b/src/record/api.rs
index 35eab7b..8caf29f 100644
--- a/src/record/api.rs
+++ b/src/record/api.rs
@@ -60,7 +60,7 @@ pub trait RowAccessor {
   fn get_long(&self, i: usize) -> Result<i64>;
   fn get_float(&self, i: usize) -> Result<f32>;
   fn get_double(&self, i: usize) -> Result<f64>;
-  fn get_timestamp(&self, i: usize) -> Result<u64>;
+  fn get_timestamp(&self, i: usize) -> Result<i64>;
   fn get_decimal(&self, i: usize) -> Result<&Decimal>;
   fn get_string(&self, i: usize) -> Result<&String>;
   fn get_bytes(&self, i: usize) -> Result<&ByteArray>;
@@ -105,7 +105,7 @@ impl RowAccessor for Row {
   row_primitive_accessor!(get_long, Long, i64);
   row_primitive_accessor!(get_float, Float, f32);
   row_primitive_accessor!(get_double, Double, f64);
-  row_primitive_accessor!(get_timestamp, Timestamp, u64);
+  row_primitive_accessor!(get_timestamp, Timestamp, i64);
   row_complex_accessor!(get_decimal, Decimal, Decimal);
   row_complex_accessor!(get_string, Str, String);
   row_complex_accessor!(get_bytes, Bytes, ByteArray);
@@ -165,7 +165,7 @@ pub trait ListAccessor {
   fn get_long(&self, i: usize) -> Result<i64>;
   fn get_float(&self, i: usize) -> Result<f32>;
   fn get_double(&self, i: usize) -> Result<f64>;
-  fn get_timestamp(&self, i: usize) -> Result<u64>;
+  fn get_timestamp(&self, i: usize) -> Result<i64>;
   fn get_decimal(&self, i: usize) -> Result<&Decimal>;
   fn get_string(&self, i: usize) -> Result<&String>;
   fn get_bytes(&self, i: usize) -> Result<&ByteArray>;
@@ -215,7 +215,7 @@ impl ListAccessor for List {
   list_primitive_accessor!(get_long, Long, i64);
   list_primitive_accessor!(get_float, Float, f32);
   list_primitive_accessor!(get_double, Double, f64);
-  list_primitive_accessor!(get_timestamp, Timestamp, u64);
+  list_primitive_accessor!(get_timestamp, Timestamp, i64);
   list_complex_accessor!(get_decimal, Decimal, Decimal);
   list_complex_accessor!(get_string, Str, String);
   list_complex_accessor!(get_bytes, Bytes, ByteArray);
@@ -278,7 +278,7 @@ impl<'a> ListAccessor for MapList<'a> {
   map_list_primitive_accessor!(get_long, Long, i64);
   map_list_primitive_accessor!(get_float, Float, f32);
   map_list_primitive_accessor!(get_double, Double, f64);
-  map_list_primitive_accessor!(get_timestamp, Timestamp, u64);
+  map_list_primitive_accessor!(get_timestamp, Timestamp, i64);
   list_complex_accessor!(get_decimal, Decimal, Decimal);
   list_complex_accessor!(get_string, Str, String);
   list_complex_accessor!(get_bytes, Bytes, ByteArray);
@@ -330,7 +330,7 @@ pub enum Field {
   /// Unix epoch, 1 January 1970.
   Date(u32),
   /// Milliseconds from the Unix epoch, 1 January 1970.
-  Timestamp(u64),
+  Timestamp(i64),

   // ----------------------------------------------------------------------
   // Complex types
@@ -422,13 +422,13 @@ impl Field {
   /// `Timestamp` value.
   #[inline]
   pub fn convert_int96(_descr: &ColumnDescPtr, value: Int96) -> Self {
-    const JULIAN_TO_UNIX_EPOCH_DAYS: u64 = 2_440_588;
-    const MILLI_SECONDS_IN_A_DAY: u64 = 86_400_000;
-    const NANO_SECONDS_IN_A_DAY: u64 = MILLI_SECONDS_IN_A_DAY * 1_000_000;
+    const JULIAN_TO_UNIX_EPOCH_DAYS: i64 = 2_440_588;
+    const MILLI_SECONDS_IN_A_DAY: i64 = 86_400_000;
+    const NANO_SECONDS_IN_A_DAY: i64 = MILLI_SECONDS_IN_A_DAY * 1_000_000;

-    let days_since_epoch = value.data()[2] as u64 - JULIAN_TO_UNIX_EPOCH_DAYS;
-    let nanoseconds: u64 = ((value.data()[1] as u64) << 32) + value.data()[0] as u64;
-    let nanos = days_since_epoch * NANO_SECONDS_IN_A_DAY + nanoseconds;
+    let days_since_epoch = value.data()[2] as i64 - JULIAN_TO_UNIX_EPOCH_DAYS;
+    let nanoseconds = (value.data()[1] as i64).wrapping_shl(32) + value.data()[0] as i64;
+    let nanos = days_since_epoch.wrapping_mul(NANO_SECONDS_IN_A_DAY) + nanoseconds;
     let millis = nanos / 1_000_000;

     Field::Timestamp(millis)
@@ -547,7 +547,7 @@ fn convert_date_to_string(value: u32) -> String {
 /// Input `value` is a number of milliseconds since the epoch in UTC.
 /// Datetime is displayed in local timezone.
 #[inline]
-fn convert_timestamp_to_string(value: u64) -> String {
+fn convert_timestamp_to_string(value: i64) -> String {
   let dt = Local.timestamp((value / 1000) as i64, 0);
   format!("{}", dt.format("%Y-%m-%d %H:%M:%S %:z"))
 }
sunchao commented 6 years ago

Merged. Thanks @sadikovi . Can you file a new issue (or PR) for the int96 issue? I can also take a look at how Impala does in this case.

sadikovi commented 6 years ago

Oh, thanks! Yes, I will open another PR for Int96. Would appreciate, if you could check how Impala treats such timestamps!