Closed sadikovi closed 6 years ago
@sunchao can you review this PR?
I also found that INT96 values in 10k-v2.parquet
file are [0, 0, 0]. I am not if this is a valid INT96, and our parsing currently does not handle those values and fails. Can you confirm that these values are valid/invalid?
Files with Coverage Reduction | New Missed Lines | % | ||
---|---|---|---|---|
record/api.rs | 11 | 96.83% | ||
<!-- | Total: | 11 | --> |
Totals | |
---|---|
Change from base Build 580: | 0.002% |
Covered Lines: | 11395 |
Relevant Lines: | 11938 |
I also found that INT96 values in 10k-v2.parquet file are [0, 0, 0]. I am not if this is a valid INT96, and our parsing currently does not handle those values and fails. Can you confirm that these values are valid/invalid?
I think it is valid since [0, 0, 0]
represents int96 value 0, right? Is the failure relate to this PR?
It is not really related, but I found these limitations while trying to read 10k-v2.parquet file. It looks like [0,0,0] Int96 is a date long in past, for the default date of 1 Jan 1970 Int96 needs to have 2440588 as julian days. But it looks like [0, 0, 0] are allowed, since it is possible to write them. I will have a look.
I found that it overflows and wraps several times, and eventual time is int96_field: 1970-01-01 00:59:59 +01:00
, with milliseconds being -7952618389194. The actual date of 1970-01-01-00:00:00 is Julian days = 2440588.
I can make it work with the following changes, let me know what we should do in this case. It would be interesting to know what Impala returns for INT96[0, 0, 0]:
diff --git a/src/record/api.rs b/src/record/api.rs
index 35eab7b..8caf29f 100644
--- a/src/record/api.rs
+++ b/src/record/api.rs
@@ -60,7 +60,7 @@ pub trait RowAccessor {
fn get_long(&self, i: usize) -> Result<i64>;
fn get_float(&self, i: usize) -> Result<f32>;
fn get_double(&self, i: usize) -> Result<f64>;
- fn get_timestamp(&self, i: usize) -> Result<u64>;
+ fn get_timestamp(&self, i: usize) -> Result<i64>;
fn get_decimal(&self, i: usize) -> Result<&Decimal>;
fn get_string(&self, i: usize) -> Result<&String>;
fn get_bytes(&self, i: usize) -> Result<&ByteArray>;
@@ -105,7 +105,7 @@ impl RowAccessor for Row {
row_primitive_accessor!(get_long, Long, i64);
row_primitive_accessor!(get_float, Float, f32);
row_primitive_accessor!(get_double, Double, f64);
- row_primitive_accessor!(get_timestamp, Timestamp, u64);
+ row_primitive_accessor!(get_timestamp, Timestamp, i64);
row_complex_accessor!(get_decimal, Decimal, Decimal);
row_complex_accessor!(get_string, Str, String);
row_complex_accessor!(get_bytes, Bytes, ByteArray);
@@ -165,7 +165,7 @@ pub trait ListAccessor {
fn get_long(&self, i: usize) -> Result<i64>;
fn get_float(&self, i: usize) -> Result<f32>;
fn get_double(&self, i: usize) -> Result<f64>;
- fn get_timestamp(&self, i: usize) -> Result<u64>;
+ fn get_timestamp(&self, i: usize) -> Result<i64>;
fn get_decimal(&self, i: usize) -> Result<&Decimal>;
fn get_string(&self, i: usize) -> Result<&String>;
fn get_bytes(&self, i: usize) -> Result<&ByteArray>;
@@ -215,7 +215,7 @@ impl ListAccessor for List {
list_primitive_accessor!(get_long, Long, i64);
list_primitive_accessor!(get_float, Float, f32);
list_primitive_accessor!(get_double, Double, f64);
- list_primitive_accessor!(get_timestamp, Timestamp, u64);
+ list_primitive_accessor!(get_timestamp, Timestamp, i64);
list_complex_accessor!(get_decimal, Decimal, Decimal);
list_complex_accessor!(get_string, Str, String);
list_complex_accessor!(get_bytes, Bytes, ByteArray);
@@ -278,7 +278,7 @@ impl<'a> ListAccessor for MapList<'a> {
map_list_primitive_accessor!(get_long, Long, i64);
map_list_primitive_accessor!(get_float, Float, f32);
map_list_primitive_accessor!(get_double, Double, f64);
- map_list_primitive_accessor!(get_timestamp, Timestamp, u64);
+ map_list_primitive_accessor!(get_timestamp, Timestamp, i64);
list_complex_accessor!(get_decimal, Decimal, Decimal);
list_complex_accessor!(get_string, Str, String);
list_complex_accessor!(get_bytes, Bytes, ByteArray);
@@ -330,7 +330,7 @@ pub enum Field {
/// Unix epoch, 1 January 1970.
Date(u32),
/// Milliseconds from the Unix epoch, 1 January 1970.
- Timestamp(u64),
+ Timestamp(i64),
// ----------------------------------------------------------------------
// Complex types
@@ -422,13 +422,13 @@ impl Field {
/// `Timestamp` value.
#[inline]
pub fn convert_int96(_descr: &ColumnDescPtr, value: Int96) -> Self {
- const JULIAN_TO_UNIX_EPOCH_DAYS: u64 = 2_440_588;
- const MILLI_SECONDS_IN_A_DAY: u64 = 86_400_000;
- const NANO_SECONDS_IN_A_DAY: u64 = MILLI_SECONDS_IN_A_DAY * 1_000_000;
+ const JULIAN_TO_UNIX_EPOCH_DAYS: i64 = 2_440_588;
+ const MILLI_SECONDS_IN_A_DAY: i64 = 86_400_000;
+ const NANO_SECONDS_IN_A_DAY: i64 = MILLI_SECONDS_IN_A_DAY * 1_000_000;
- let days_since_epoch = value.data()[2] as u64 - JULIAN_TO_UNIX_EPOCH_DAYS;
- let nanoseconds: u64 = ((value.data()[1] as u64) << 32) + value.data()[0] as u64;
- let nanos = days_since_epoch * NANO_SECONDS_IN_A_DAY + nanoseconds;
+ let days_since_epoch = value.data()[2] as i64 - JULIAN_TO_UNIX_EPOCH_DAYS;
+ let nanoseconds = (value.data()[1] as i64).wrapping_shl(32) + value.data()[0] as i64;
+ let nanos = days_since_epoch.wrapping_mul(NANO_SECONDS_IN_A_DAY) + nanoseconds;
let millis = nanos / 1_000_000;
Field::Timestamp(millis)
@@ -547,7 +547,7 @@ fn convert_date_to_string(value: u32) -> String {
/// Input `value` is a number of milliseconds since the epoch in UTC.
/// Datetime is displayed in local timezone.
#[inline]
-fn convert_timestamp_to_string(value: u64) -> String {
+fn convert_timestamp_to_string(value: i64) -> String {
let dt = Local.timestamp((value / 1000) as i64, 0);
format!("{}", dt.format("%Y-%m-%d %H:%M:%S %:z"))
}
Merged. Thanks @sadikovi . Can you file a new issue (or PR) for the int96 issue? I can also take a look at how Impala does in this case.
Oh, thanks! Yes, I will open another PR for Int96. Would appreciate, if you could check how Impala treats such timestamps!
This PR adds conversion for FIXED_LEN_BYTE_ARRAY when no logical type is specified. We convert it into
Field::Bytes(byte_array)
.Added test to verify conversion.