kerinin opened 1 year ago
I suspect this is because of how prepare and casting use arrow's cast kernel. Specifically, if we go from `i64` to `timestamp_ns`, I believe it just does a cast interpreting the number as nanoseconds. We probably need to go from `i64 -> timestamp_ms -> timestamp_ns`, or otherwise do math on the `i64` before converting.
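To illustrate the suspected behavior, here is a small arrow-rs sketch (not from the issue; the exact cast semantics depend on the arrow version in use) comparing the direct `i64 -> timestamp_ns` cast with the two-step `i64 -> timestamp_ms -> timestamp_ns` path:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::compute::cast;
use arrow::datatypes::{DataType, TimeUnit};

fn main() -> arrow::error::Result<()> {
    // A unix timestamp in milliseconds stored as i64: 2015-09-05T23:56:04Z.
    let raw: ArrayRef = Arc::new(Int64Array::from(vec![1_441_497_364_000_i64]));

    // Direct cast: the integer is reinterpreted as *nanoseconds* since the epoch,
    // which lands the value in 1970 -- the behavior reported in this issue.
    let direct = cast(&raw, &DataType::Timestamp(TimeUnit::Nanosecond, None))?;

    // Two-step cast: interpret as milliseconds first, then convert to nanoseconds.
    let as_ms = cast(&raw, &DataType::Timestamp(TimeUnit::Millisecond, None))?;
    let as_ns = cast(&as_ms, &DataType::Timestamp(TimeUnit::Nanosecond, None))?;

    println!("direct: {direct:?}");
    println!("two-step: {as_ns:?}");
    Ok(())
}
```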
When we do this, we should make sure to do it everywhere: both an `i64` column used as the timestamp and `number_column as timestamp_ns`. Specifically, we should make sure that `shift_to(number_column as timestamp_ns)` is basically the same as using `number_column` as the time (assuming ordering is correct, etc.).
Started looking at this. I'm not sure it's as straightforward as described, especially the case of casting.

The intention of "casting" is to convert between different types. `number as timestamp_ns` indicates "interpret the number as a nanosecond timestamp". It doesn't seem like this should mean "treat the number as milliseconds and then convert that to nanoseconds". And there is a way to write that -- `(number as timestamp_ms) as timestamp_ns`.
For the case of the time column, it may be better to allow specifying a `time_unit` option that informs how we convert the time in prepare. We could even have the `time_unit` option default to `ms`, which would change the default behavior as desired.
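As a rough sketch of what that could look like during prepare (names like `TimeUnitOption` and `convert_time_column` are hypothetical, not existing engine code):

```rust
use arrow::array::ArrayRef;
use arrow::compute::cast;
use arrow::datatypes::{DataType, TimeUnit};
use arrow::error::Result;

/// Hypothetical `time_unit` option: the unit the raw integer time column is expressed in.
/// Per the proposal above, this would default to `Milliseconds`.
enum TimeUnitOption {
    Seconds,
    Milliseconds,
    Nanoseconds,
}

/// Convert a raw i64 time column to nanosecond timestamps, interpreting the
/// integers according to the configured unit.
fn convert_time_column(raw: &ArrayRef, unit: TimeUnitOption) -> Result<ArrayRef> {
    let interpreted_as = match unit {
        TimeUnitOption::Seconds => DataType::Timestamp(TimeUnit::Second, None),
        TimeUnitOption::Milliseconds => DataType::Timestamp(TimeUnit::Millisecond, None),
        TimeUnitOption::Nanoseconds => DataType::Timestamp(TimeUnit::Nanosecond, None),
    };
    // First interpret the integers in the configured unit, then convert to nanoseconds.
    let typed = cast(raw, &interpreted_as)?;
    cast(&typed, &DataType::Timestamp(TimeUnit::Nanosecond, None))
}
```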
Maybe the best solution is to separate the casting use case from the timestamp-assignment use case. In the case of casting, you're describing operations in Fenl, so it's possible to do things like you mention where multiple casts are used. At the moment we attempt to infer a timestamp automagically, and the inference is done prior to code execution so there's no way to describe complex operations.
What do you think about adding a "format" config to the table? I'm thinking something along the lines of this:
```proto
message Table {
  // ...
  string time_column_name = 1 [(validate.rules).string.min_len = 1];

  oneof time_column_format {
    // The time column describes the amount of time since the unix epoch.
    // Valid values are "s", "ms", and "ns", and describe the time unit.
    // For example, "ms" indicates that values describe the number of milliseconds since the unix epoch.
    string unix_timestamp = 2;

    // The time column describes a string-formatted timestamp.
    // For example, "%Y-%m-%d %H:%M:%S" would correspond to the timestamp string "2015-09-05 23:56:04".
    string format_string = 3;
  }
  // ...
}
```
This should give us the information we need to convert the file into the expected timestamp type during preparation, without requiring invasive changes to the rest of the engine or the casting semantics. It would also give us a place to extend if we wanted to accept more formats.
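For the `format_string` case, preparation could parse each value with something like chrono (an illustrative sketch only, assuming a reasonably recent `chrono`; `parse_with_format` is a hypothetical helper, not engine code):

```rust
use chrono::NaiveDateTime;

/// Hypothetical helper: parse a string time value using the configured
/// `format_string` and return milliseconds since the unix epoch.
fn parse_with_format(value: &str, format_string: &str) -> Option<i64> {
    NaiveDateTime::parse_from_str(value, format_string)
        .ok()
        .map(|dt| dt.and_utc().timestamp_millis())
}

fn main() {
    // The format/value pair from the proto comment above.
    let ms = parse_with_format("2015-09-05 23:56:04", "%Y-%m-%d %H:%M:%S");
    assert_eq!(ms, Some(1_441_497_364_000));
}
```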
One issue with the above is that it could potentially contradict the schema in the Parquet file, etc. It may be better to handle this as part of a "richer" schema for CSV files.
Summary

Loading a CSV containing timestamps encoded as unix timestamps results in times that are all in 1970. We should treat integers as seconds since the unix epoch when converting to timestamps. We should also allow casting decimals to timestamps.

Example CSV that should be parseable: