kaskada-ai / kaskada

Modern, open-source event-processing
https://kaskada.io/
Apache License 2.0

feat: Numbers should cast to timestamps as Unix timestamps #257

Open kerinin opened 1 year ago

kerinin commented 1 year ago

Summary: Loading a CSV containing timestamps encoded as Unix timestamps results in times that are all in 1970 (a value like 1682019069, read as nanoseconds rather than seconds, is only ~1.7 seconds after the epoch). We should treat integers as seconds since the Unix epoch when converting to timestamps. We should also allow casting decimals to timestamps.

Example CSV that should be parseable:

t,k,v
1682019069.0,foo,bar

bjchambers commented 1 year ago

I suspect this is because of how prepare and casting use arrow's cast kernel. Specifically, if we go from i64 to timestamp_ns, I believe it just does a cast interpreting the number as nanoseconds. We probably need to go from i64 -> timestamp_ms -> timestamp_ns, or otherwise do math on the i64 before converting.
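
For illustration, here's a minimal arrow-rs sketch of the suspected behavior and the two-step workaround. The input values and the ms -> ns path are assumptions for the example, not the actual prepare code:

use std::sync::Arc;
use arrow::array::{ArrayRef, Int64Array};
use arrow::compute::cast;
use arrow::datatypes::{DataType, TimeUnit};

fn main() -> Result<(), arrow::error::ArrowError> {
    // Hypothetical column of i64 values holding *milliseconds* since the Unix epoch.
    let raw: ArrayRef = Arc::new(Int64Array::from(vec![1_682_019_069_000_i64]));

    // Direct cast: the i64 is reinterpreted as *nanoseconds*, so the result lands in 1970.
    let wrong = cast(&raw, &DataType::Timestamp(TimeUnit::Nanosecond, None))?;

    // Two-step cast: reinterpret as ms first, then convert ms -> ns (scales by 1_000_000).
    let as_ms = cast(&raw, &DataType::Timestamp(TimeUnit::Millisecond, None))?;
    let right = cast(&as_ms, &DataType::Timestamp(TimeUnit::Nanosecond, None))?;

    println!("direct: {wrong:?}\ntwo-step: {right:?}");
    Ok(())
}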

When we do this, we should make sure to do it everywhere:

  1. In prepare, when using an i64 column as the timestamp
  2. In Fenl when writing number_column as timestamp_ns.
  3. Etc.

Specifically, we should make sure that shift_to(number_column as timestamp_ns) is basically the same as using number_column as the time (assuming ordering is correct, etc.).

bjchambers commented 1 year ago

Started looking at this. I'm not sure it's as straightforward as described, especially the case of casting.

The intention of "casting" is to convert between different types. number as timestamp_ns is indicating "interpret the number as a nanosecond timestamp". It doesn't seem like this should "treat the number as milliseconds and then convert that to milliseconds". And there is a way to write that -- (number as timestamp_ms) as timestamp_ns.

For the case of the time column, it may be better to allow specifying a time_unit option that informs how we convert the time in prepare. We could even have time_unit default to ms, which would change the default behavior as desired.

kerinin commented 1 year ago

Maybe the best solution is to separate the casting use case from the timestamp-assignment use case. In the case of casting, you're describing operations in Fenl, so it's possible to do things like you mention, combining multiple casts. At the moment we attempt to infer a timestamp automagically, and the inference is done prior to code execution, so there's no way to describe complex operations.

What do you think about adding a "format" config to the table? I'm thinking of something along these lines:

message Table {
  // ...
  string time_column_name = 1 [(validate.rules).string.min_len = 1];

  oneof time_column_format {
    // The time column contains the amount of time elapsed since the Unix epoch.
    // Valid values are "s", "ms", and "ns", describing the time unit.
    // For example, "ms" indicates that values are the number of milliseconds since the Unix epoch.
    string unix_timestamp = 2;

    // The time column contains a string-formatted timestamp.
    // For example, "%Y-%m-%d %H:%M:%S" corresponds to the timestamp string "2015-09-05 23:56:04".
    string format_string = 3;
  }
  // ...
}

This should give us the information we need to convert the file into the expected timestamp type during preparation, without requiring invasive changes to the rest of the engine or the casting semantics. It would also give us a place to extend if we wanted to accept more formats.
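
For concreteness, here's a rough sketch of how prepare could apply the two variants. The TimeColumnFormat enum and to_epoch_nanos helper are hypothetical stand-ins for the proposed config (using chrono for the format-string case), not existing Kaskada code:

use chrono::NaiveDateTime;

// Hypothetical mirror of the proposed oneof.
enum TimeColumnFormat {
    UnixTimestamp(String), // "s", "ms", or "ns"
    FormatString(String),  // strftime-style format, e.g. "%Y-%m-%d %H:%M:%S"
}

// Convert one raw time value to nanoseconds since the Unix epoch.
fn to_epoch_nanos(raw: &str, format: &TimeColumnFormat) -> Option<i64> {
    match format {
        TimeColumnFormat::UnixTimestamp(unit) => match unit.as_str() {
            // Parse seconds as f64 so decimal values like "1682019069.0" work.
            "s" => raw.parse::<f64>().ok().map(|s| (s * 1e9) as i64),
            "ms" => raw.parse::<i64>().ok()?.checked_mul(1_000_000),
            "ns" => raw.parse::<i64>().ok(),
            _ => None,
        },
        // Parse a formatted string, then take nanoseconds since the epoch
        // (and_utc / timestamp_nanos_opt require a recent chrono).
        TimeColumnFormat::FormatString(fmt) => NaiveDateTime::parse_from_str(raw, fmt)
            .ok()?
            .and_utc()
            .timestamp_nanos_opt(),
    }
}

fn main() {
    let fmt = TimeColumnFormat::UnixTimestamp("s".to_string());
    // The value from the example CSV: roughly 2023-04-20 UTC.
    println!("{:?}", to_epoch_nanos("1682019069.0", &fmt));
}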

bjchambers commented 1 year ago

One issue with the above is that it could contradict the schema in the Parquet file, etc. It may be better to handle this as part of a "richer" schema for CSV files.