kaskada-ai / kaskada

Modern, open-source event-processing
https://kaskada.io/
Apache License 2.0

Parquet source doesn't respect `time_unit` parameter #785

epinzur closed 11 months ago

epinzur commented 11 months ago

Related to #776.

Parquet files are now loading, but they do not respect the time_unit parameter. Example:

import kaskada as kd

messages = await kd.sources.Parquet.create(
    "messages.parquet",
    time_column="ts",
    key_column="channel",
    time_unit="s",  # `ts` holds seconds since the Unix epoch
)
messages.preview(5)
   _time                          _key     thread_ts  ts                 channel  user      text
0  1970-01-01 00:00:01.690783200  General  NaN        1690783200.000000  General  fb3da5bf  Good morning team! I hope everyone had a great...
1  1970-01-01 00:00:01.690783201  Project  NaN        1690783201.000000  Project  bb9d2b01  Good morning everyone! I had a relaxing weeken...
2  1970-01-01 00:00:01.690783202  General  NaN        1690783202.000000  General  03cc4325  Morning all! My weekend was good. I'm also loo...
3  1970-01-01 00:00:01.690783203  General  NaN        1690783203.000000  General  3e44cfa1  Hey everyone! Had a good weekend too. Ready to...
4  1970-01-01 00:00:01.690783204  General  NaN        1690783204.000000  General  ea27bbff  Morning team! Hope you all had a great weekend...

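When loading the same data from JSONL, the first row's _time is parsed as 2023-07-31 06:00:00. The Parquet preview instead shows 1970-01-01 00:00:01.690783200, so the ts values (seconds) appear to be interpreted as nanoseconds.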
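
The arithmetic behind the two readings can be checked with a small standalone sketch. It uses only the Python standard library and does not touch Kaskada. The helper `to_datetime` and the `_MICROS_PER_UNIT` table are hypothetical names made up for this illustration, and Python's `datetime` only keeps microseconds, so the nanosecond reading is truncated to 6 decimal places.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Hypothetical lookup table for this sketch; not part of the kaskada API.
_MICROS_PER_UNIT = {"s": 1_000_000, "ms": 1_000, "us": 1}


def to_datetime(raw: float, time_unit: str) -> datetime:
    """Interpret `raw` as a count of `time_unit` since the Unix epoch."""
    if time_unit == "ns":
        # datetime has microsecond resolution, so drop the last 3 digits.
        micros = int(raw) // 1_000
    else:
        micros = round(raw * _MICROS_PER_UNIT[time_unit])
    return EPOCH + timedelta(microseconds=micros)


raw_ts = 1690783200.0  # `ts` value of the first row in the preview

expected = to_datetime(raw_ts, "s")   # what time_unit="s" should give
observed = to_datetime(raw_ts, "ns")  # what the Parquet preview shows

print("seconds:    ", expected)
print("nanoseconds:", observed)
```

Reading the value as seconds gives the 2023-07-31 06:00:00 timestamp seen with JSONL. Reading it as nanoseconds gives a time about 1.69 seconds after the epoch, which matches the Parquet preview.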

epinzur commented 11 months ago

The test data for the above is here: https://github.com/kaskada-ai/beep-gpt/blob/esp/generate_slack/slack-generation/messages.parquet

jordanrfrazier commented 11 months ago

Hm yeah. This feature is new in the new prepare code, so the time unit hasn't been plumbed through the old code yet.