paradedb / pg_analytics

DuckDB-powered analytics for Postgres
https://paradedb.com
PostgreSQL License

feat: 1414: Adds time_bucket() support for duckdb fdw #32

Closed devanbenz closed 2 months ago

devanbenz commented 2 months ago

Signed-off-by: Devan <devandbenz@gmail.com>

Ticket(s) Closed

What

Tests

  • test_time_bucket_minutes -> tests aggregating by minutes/timestamps
  • test_time_bucket_years -> tests aggregating by date
  • test_time_bucket_fallback -> ensures that when used on a regular Postgres table the query falls back (there is currently no functioning UDF; this only works with the DuckDB FDW)
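For context, the queries these tests exercise look roughly like the sketch below; the table and column names are illustrative, not taken from the actual test suite:

```sql
-- Bucket rows from a DuckDB-backed foreign table into 1-minute groups.
-- "events" and "event_ts" are hypothetical names used only for illustration.
SELECT time_bucket(INTERVAL '1 minute', event_ts) AS bucket,
       count(*) AS rows_in_bucket
FROM events
GROUP BY bucket
ORDER BY bucket;
```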

devanbenz commented 2 months ago

I am actually going to natively make this supported by Postgres. Will let you know when the PR is ready again.

devanbenz commented 2 months ago

Just need to add some more tests and implement the offsets/origin, but currently the date- and time-based time_bucket() functions are working with Postgres + DuckDB. TIL that in order to debug pg extensions you need to attach to the running PID of postgres 😅
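(For anyone following along, the usual way to do that is to grab the backend PID for your session and attach a debugger to it; this is a general Postgres technique, not anything specific to this PR.)

```sql
-- Returns the PID of the backend process serving the current session.
-- Attach a debugger to that PID from a shell (e.g. gdb -p <pid> or
-- lldb -p <pid>), then run the query that exercises the extension.
SELECT pg_backend_pid();
```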

devanbenz commented 2 months ago

Through some edge-case testing I realized that my implementation was not working as expected. I went through a few iterations locally but ultimately implemented the following calculation, which works correctly against my edge cases.

result = ((date - origin_unix_epoch) / interval) * interval

This will truncate the chunk of time to make it divisible by the interval. Whatever the result is will be the time bucket (in epoch seconds) that then gets converted to a regular PG timestamp or date. Currently my Timestamp implementation supports UTC timestamps. I will be creating another function using the PG type TimestampWithTimeZone for timezone support in the following days. Prior to implementing this arithmetic I was thinking about setting up a sliding window and then doing some sort of search on it but a simple calculation should be much much MUCH less computationally complex.
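To make the arithmetic concrete, here is a plain-SQL rendering of the same truncation for a 5-minute bucket with the Unix epoch as the origin; this is only an illustration, not the extension's actual implementation:

```sql
-- Truncate epoch seconds down to the nearest multiple of the bucket width
-- (300 s), then convert back to a timestamp. The origin is the Unix epoch
-- here, so the (date - origin) subtraction from the formula above is implicit.
SELECT to_timestamp(floor(extract(epoch FROM ts) / 300)::bigint * 300) AS bucket
FROM (VALUES (timestamptz '2024-01-01 00:07:42+00')) AS v(ts);
-- The sample row falls into the 2024-01-01 00:05:00+00 bucket.
```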

Sorry this is taking a bit longer than expected 😅 It appears to be more complex than I initially thought. I will also be writing more tests for this, specifically for the Postgres UDF. I personally am a big fan of writing more tests than actual feature code 😆.

philippemnoel commented 2 months ago

> Through some edge-case testing I realized that my implementation was not working as expected. I went through a few iterations locally but ultimately implemented the following calculation, which works correctly against my edge cases.
>
> result = ((date - origin_unix_epoch) / interval) * interval
>
> This will truncate the chunk of time to make it divisible by the interval. Whatever the result is will be the time bucket (in epoch seconds) that then gets converted to a regular PG timestamp or date. Currently my Timestamp implementation supports UTC timestamps. I will be creating another function using the PG type TimestampWithTimeZone for timezone support in the following days. Prior to implementing this arithmetic I was thinking about setting up a sliding window and then doing some sort of search on it but a simple calculation should be much much MUCH less computationally complex.
>
> Sorry this is taking a bit longer than expected 😅 It appears to be more complex than I initially thought. I will also be writing more tests for this, specifically for the Postgres UDF. I personally am a big fan of writing more tests than actual feature code 😆.

Appreciate the thoroughness and the tests, you're doing great. Tagging @od0, who requested the feature in the first place, to check the logic.

philippemnoel commented 2 months ago

> Through some edge-case testing I realized that my implementation was not working as expected. I went through a few iterations locally but ultimately implemented the following calculation, which works correctly against my edge cases.
>
> result = ((date - origin_unix_epoch) / interval) * interval
>
> This will truncate the chunk of time to make it divisible by the interval. Whatever the result is will be the time bucket (in epoch seconds) that then gets converted to a regular PG timestamp or date. Currently my Timestamp implementation supports UTC timestamps. I will be creating another function using the PG type TimestampWithTimeZone for timezone support in the following days. Prior to implementing this arithmetic I was thinking about setting up a sliding window and then doing some sort of search on it but a simple calculation should be much much MUCH less computationally complex.
>
> Sorry this is taking a bit longer than expected 😅 It appears to be more complex than I initially thought. I will also be writing more tests for this, specifically for the Postgres UDF. I personally am a big fan of writing more tests than actual feature code 😆.

This may be a relevant resource: https://github.com/timescale/timescaledb/blob/6ed3e2bdb139fc6bf8c653ed9b4e609c695b985e/src/time_bucket.c#L260

rebasedming commented 2 months ago

If you want an escape hatch from testing and maintaining our own time_bucket implementation, just create a time_bucket function in Postgres that has the same signature as the DuckDB one and have it throw an error. Users should only hit this error if the DuckDB query fails and it falls back to Postgres, which is better than this function not being supported at all.
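A minimal sketch of that escape hatch, assuming a (interval, timestamp) signature like DuckDB's; the exact signature and error text here are placeholders, not the extension's actual code:

```sql
-- Hypothetical stub: same shape as DuckDB's time_bucket(interval, timestamp),
-- but it always raises, so a query that is not pushed down to DuckDB fails
-- with a clear message instead of "function does not exist".
CREATE OR REPLACE FUNCTION time_bucket(bucket_width interval, ts timestamp)
RETURNS timestamp
LANGUAGE plpgsql AS $$
BEGIN
  RAISE EXCEPTION 'time_bucket() is only supported on DuckDB-backed tables';
END;
$$;
```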

devanbenz commented 2 months ago

> If you want an escape hatch from testing and maintaining our own time_bucket implementation, just create a time_bucket function in Postgres that has the same signature as the DuckDB one and have it throw an error. Users should only hit this error if the DuckDB query fails and it falls back to Postgres, which is better than this function not being supported at all.

Okay so my understanding is the following:

  • If an end user attempts to query a non-FDW table using time_bucket(), it should immediately throw an error indicating that they are attempting to use a function that is not currently supported.
  • Any and all FDW queries using time_bucket() will just be pushed down to DuckDB; if that fails, it will also throw.

@rebasedming is that what you're thinking? @philippemnoel & @od0 I want to verify that sounds okay feature-wise. I have most of the native UDF (except offset) implemented for Postgres, but I can modify the code to follow that flow if preferred. This was the original route I was going down and I can roll back commits to it, no problem. Regardless, I'm okay with either.

philippemnoel commented 2 months ago

> If you want an escape hatch from testing and maintaining our own time_bucket implementation, just create a time_bucket function in Postgres that has the same signature as the DuckDB one and have it throw an error. Users should only hit this error if the DuckDB query fails and it falls back to Postgres, which is better than this function not being supported at all.
>
> Okay so my understanding is the following:
>
>   • If an end user attempts to query a non-FDW table using time_bucket(), it should immediately throw an error indicating that they are attempting to use a function that is not currently supported.
>   • Any and all FDW queries using time_bucket() will just be pushed down to DuckDB; if that fails, it will also throw.
>
> @rebasedming is that what you're thinking? @philippemnoel & @od0 I want to verify that sounds okay feature-wise. I have most of the native UDF (except offset) implemented for Postgres, but I can modify the code to follow that flow if preferred. This was the original route I was going down and I can roll back commits to it, no problem. Regardless, I'm okay with either.

I'll defer to @rebasedming on the final call, but my understanding is that time-series over a regular Postgres table is probably not very useful. While it's cool to have it implemented for any table, we expect analytics queries to be run only against the DuckDB-powered tables, since the performance on regular Postgres tables would be poor.

That doesn't mean it wouldn't be used, just less used. What you could do is split this PR into two: the first one does what Ming suggested, and the second one implements time_bucket in Postgres.

devanbenz commented 2 months ago

> If you want an escape hatch from testing and maintaining our own time_bucket implementation, just create a time_bucket function in Postgres that has the same signature as the DuckDB one and have it throw an error. Users should only hit this error if the DuckDB query fails and it falls back to Postgres, which is better than this function not being supported at all.
>
> Okay so my understanding is the following:
>
>   • If an end user attempts to query a non-FDW table using time_bucket(), it should immediately throw an error indicating that they are attempting to use a function that is not currently supported.
>   • Any and all FDW queries using time_bucket() will just be pushed down to DuckDB; if that fails, it will also throw.
>
> @rebasedming is that what you're thinking? @philippemnoel & @od0 I want to verify that sounds okay feature-wise. I have most of the native UDF (except offset) implemented for Postgres, but I can modify the code to follow that flow if preferred. This was the original route I was going down and I can roll back commits to it, no problem. Regardless, I'm okay with either.
>
> I'll defer to @rebasedming on the final call, but my understanding is that time-series over a regular Postgres table is probably not very useful. While it's cool to have it implemented for any table, we expect analytics queries to be run only against the DuckDB-powered tables, since the performance on regular Postgres tables would be poor.
>
> That doesn't mean it wouldn't be used, just less used. What you could do is split this PR into two: the first one does what Ming suggested, and the second one implements time_bucket in Postgres.

Great, that sounds good to me!

devanbenz commented 2 months ago

@rebasedming when you get a chance this is now ready

devanbenz commented 2 months ago

@rebasedming thanks for taking a look :) -- will address everything later tonight or tomorrow morning 👍

UPDATE: Changes have been made

devanbenz commented 2 months ago

@rebasedming do you know if anyone is working on fixing the failing test in dev? Taking a look at the CI, the failing test is not associated with this PR.

devanbenz commented 2 months ago

Awesome, I see it was fixed :)

philippemnoel commented 2 months ago

> Awesome, I see it was fixed :)

We reverted the faulty PR, sorry about that! And apologies for the silence here, we've been slammed on some search work for a customer, but we'll get back to reviewing analytics PRs very soon!

od0 commented 2 months ago

hey @devanbenz awesome work on this!

Just built and tested from your branch. Works great from what I can see, with one caveat: when applying a LIMIT to a query with time_bucket, the bucket is still calculated for all rows in the table.

e.g. `select time_bucket('5m', trips.tpep_pickup_datetime) from trips limit 10;` touches all rows, not just the first 10.

Using a subquery seems to work fine as a workaround: `select time_bucket('5m', t.tpep_pickup_datetime) from (select * from trips limit 10) t;`
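One generic way to check whether the LIMIT reaches the foreign scan is to look at the query plan; what the plan shows for DuckDB-backed tables depends on the FDW, so treat this as a starting point rather than a definitive diagnosis:

```sql
-- Inspect where the LIMIT is applied relative to the foreign scan;
-- the exact plan shape depends on the FDW's pushdown support.
EXPLAIN (ANALYZE, VERBOSE)
SELECT time_bucket('5m', trips.tpep_pickup_datetime)
FROM trips
LIMIT 10;
```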