Closed devanbenz closed 2 months ago
I am actually going to make this natively supported by Postgres. Will let you know when the PR is ready again.
Just need to add some more tests and implement the offsets/origin, but currently the date and time based time_bucket() functions are working with Postgres + DuckDB. TIL that in order to debug pg extensions you need to attach to the running PID of postgres.
Through some edge-case testing I realized that my implementation was not working as expected. I went through a few iterations locally but ultimately implemented the following calculation, which works correctly against my edge cases:

result = ((date - origin_unix_epoch) / interval) * interval

This truncates the chunk of time to make it divisible by the interval. Whatever the result is becomes the time bucket (in epoch seconds), which then gets converted to a regular PG timestamp or date. Currently my Timestamp implementation supports UTC timestamps. I will be creating another function using the PG type TimestampWithTimeZone for timezone support in the following days. Prior to implementing this arithmetic I was considering setting up a sliding window and then doing some sort of search on it, but a simple calculation should be much, much less computationally complex.
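For concreteness, the calculation above can be sketched in plain Rust (a minimal illustration, not the extension's actual code; the function and variable names here are my own):

```rust
// Sketch of the bucketing arithmetic quoted above. All values are in epoch
// seconds; integer division performs the truncation that snaps a timestamp
// down to the start of its bucket.
fn time_bucket(date: i64, origin_unix_epoch: i64, interval: i64) -> i64 {
    ((date - origin_unix_epoch) / interval) * interval
}

fn main() {
    // 2024-01-01T00:07:30Z (1704067650) with 5-minute (300 s) buckets and the
    // Unix epoch as origin lands in the bucket starting at 00:05:00Z.
    assert_eq!(time_bucket(1_704_067_650, 0, 300), 1_704_067_500);
    // A timestamp already on a bucket boundary maps to itself.
    assert_eq!(time_bucket(1_704_067_200, 0, 300), 1_704_067_200);
    // Caveat: for dates before the origin, `/` truncates toward zero rather
    // than flooring, so a real implementation may need to handle that case.
    println!("ok");
}
```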
Sorry this is taking a bit longer than expected. It appears to be more complex than I initially thought. I will also be writing more tests for this, specifically for the Postgres UDF. I personally am a big fan of writing more tests than actual feature code.
Appreciate the thoroughness and the tests, you're doing great. Tagging @od0, who requested the feature in the first place, to check the logic.
This may be a relevant resource: https://github.com/timescale/timescaledb/blob/6ed3e2bdb139fc6bf8c653ed9b4e609c695b985e/src/time_bucket.c#L260
If you want an escape hatch from testing and maintaining our own time_bucket implementation, just create a time_bucket function in Postgres that has the same signature as the DuckDB one and have it throw an error. Users should only hit this error if the DuckDB query fails and it falls back to Postgres, which is better than this function not being supported at all.
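A sketch of what such an error-throwing stub might look like in a pgrx-based extension (all names and signatures here are assumptions for illustration, not the actual PR code):

```rust
use pgrx::prelude::*;

// Hypothetical stub: same signature as DuckDB's time_bucket, but it only
// raises. Users reach it only when a query falls back from DuckDB to
// Postgres, so they get a clear error instead of a missing-function failure.
#[pg_extern]
fn time_bucket(_bucket_width: Interval, _input: Timestamp) -> Timestamp {
    error!("time_bucket is not supported for regular Postgres tables")
}
```

The stub keeps planning and pushdown unchanged; only the Postgres execution path surfaces the error.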
Okay so my understanding is the following:
- An end user attempts to query a non-FDW table using time_bucket(): this should immediately throw an error indicating that they are attempting to use a function that is not currently supported.
- Any and all FDW queries using time_bucket() will just be pushed down to DuckDB, and if that fails it will also throw.
@rebasedming is that what you're thinking? @philippemnoel & @od0 I want to verify that sounds okay feature-wise. I have most of the native UDF (except offset) implemented for Postgres, but I can modify the code to follow that flow if preferred. This was the original route I was going down, and I can roll back commits to that point, no problem. Regardless, I'm okay with either.
I'll defer to @rebasedming on the final call, but my understanding is that time-series over a regular Postgres table is probably not very useful. While it's cool to have it implemented for any table, we expect analytics queries to be run only against the DuckDB-powered tables, since performance on Postgres tables would be poor.
That doesn't mean it wouldn't be used, just less often. What you could do is split this PR into two: the first does what Ming suggested, and the second implements time_bucket in Postgres.
Great, that sounds good to me!
@rebasedming when you get a chance this is now ready
@rebasedming thanks for taking a look :) -- will address everything later tonight or tomorrow morning.
UPDATE: Changes have been made
@rebasedming do you know if anyone is working on fixing the failing test in dev? Taking a look at CI, the test that is failing is not associated with this PR.
Awesome I see it was fixed :)
We reverted the faulty PR, sorry about that! And apologies for the silence here, we've been slammed on some search work for a customer, but we'll get back to reviewing analytics PRs very soon!
hey @devanbenz awesome work on this!
just built and tested from your branch. Works great from what I can see, with one caveat: when applying a LIMIT to a query with time_bucket, the bucket is still calculated for all rows in the table.

e.g. `select time_bucket('5m', trips.tpep_pickup_datetime) from trips limit 10;` touches all rows, not just the first 10.

Using a subquery seems to work fine as a workaround: `select time_bucket('5m', t.tpep_pickup_datetime) from (select * from trips limit 10) t;`
Signed-off-by: Devan devandbenz@gmail.com

# Ticket(s) Closed

# What

Adds the time_bucket time series data function found within DuckDB: https://duckdb.org/docs/sql/functions/date.html#time_bucketbucket_width-date-offset

# Tests

- test_time_bucket_minutes -> tests for aggregating by minutes/timestamps
- test_time_bucket_years -> tests for aggregating by date
- test_time_bucket_fallback -> test to ensure that when used on a Postgres table it falls back (there is currently no functioning UDF; this only works with the DuckDB FDW)