ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.33k stars 599 forks source link

feat: Support for DuckDB’s RANGE function with two timestamps for BigQuery Compatibility #10487

Open BugaM opened 1 week ago

BugaM commented 1 week ago

Is your feature request related to a problem?

In DuckDB, the RANGE function requires a third argument (an interval) when working with timestamps. This differs from BigQuery’s implementation, where the RANGE function can take two timestamp arguments without an interval. This incompatibility prevents the following query, which works in BigQuery, from running in DuckDB.

import ibis
import pandas as pd
from datetime import datetime

conn = ibis.duckdb.connect()

data = pd.DataFrame({
    'id': [1, 2, 1, 2],
    'value': [10, 20, 30, 40],
    'timestamp': [
        datetime(2023, 1, 1, 12, 0),
        datetime(2023, 1, 2, 12, 0),
        datetime(2023, 1, 3, 12, 0),
        datetime(2023, 1, 4, 12, 0)
    ]
})

conn.create_table('test_table', data)

query = conn.sql("""
    SELECT id, value, RANGE(timestamp, LEAD(timestamp) OVER (PARTITION BY id ORDER BY timestamp ASC)) as timestamp_range
    FROM test_table 
""", dialect="bigquery")

Error:

duckdb.duckdb.BinderException: Binder Error: No function matches the given name and argument types 'range(TIMESTAMP, TIMESTAMP)'. You might need to add explicit type casts.
        Candidate functions:
        range(BIGINT) -> BIGINT[]
        range(BIGINT, BIGINT) -> BIGINT[]
        range(BIGINT, BIGINT, BIGINT) -> BIGINT[]
        range(TIMESTAMP, TIMESTAMP, INTERVAL) -> TIMESTAMP[]
        range(TIMESTAMP WITH TIME ZONE, TIMESTAMP WITH TIME ZONE, INTERVAL) -> TIMESTAMP WITH TIME ZONE[]

LINE 1: DESCRIBE SELECT id, value, RANGE(timestamp, LEAD(timestamp) OVER (...

The same query works in BigQuery:

WITH test_table AS (
    SELECT 1 AS id, 10 AS value, TIMESTAMP("2023-01-01 12:00:00") AS timestamp UNION ALL
    SELECT 2 AS id, 20 AS value, TIMESTAMP("2023-01-02 12:00:00") AS timestamp UNION ALL
    SELECT 1 AS id, 30 AS value, TIMESTAMP("2023-01-03 12:00:00") AS timestamp UNION ALL
    SELECT 2 AS id, 40 AS value, TIMESTAMP("2023-01-04 12:00:00") AS timestamp
)

SELECT id, value, RANGE(timestamp, LEAD(timestamp) OVER (PARTITION BY id ORDER BY timestamp ASC)) as timestamp_range
    FROM test_table 

Resulting:

image

What is the motivation behind your request?

I’m using Ibis to test BigQuery queries locally. This feature is needed to make sure BigQuery operators work the same way in Ibis. Specifically, it would support doing as of joins as suggested by the documentation. Enabling it also envolves adding the RANGE_CONTAINS function.

Describe the solution you'd like

Add support to BQ's RANGE and RANGE_CONTAINS functions when BigQuery dialect is chosen.

What version of ibis are you running?

9.5.0

What backend(s) are you using, if any?

DuckDB, BigQuery

Code of Conduct

gforsyth commented 2 days ago

Hi @BugaM -- if you are executing SQL queries via .sql then the translation issue here is with SQLGlot, not with Ibis.
sqlglot is what we use for handling dialect translation, and while we subclass some dialects to make tweaks, DuckDB has pretty robust support, so we currently don't do that.

This seems like an issue to open upstream at https://github.com/tobymao/sqlglot