pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
27.25k stars 1.67k forks source link

RFC: Additional SQL functions #7227

Open universalmind303 opened 1 year ago

universalmind303 commented 1 year ago

Problem description

Using this as a sort of backlog for polars-sql functionality.

This is by no means an exhaustive list of sql functionality, but serves as an unprioritized checklist of functionality that can be added to polars sql. Some of the functions below may be irrelevant in polars-sql.

Feel free to add/remove/suggest any other SQL functions that you feel are of importance to polars-sql.

Community requested

Functions

Miscellaneous

Pattern Matching

Null-aware Comparison

Conditional / Comparison

Mathematical

Random Functions

Trig Functions

String Functions

Binary String Functions

Binary String Conversion Functions

Date functions

Read Functions

Write functions

lucazanna commented 1 year ago

Hi @universalmind303 ,

When I think about SQL, I always think about CTEs and Window functions (including specifying the window frame with range/rows). Would you know if this is planned?

universalmind303 commented 1 year ago

@lucazanna we do support window functions sum(x) over (partition by y). If there is anything that specifically doesn't work, please provide some sql examples & we can look at adding. CTE's are planned & I updated the list to include CTEs

lucazanna commented 1 year ago

Thank you @universalmind303 for the CTE!

For the window function, I often use the 'frame type' option: row frame or range frame.

Below is a description I found for Spark SQL (https://www.databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html)

OVER (PARTITION BY ... ORDER BY ... frame_type BETWEEN start AND end) Here, frame_type can be either ROWS (for ROW frame) or RANGE (for RANGE frame); start can be any of UNBOUNDED PRECEDING, CURRENT ROW, PRECEDING, and FOLLOWING; and end can be any of UNBOUNDED FOLLOWING, CURRENT ROW, PRECEDING, and FOLLOWING.

I would like to do something similar to the first answer on this SO question: https://stackoverflow.com/questions/30861919/what-is-rows-unbounded-preceding-used-for-in-teradata

And this is what I tried:

df = pl.LazyFrame({
    't': [1,2,3,4,5,6],
    'a': [1,5,3,5,4,11]
})

sql = pl.SQLContext()
sql.register('df',df)

sql.query("""
SELECT t, a, sum(a) OVER (ORDER BY t ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM df
ORDER BY t
""")

I have the error: ComputeError: expected keys in groupby operation, got nothing
2-5 commented 1 year ago

I actually wanted to open an Enhancement request regarding time functions.

If I have a pl.Datetime column, how can I test time components? Something like time_col.dt.hour < 10 in an SQL context. Seems like extract/date_part is required for that.

potter-potter commented 1 year ago

Yes please for CTEs! Main thing I feel is missing for SQL.

lucazanna commented 1 year ago

not sure how long it would take to implement them, but would it be possible to have the RANK function and the EXTRACT function?

Extract will allow for easier working with dates, and rank functions is useful in several analytical queries

universalmind303 commented 1 year ago

@ritchie46, I've been thinking about #8425, and wanted to get your thoughts on the direction you'd like to see out of polars sql support. Maybe this is a bit premature, as we don't even fully support all of the lazyframe api yet. But I see it as one of 2 directions.

a lightweight & familiar way of expressing a logical plan via SQL (simliar to datafusion)

This would entail exposing all functionality in the logical plan/lazyframe to SQL.

I think we are already well on track to this one. There's still a long way to go, but we're making good progress on increasing the SQL query functionality.

a full fledged database (similar to duckdb or sqlite)

So this would entail all of the above, but also

ritchie46 commented 1 year ago

Given that 2, needs 1. I'd say lets look at 1 first and see what the world looks like once we are upon that hill? :)

universalmind303 commented 1 year ago

Makes sense, Wasn't sure if there was a clear direction you wanted to head in. I guess it makes sense to do the first one before trying to implement ADBC, or some of those other larger tasks associated with the latter.

chitralverma commented 1 year ago

I'd like to add the following to this list,

csubhodeep commented 1 year ago

Could we please have a new category for basic functions like the ones requested in #8901 #8889 #8866?

lucazanna commented 1 year ago

What about adding priorities (1,2,3, etc) to the functions ? This can give users an idea if those functions are likely to be added earlier or later

universalmind303 commented 1 year ago

I've been mostly doing them when issues for specific feature requests come in. I updated the OP with a COMMUNITY REQUESTED section that I'll use as the highest priority features. feel free to suggest more either here, or by opening up individual feature requests. @lucazanna @csubhodeep

lucazanna commented 1 year ago

@universalmind303 : that's a good idea. can we add EXTRACT and RANK to the community requested? I see that EXTRACT was also requested by @2-5 above

and table aliasing :) When working with multiple tables joined together, that makes for more concise syntax

csubhodeep commented 1 year ago

I agree with @lucazanna. I think we could have a separate category of window functions. And have prios within them.

We could take the following as references:

  1. Window Functions
csubhodeep commented 1 year ago

Where could we keep a priority list for bug fixes such as #8890 .

Also @universalmind303 @ritchie46 @alexander-beedie could we please add #8918 to the Community requested section.

Also would there be a possibility to have some kind of polls for the functions/features mentioned above. It may require to move this thread from Issue to Discussions (which is not enabled for this repo I believe).

jqnatividad commented 1 year ago

+100 on moving and breaking up this thread to multiple threads in GitHub Discussions

jqnatividad commented 1 year ago

Polars SQL has CREATE TABLE <TBLNAME> AS SELECT support.

It'd be great if it also supports the complementary DROP TABLE and TRUNCATE TABLE statements.

jqnatividad commented 1 year ago

Additional date functions:

In addition, the OVERLAPS operator as well .

This would pretty much make Polars SQL at parity with PostgreSQL 15's date/time capabilities

hennyboiszz commented 12 months ago

request for additional SQL function

(edit typo ifnull to isnull)

returns the result of the expression in every single case, except if the resulting value is a null value. If it is a null value then return the alternative value which can be either a default value or another expression.

used it for error handling, to catch if any subqueries failed in a stored proc and if so provide them a value put in place of causing a stored proc to fail.

brunorpinho commented 3 months ago

I have two additional requests:

  1. More robust statistical functions like quantile, median, skew...

  2. to be able to register my own UDF to call from SQL if that is possible

alexander-beedie commented 2 months ago
  1. More robust statistical functions like quantile, median, skew...

@brunorpinho: FYI, just added MEDIAN - now available in 0.20.19; will have a look at percentiles a bit later. I don't believe there is a standard PostgreSQL function for skewness, but I note that DuckDB has once, so I can see if any other DBs also have it and add according to the most common syntax 🤔

alexander-beedie commented 2 months ago

ISNULL(expression, alt_value) => value

@hennyboiszz: This is equivalent to COALESCE(expression, alt_value), which we support 👌 (ISNULL is also not PostgreSQL syntax, which is the flavour of SQL that we most closely adhere to; IFNULL is also available though).

alexander-beedie commented 2 months ago

It'd be great if it also supports the complementary DROP TABLE and TRUNCATE TABLE statements.

@jqnatividad: Done - DROP TABLE was added a little while ago, and TRUNCATE TABLE just landed in 0.20.19 ;)

jqnatividad commented 2 months ago

Following up on @csubhodeep 's original May 2023 request to add PostgreSQL's window functions to Polars SQL, I'm adding my vote to add them...

https://www.postgresql.org/docs/current/functions-window.html