pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.31k stars 1.96k forks source link

HAVING in SQL is ignored #19310

Open alippai opened 3 weeks ago

alippai commented 3 weeks ago

Checks

Reproducible example

Both lazy and eager fails to raise error or evaluate the having condition

import polars as pl
pl.DataFrame({"a": [1]}).sql("""
    select sum(a) from self having sum(a) != 1
""")

Log output

With or without the HAVING condition (deleting the HAVING) it gives the same plan:

 SELECT [col("a").sum()] FROM
  DF ["a"]; PROJECT 1/1 COLUMNS; SELECTION: None

Issue description

HAVING condition is ignored both in lazy and eager modes

Expected behavior

HAVING is applied or error (NotImplemented) is raised

Installed versions

``` --------Version info--------- Polars: 1.9.0 Index type: UInt32 Platform: Linux Python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0] ```
alexander-beedie commented 3 weeks ago

Thanks for the heads-up. Should raise a syntax error here; we can't evaluate a HAVING outside of a GROUP BY (currently) πŸ€”

alippai commented 3 weeks ago

@alexander-beedie this happens with group by as well. Maybe I made the example too minimal πŸ˜…

alexander-beedie commented 3 weeks ago

@alexander-beedie this happens with group by as well. Maybe I made the example too minimal πŸ˜…

Odd - the HAVING clause is known to work; for example:

import polars as pl

df = pl.DataFrame({
    "key": ["aaa", "aaa", "bbb", "bbb", "bbb"],
    "value": [100, 25, -200, 50, 135],
})

df.sql("""
  SELECT 
    key,
    COUNT(key) AS n
  FROM self
  GROUP BY key
  HAVING n > 2
""")
# β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
# β”‚ key ┆ n   β”‚
# β”‚ --- ┆ --- β”‚
# β”‚ str ┆ u32 β”‚
# β•žβ•β•β•β•β•β•ͺ═════║
# β”‚ bbb ┆ 3   β”‚
# β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜

Got an example where it doesn't?

alippai commented 3 weeks ago

Let me explore this more:

  1. The query SELECT count(key) AS n FROM df HAVING n >1 supposed to work, it works in the SQL engines I tested
  2. I had a large example where HAVING was not applied
alexander-beedie commented 3 weeks ago
  1. The query SELECT count(key) AS n FROM df HAVING n >1 supposed to work, it works in the SQL engines I tested

This is one of those "some SQL engines allow it, some don't" cases, and is arguably better written (and understood) as a standard WHERE clause. I prefer that we raise a SQLSyntaxError here for now, addressing the current issue (where it is silently ignored), and we can consider revisiting this behaviour in the future πŸ€”

alippai commented 3 weeks ago

Agree that raising is a good temp solution, that nuance can be discussed in a separate issue (although I'd be surprised if this would be a question of flavor and not the ANSI SQL).

Good news, I've managed to reproduce my original issue. It's about the alias vs original expression usage:

import polars as pl

df = pl.DataFrame({
    "key": ["aaa", "aaa", "bbb", "bbb", "bbb"],
    "value": [100, 25, -200, 50, 135],
})

df.sql("""
  SELECT 
    key,
    COUNT(key) AS n
  FROM self
  GROUP BY key
  HAVING n > 2
""")

vs

import polars as pl

df = pl.DataFrame({
    "key": ["aaa", "aaa", "bbb", "bbb", "bbb"],
    "value": [100, 25, -200, 50, 135],
})

df.sql("""
  SELECT 
    key,
    COUNT(key) AS n
  FROM self
  GROUP BY key
  HAVING COUNT(key) > 2
""")
alippai commented 3 weeks ago

While experimenting with the above I came across a slightly related edge-case. Let me know if I should open a separate issue for it:

import polars as pl
pl.sql("select key, sum(value) as key from (select 1 as key, 2 as value) x group by 1").collect()

The above prints only one key column with no error message, sum(value) is lost.

In the end, maybe my other issue is related as well: https://github.com/pola-rs/polars/issues/19290

alexander-beedie commented 2 weeks ago

FYI: the latest release (a few hours ago) will now raise a helpful syntax error when seeing a bare HAVING (outside of a GROUP BY). I'll take a look at adding support it properly a little later. Have re-tagged this Issue as an enhancement now.

I'd be surprised if this would be a question of flavor and not the ANSI SQL

If there's one thing I've learned about SQL over the years, it's that it is always a question of flavour 🀣

A recent example I looked at, involving strftime formatting: (Desired datetime format: "%Y-%m-%d %H:%M:%S.%f")

Backend Syntax
Doris DATE_FORMAT(dt,'%Y-%m-%d %H:%i:%S.%f')
DuckDB STRFTIME(dt,'%Y-%m-%d %H:%M:%S.%f')
PostgreSQL TO_CHAR(dt,'YYYY-MM-DD HH24:MI:SS.US')
SQLite STRFTIME('%Y-%m-%d %H:%M:%f',dt)

Every function and every formatting string is different.

alippai commented 2 weeks ago

@alexander-beedie https://github.com/pola-rs/polars/issues/19310#issuecomment-2424188378 this sounds to be a common bug (regardless of the HAVING without a GROUP BY which is handled now).

The aliases in aggregate context won’t work:

  1. When the key has an alias key as x
  2. When the aggregated column is aliased to the original name sum(value) as value and using value in HAVING
  3. When the aggregated column is renamed to the key sum(value) as key
alippai commented 1 week ago

@alexander-beedie do you think any of the 3 bugs could be a good first issue to solve? If you point me into the right direction I could give it a try.