Open alippai opened 3 weeks ago
Thanks for the heads-up.
Should raise a syntax error here; we can't evaluate a HAVING
outside of a GROUP BY
(currently) π€
@alexander-beedie this happens with group by as well. Maybe I made the example too minimal π
@alexander-beedie this happens with group by as well. Maybe I made the example too minimal π
Odd - the HAVING
clause is known to work; for example:
import polars as pl
df = pl.DataFrame({
"key": ["aaa", "aaa", "bbb", "bbb", "bbb"],
"value": [100, 25, -200, 50, 135],
})
df.sql("""
SELECT
key,
COUNT(key) AS n
FROM self
GROUP BY key
HAVING n > 2
""")
# βββββββ¬ββββββ
# β key β n β
# β --- β --- β
# β str β u32 β
# βββββββͺββββββ‘
# β bbb β 3 β
# βββββββ΄ββββββ
Got an example where it doesn't?
Let me explore this more:
SELECT count(key) AS n FROM df HAVING n >1
supposed to work, it works in the SQL engines I tested
- The query
SELECT count(key) AS n FROM df HAVING n >1
supposed to work, it works in the SQL engines I tested
This is one of those "some SQL engines allow it, some don't" cases, and is arguably better written (and understood) as a standard WHERE clause. I prefer that we raise a SQLSyntaxError
here for now, addressing the current issue (where it is silently ignored), and we can consider revisiting this behaviour in the future π€
Agree that raising is a good temp solution, that nuance can be discussed in a separate issue (although I'd be surprised if this would be a question of flavor and not the ANSI SQL).
Good news, I've managed to reproduce my original issue. It's about the alias vs original expression usage:
import polars as pl
df = pl.DataFrame({
"key": ["aaa", "aaa", "bbb", "bbb", "bbb"],
"value": [100, 25, -200, 50, 135],
})
df.sql("""
SELECT
key,
COUNT(key) AS n
FROM self
GROUP BY key
HAVING n > 2
""")
vs
import polars as pl
df = pl.DataFrame({
"key": ["aaa", "aaa", "bbb", "bbb", "bbb"],
"value": [100, 25, -200, 50, 135],
})
df.sql("""
SELECT
key,
COUNT(key) AS n
FROM self
GROUP BY key
HAVING COUNT(key) > 2
""")
While experimenting with the above I came across a slightly related edge-case. Let me know if I should open a separate issue for it:
import polars as pl
pl.sql("select key, sum(value) as key from (select 1 as key, 2 as value) x group by 1").collect()
The above prints only one key
column with no error message, sum(value)
is lost.
In the end, maybe my other issue is related as well: https://github.com/pola-rs/polars/issues/19290
FYI: the latest release (a few hours ago) will now raise a helpful syntax error when seeing a bare HAVING (outside of a GROUP BY). I'll take a look at adding support it properly a little later. Have re-tagged this Issue as an enhancement now.
I'd be surprised if this would be a question of flavor and not the ANSI SQL
If there's one thing I've learned about SQL over the years, it's that it is always a question of flavour π€£
A recent example I looked at, involving strftime formatting: (Desired datetime format: "%Y-%m-%d %H:%M:%S.%f")
Backend | Syntax |
---|---|
Doris | DATE_FORMAT(dt,'%Y-%m-%d %H:%i:%S.%f') |
DuckDB | STRFTIME(dt,'%Y-%m-%d %H:%M:%S.%f') |
PostgreSQL | TO_CHAR(dt,'YYYY-MM-DD HH24:MI:SS.US') |
SQLite | STRFTIME('%Y-%m-%d %H:%M:%f',dt) |
Every function and every formatting string is different.
@alexander-beedie https://github.com/pola-rs/polars/issues/19310#issuecomment-2424188378 this sounds to be a common bug (regardless of the HAVING without a GROUP BY which is handled now).
The aliases in aggregate context wonβt work:
key as x
sum(value) as value
and using value
in HAVINGsum(value) as key
@alexander-beedie do you think any of the 3 bugs could be a good first issue to solve? If you point me into the right direction I could give it a try.
Checks
Reproducible example
Both lazy and eager fails to raise error or evaluate the
having
conditionLog output
With or without the
HAVING
condition (deleting theHAVING
) it gives the same plan:Issue description
HAVING
condition is ignored both in lazy and eager modesExpected behavior
HAVING
is applied or error (NotImplemented
) is raisedInstalled versions