pola-rs / polars-cli

CLI interface for running SQL queries with Polars as backend
https://pola.rs/
MIT License

Inconsistent COUNT(*) and GROUP BY behavior in Polars CLI #71

Open kwkeefer opened 1 month ago

kwkeefer commented 1 month ago

Reproducible example

# generate test.csv
cat<<EOF > test.csv
a
test
test
test2
test3
EOF

# run group by query
echo "SELECT COUNT(*) AS _count, a FROM read_csv('test.csv') GROUP BY a;" | polars

Output

┌────────┬───────┐
│ _count ┆ a     │
│ ---    ┆ ---   │
│ u32    ┆ str   │
╞════════╪═══════╡
│ 3      ┆ test2 │
│ 3      ┆ test3 │
│ 3      ┆ test  │
└────────┴───────┘

Issue description

COUNT(*) does not appear to be computed per group when run through the CLI: every group reports the same count (3) instead of its own row count.
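
As an independent check of what the per-group counts should be, here is a minimal sketch that tallies column a using only the Python standard library. The inline CSV text mirrors test.csv from the reproducible example, and the helper name group_counts is illustrative, not part of Polars.

```python
import csv
import io
from collections import Counter

# Same contents as test.csv in the reproducible example (header row plus 4 data rows).
CSV_TEXT = "a\ntest\ntest\ntest2\ntest3\n"


def group_counts(csv_text, column):
    """Return {value: row count}, i.e. what COUNT(*) ... GROUP BY <column> should yield."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)


counts = group_counts(CSV_TEXT, "a")
total_rows = sum(counts.values())

for value, n in counts.items():
    print(n, value)
print("total rows:", total_rows)
```

This prints 2 for test and 1 each for test2 and test3, with 4 data rows in total. The file therefore has 4 rows and 3 distinct values, so the 3 reported by the CLI coincides with the number of groups rather than the total row count, which may help narrow down where the aggregation goes wrong.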

Expected behavior

import polars as pl

df = pl.read_csv('test.csv')

with pl.SQLContext(register_globals=True, eager=True) as ctx:
    df_small = ctx.execute("SELECT COUNT(*) AS _count, a FROM df GROUP BY a")
    print(df_small)

Saving the script above as polarstest.py and running it prints the expected per-group counts:

python3 polarstest.py
shape: (3, 2)
┌────────┬───────┐
│ _count ┆ a     │
│ ---    ┆ ---   │
│ u32    ┆ str   │
╞════════╪═══════╡
│ 2      ┆ test  │
│ 1      ┆ test3 │
│ 1      ┆ test2 │
└────────┴───────┘

Installed version

0.8.0