pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

ColumnNotFoundError after join #9778

Closed arnabanimesh closed 1 year ago

arnabanimesh commented 1 year ago

Issue description

Running the sample code raises exceptions.ColumnNotFoundError: Idx

The issue doesn't occur when the LazyFrame is built from a dictionary. It occurs when the CSV file is read with scan_csv.

Sample datasets attached: a.csv b.csv

Reproducible example

import polars as pl

df = pl.scan_csv("a.csv").with_row_count("Idx")
sec_df = pl.scan_csv("b.csv").with_row_count("B Idx")  # defined but not used below
df = df.join(df, on="B")  # self-join on column "B"
print(df.collect())
grouped_df = df.groupby("A").all()
print(grouped_df.collect())  # raises ColumnNotFoundError: Idx

Expected behavior

The code should run

Installed versions

```
--------Version info---------
Polars:      0.18.6
Index type:  UInt32
Platform:    Windows-10-10.0.22621-SP0
Python:      3.11.4 (tags/v3.11.4:d2340ef, Jun 7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_sqlite:
connectorx:
deltalake:
fsspec:
matplotlib:
numpy:
pandas:
pyarrow:
pydantic:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```

cmdlineluser commented 1 year ago

tempfile.NamedTemporaryFile can be used to inline your example:

import polars as pl
from tempfile import NamedTemporaryFile

csv_a = NamedTemporaryFile()
csv_a.write(b"""
A,B
Gr1,A
Gr1,B
""".strip())

csv_a.seek(0)

df_a = pl.scan_csv(csv_a.name).with_row_count("Idx")

df_a.join(df_a, on="B").collect()
# shape: (2, 5)
# ┌─────┬─────┬─────┬───────────┬─────────┐
# │ Idx ┆ A   ┆ B   ┆ Idx_right ┆ A_right │
# │ --- ┆ --- ┆ --- ┆ ---       ┆ ---     │
# │ u32 ┆ str ┆ str ┆ u32       ┆ str     │
# ╞═════╪═════╪═════╪═══════════╪═════════╡
# │ 0   ┆ Gr1 ┆ A   ┆ 0         ┆ Gr1     │
# │ 1   ┆ Gr1 ┆ B   ┆ 1         ┆ Gr1     │
# └─────┴─────┴─────┴───────────┴─────────┘

df_a.join(df_a, on="B").groupby("A").all().collect()
# ColumnNotFoundError: Idx

Oddly enough, if you select just the Idx column on its own, it's there:

df_a.join(df_a, on="B").select("Idx").collect()
# shape: (2, 1)
# ┌─────┐
# │ Idx │
# │ --- │
# │ u32 │
# ╞═════╡
# │ 0   │
# │ 1   │
# └─────┘

df_a.join(df_a, on="B").select("Idx", "A").collect()
# ColumnNotFoundError: Idx
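
A side note on the temp-file trick: the reporter is on Windows, where a file created with NamedTemporaryFile may not be reopenable by name while it is still open. A possible alternative, sketched here with the standard library only, is to write the CSV into a temporary directory. The file name a.csv and the read-back with the csv module are just for illustration; the path could be passed to pl.scan_csv instead, as long as the query is collected before the directory is removed.

```python
import csv
import tempfile
from pathlib import Path

CSV_TEXT = "A,B\nGr1,A\nGr1,B"

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "a.csv"
    csv_path.write_text(CSV_TEXT)  # the file handle is closed once this returns

    # At this point csv_path could be handed to pl.scan_csv(csv_path).
    # Here the file is only read back with the stdlib to show it is usable.
    with csv_path.open(newline="") as handle:
        rows = list(csv.DictReader(handle))

print(rows)
```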

avimallu commented 1 year ago

The example that @cmdlineluser provided works just fine in Polars 0.18.4 and stopped working from 0.18.5:

>>> import polars as pl
>>> from tempfile import NamedTemporaryFile
>>> 
>>> csv_a = NamedTemporaryFile()
>>> csv_a.write(b"""
... A,B
... Gr1,A
... Gr1,B
... """.strip())
15
>>> 
>>> csv_a.seek(0)
0
>>> 
>>> df_a = pl.scan_csv(csv_a.name).with_row_count("Idx")
>>> df_a.join(df_a, on="B").groupby("A").all().collect()
shape: (1, 5)
┌─────┬───────────┬────────────┬───────────┬────────────────┐
│ A   ┆ Idx       ┆ B          ┆ Idx_right ┆ A_right        │
│ --- ┆ ---       ┆ ---        ┆ ---       ┆ ---            │
│ str ┆ list[u32] ┆ list[str]  ┆ list[u32] ┆ list[str]      │
╞═════╪═══════════╪════════════╪═══════════╪════════════════╡
│ Gr1 ┆ [0, 1]    ┆ ["A", "B"] ┆ [0, 1]    ┆ ["Gr1", "Gr1"] │
└─────┴───────────┴────────────┴───────────┴────────────────┘
>>> pl.show_versions()
--------Version info---------
Polars:      0.18.4
Index type:  UInt32
Platform:    macOS-13.4.1-arm64-arm-64bit
Python:      3.10.9 (main, Jan 11 2023, 09:18:18) [Clang 14.0.6 ]

----Optional dependencies----
numpy:       1.24.3
pandas:      1.5.3
pyarrow:     11.0.0
connectorx:  0.3.1
deltalake:   0.10.0
fsspec:      2023.4.0
matplotlib:  3.7.1
xlsx2csv:    0.8.1
xlsxwriter:  3.0.9
avimallu commented 1 year ago

> works just fine in Polars 0.18.4 and stopped working from 0.18.5:

Some git bisect sleuthing says that https://github.com/pola-rs/polars/pull/9700 is the cause of this error.
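
For readers unfamiliar with the technique: git bisect is essentially a binary search over the commit history between a known-good and a known-bad revision. The sketch below illustrates the idea on a made-up history. The commit names c1 to c5 and the culprit are hypothetical placeholders, not real Polars commits, and is_bad stands in for "build this commit and run the reproduction".

```python
def first_bad(commits, is_bad):
    """Return the index of the first bad commit.

    Assumes commits[0] is good, commits[-1] is bad, and that every commit
    after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return hi


# Hypothetical history between the two release tags.
commits = ["0.18.4", "c1", "c2", "c3", "c4", "c5", "0.18.5"]
culprit = "c3"  # assumption for the demo only
tested = []


def is_bad(commit):
    # Stand-in for building the commit and running the failing example.
    tested.append(commit)
    return commits.index(commit) >= commits.index(culprit)


found = commits[first_bad(commits, is_bad)]
print(found, tested)
```

With seven revisions, three test runs are enough to isolate the first bad one, which is why bisecting stays practical even across many commits.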