Closed arnabanimesh closed 1 year ago
tempfile.NamedTemporaryFile can be used to inline your example:
import polars as pl
from tempfile import NamedTemporaryFile
csv_a = NamedTemporaryFile()
csv_a.write(b"""
A,B
Gr1,A
Gr1,B
""".strip())
csv_a.seek(0)
df_a = pl.scan_csv(csv_a.name).with_row_count("Idx")
df_a.join(df_a, on="B").collect()
# shape: (2, 5)
# ┌─────┬─────┬─────┬───────────┬─────────┐
# │ Idx ┆ A   ┆ B   ┆ Idx_right ┆ A_right │
# │ --- ┆ --- ┆ --- ┆ ---       ┆ ---     │
# │ u32 ┆ str ┆ str ┆ u32       ┆ str     │
# ╞═════╪═════╪═════╪═══════════╪═════════╡
# │ 0   ┆ Gr1 ┆ A   ┆ 0         ┆ Gr1     │
# │ 1   ┆ Gr1 ┆ B   ┆ 1         ┆ Gr1     │
# └─────┴─────┴─────┴───────────┴─────────┘
df_a.join(df_a, on="B").groupby("A").all().collect()
# ColumnNotFoundError: Idx
Oddly enough, if you select just the Idx column on its own, it's there, but selecting it together with another column fails:
df_a.join(df_a, on="B").select("Idx").collect()
# shape: (2, 1)
# ┌─────┐
# │ Idx │
# │ --- │
# │ u32 │
# ╞═════╡
# │ 0   │
# │ 1   │
# └─────┘
df_a.join(df_a, on="B").select("Idx", "A").collect()
# ColumnNotFoundError: Idx
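A side note on the temporary-file trick used above, since it is easy to get wrong when adapting the example. The data has to reach the disk before a path-based reader such as scan_csv opens the file by name. In the snippet above, csv_a.seek(0) appears to do that as a side effect (as far as I know, seeking a buffered file flushes pending writes in CPython); calling flush() states the intent directly. The sketch below uses only the standard library and assumes a POSIX system, because reopening a NamedTemporaryFile by name while it is still open generally fails on Windows.

```python
from tempfile import NamedTemporaryFile

# Same CSV payload as the inlined example in this thread.
csv_bytes = b"A,B\nGr1,A\nGr1,B"

with NamedTemporaryFile(suffix=".csv") as csv_a:
    written = csv_a.write(csv_bytes)  # number of bytes accepted by the buffer
    csv_a.flush()                     # make sure they are actually on disk

    # Reopen the file by name, the way a path-based reader would.
    with open(csv_a.name, "rb") as reader:
        on_disk = reader.read()

print(written, on_disk == csv_bytes)
```

The file is deleted when the with block exits, so anything that reads it lazily by path has to be collected inside the block.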
The example that @cmdlineluser provided works just fine in Polars 0.18.4 and stopped working from 0.18.5:
>>> import polars as pl
>>> from tempfile import NamedTemporaryFile
>>>
>>> csv_a = NamedTemporaryFile()
>>> csv_a.write(b"""
... A,B
... Gr1,A
... Gr1,B
... """.strip())
15
>>>
>>> csv_a.seek(0)
0
>>>
>>> df_a = pl.scan_csv(csv_a.name).with_row_count("Idx")
>>> df_a.join(df_a, on="B").groupby("A").all().collect()
shape: (1, 5)
┌─────┬───────────┬────────────┬───────────┬────────────────┐
│ A   ┆ Idx       ┆ B          ┆ Idx_right ┆ A_right        │
│ --- ┆ ---       ┆ ---        ┆ ---       ┆ ---            │
│ str ┆ list[u32] ┆ list[str]  ┆ list[u32] ┆ list[str]      │
╞═════╪═══════════╪════════════╪═══════════╪════════════════╡
│ Gr1 ┆ [0, 1]    ┆ ["A", "B"] ┆ [0, 1]    ┆ ["Gr1", "Gr1"] │
└─────┴───────────┴────────────┴───────────┴────────────────┘
>>> pl.show_versions()
--------Version info---------
Polars: 0.18.4
Index type: UInt32
Platform: macOS-13.4.1-arm64-arm-64bit
Python: 3.10.9 (main, Jan 11 2023, 09:18:18) [Clang 14.0.6 ]
----Optional dependencies----
numpy: 1.24.3
pandas: 1.5.3
pyarrow: 11.0.0
connectorx: 0.3.1
deltalake: 0.10.0
fsspec: 2023.4.0
matplotlib: 3.7.1
xlsx2csv: 0.8.1
xlsxwriter: 3.0.9
Regarding "works just fine in Polars 0.18.4 and stopped working from 0.18.5": some git bisect sleuthing says that https://github.com/pola-rs/polars/pull/9700 is the cause of this error.
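For anyone who wants to repeat that kind of bisect, here is a rough sketch of a script that could be passed to git bisect run. It is not the script that was actually used for the result above; the names classify and reproduce are made up for illustration, and the Polars calls are the 0.18-era ones from this thread. It relies on the documented git bisect run convention that exit code 0 means good, 125 means skip, and other codes from 1 to 127 mean bad.

```python
import tempfile

# Exit codes understood by `git bisect run`.
GOOD, BAD, SKIP = 0, 1, 125


def classify(check):
    """Run `check` and map its outcome to a `git bisect run` exit code."""
    try:
        check()
    except ImportError:
        # Polars could not be imported at this commit (e.g. a failed build).
        return SKIP
    except Exception as exc:
        # Only the error from this issue counts as "bad"; anything else
        # is treated as inconclusive so the commit gets skipped.
        return BAD if type(exc).__name__ == "ColumnNotFoundError" else SKIP
    return GOOD


def reproduce():
    """The failing query from this thread, using the Polars 0.18 API names."""
    import polars as pl  # imported lazily so import failures reach classify()

    with tempfile.NamedTemporaryFile(suffix=".csv") as csv_a:
        csv_a.write(b"A,B\nGr1,A\nGr1,B")
        csv_a.flush()
        df_a = pl.scan_csv(csv_a.name).with_row_count("Idx")
        df_a.join(df_a, on="B").groupby("A").all().collect()


# To use with `git bisect run`, end the file with:
#     raise SystemExit(classify(reproduce))
```

The Python package has to be rebuilt at every bisect step before the script runs, otherwise each commit would be tested against the same installed build.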
Polars version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.
Issue description
Error when running the sample code:
exceptions.ColumnNotFoundError: Idx
The issue doesn't occur when the LazyFrame is constructed from a dictionary. It occurs only when the CSV file is read using scan_csv.
Sample datasets attached: a.csv b.csv
Reproducible example
Expected behavior
The code should run
Installed versions