Open mkleinbort-wl opened 1 week ago
include_file_paths was added for most of the formats: https://github.com/pola-rs/polars/pull/17563
>>> pl.scan_csv("*.csv", include_file_paths="filename").collect()
shape: (2, 4)
┌─────┬─────┬─────┬──────────┐
│ a ┆ b ┆ c ┆ filename │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╪══════════╡
│ 1 ┆ 2 ┆ 3 ┆ a.csv │
│ 4 ┆ 5 ┆ 6 ┆ b.csv │
└─────┴─────┴─────┴──────────┘
Seems it just needs to be exposed via read_csv
Thank you, I was on an old version of Polars and had not noticed. Adding it to the eager methods would be nice.
Yes, let's expose this to the eager methods as well.
This would greatly benefit from using a categorical for the include_file_paths columns, no? Presumably the number of records is typically much greater than the number of files.
Trying to tackle this one
Description
It is occasionally true that the filename of a data file is fairly critical information.
Illustratively
When using glob patterns to read this data, the file name itself is lost - which all but forces the user to loop over the files and read them manually.
A parameter that adds a column with the specific file name when reading data via a glob pattern would be nice to have.