pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Column type of `scan_parquet` `include_file_paths`. #19539

Open kcajf opened 2 weeks ago

kcajf commented 2 weeks ago

Description

Currently, the column added by `include_file_paths` is of type `pl.String`. Given its low cardinality, might it be better for it to be of type `pl.Categorical`?

Elaborating on my use case:

I've got a list of file paths containing path components that I eventually want to have as columns (e.g. `blahblah/col_a/col_b/file.parquet`). The paths aren't in a layout that Polars recognises, and even if they were, I wouldn't want to glob the entire tree, because I only want to load a specific subset of files. As a result, I'm using `include_file_paths` to preserve the file path, then creating new columns by passing the file path through a mapping (`Dict[str, str]`). I think this would be faster if the column were categorical (separately from the general performance problem reported in https://github.com/pola-rs/polars/issues/19538).

coastalwhite commented 1 week ago

I like this idea, although it would probably not happen any time soon. It is quite a breaking change.

Note that the low cardinality of `include_file_paths` is already mostly handled by the fact that `pl.String` is actually a StringView, which only contains each path once. So I think the savings from a categorical would be rather minimal, if not completely cancelled out by the inherent overhead of a categorical.

I would personally prefer to have some sort of internal partitioned column that can handle this transparently.

kcajf commented 1 week ago

Thanks, that's useful to know.

The most common action I'm doing on this column is `.replace_strict(str2strdict, return_dtype=pl.Categorical)`. Do you expect that would be significantly faster with categoricals? I would imagine that replace on categoricals could actually be ~constant-time and zero-copy?
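To illustrate the intuition, here is a toy sketch of a dictionary-encoded column in plain Python. It is not how Polars implements categoricals or `replace_strict`; `DictEncoded` and `replace_strict_encoded` are hypothetical names. The point is only that the replacement touches one entry per distinct value and can reuse the per-row codes unchanged.

```python
from dataclasses import dataclass


@dataclass
class DictEncoded:
    """Toy dictionary-encoded column: one integer code per row, one string per category."""

    codes: list[int]
    categories: list[str]

    def to_list(self) -> list[str]:
        # Decode back to one string per row.
        return [self.categories[code] for code in self.codes]


def replace_strict_encoded(col: DictEncoded, mapping: dict[str, str]) -> DictEncoded:
    """Replace values by rewriting only the categories.

    Work is proportional to the number of distinct values, not the number of rows.
    A category missing from the mapping raises KeyError, mimicking "strict".
    """
    new_categories = [mapping[category] for category in col.categories]
    # The codes list is shared with the input rather than copied.
    return DictEncoded(codes=col.codes, categories=new_categories)


col = DictEncoded(
    codes=[0, 1, 0, 0, 1],
    categories=["blahblah/x/file.parquet", "blahblah/y/file.parquet"],
)
out = replace_strict_encoded(
    col,
    {"blahblah/x/file.parquet": "x", "blahblah/y/file.parquet": "y"},
)
```

One caveat with this sketch: if the mapping sends two categories to the same value, the new category list contains duplicates, so a real implementation would probably need to merge them and rewrite the codes.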