Open kcajf opened 2 weeks ago
I like this idea, although it would probably not happen any time soon. It is quite a breaking change.
Note that the low cardinality of `include_file_paths` is mostly caught by the fact that `pl.String` is actually a StringView, which only contains each path once. So I think the savings of a categorical would be rather minimal, if not completely nullified by the inherent overhead of a categorical.
I would personally prefer to have some sort of internal partitioned column that can handle this transparently.
Thanks, that's useful to know.
The most common action I'm doing on this column is `.replace_strict(str2strdict, return_dtype=pl.Categorical)`. Do you expect that would be significantly faster with categoricals? I would imagine that `replace` on Categoricals could actually be ~constant-time and zero-copy?
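To make the intuition behind that question concrete, here is a toy dictionary-encoded column in plain Python. It is only a sketch of the general idea and not how Polars implements categoricals; `ToyCategorical` and its methods are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyCategorical:
    """Hypothetical dictionary-encoded column (not Polars' implementation)."""

    categories: list[str]  # one entry per distinct value
    codes: list[int]       # one small integer per row, indexing into categories

    def to_list(self) -> list[str]:
        # Decode every row back to its string value.
        return [self.categories[code] for code in self.codes]

    def replace_strict(self, mapping: dict[str, str]) -> "ToyCategorical":
        # Only the dictionary is rewritten, so the cost scales with the number
        # of distinct values rather than the number of rows. A category missing
        # from the mapping raises KeyError, which is the "strict" part.
        new_categories = [mapping[category] for category in self.categories]
        # The per-row codes are shared with the input, not copied. If two old
        # categories map to the same new value, the new dictionary contains
        # duplicates; a real implementation might need to handle that.
        return ToyCategorical(new_categories, self.codes)


paths = ToyCategorical(categories=["x/a", "x/b"], codes=[0, 0, 1, 0])
labels = paths.replace_strict({"x/a": "A", "x/b": "B"})
print(labels.to_list())
```

In this toy model the work depends on the number of distinct paths, not on the number of rows, which is the property the question is asking about. Whether Polars could or does take such a shortcut is not something this sketch establishes.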
Description
Currently, the column given in `include_file_paths` is of type `pl.String`. I think, given the low cardinality, it might be better for it to be of type `pl.Categorical`?

Elaborating on my use cases:
I've got a list of file paths, containing path components that I want to eventually have as columns (e.g. `blahblah/col_a/col_b/file.parquet`). These aren't in a schema that polars recognises (and if it did, I still wouldn't want to glob the entire tree, because I only want to load a specific subset of files). As a result, I'm using `include_file_paths` to preserve the file path, then creating some new columns that apply the file path through some mapping (`Dict[str, str]`). I think this would be faster if the column were categorical (besides the general performance problem reported here: https://github.com/pola-rs/polars/issues/19538).