Open lau-aab opened 1 year ago
It's not 100% clear to me that the solution here is to silence the warning, as an alternative perhaps we can make casts to pl.Utf8
call the __str__
as a fallback if no more efficient method is available, which would make the code suggested by the warning work.
You can also use repr
, as you are not casting, and repr
on Objects (or anything else) won't trigger the "inefficient apply" warning, eg:
pl.col("column_0").apply( repr, return_dtype=pl.Utf8 )
Update: Aahhh... this won't quite result in the same return for UUID values.
The simplest solution is to use str()
when generating your dataframe:
df = pl.DataFrame(
[(str(uuid.uuid4()), f"row_{i}") for i in range(10)],
orient="row"
)
shape: (10, 2)
┌───────────────────────────────────┬──────────┐
│ column_0 ┆ column_1 │
│ --- ┆ --- │
│ str ┆ str │
╞═══════════════════════════════════╪══════════╡
│ ce104fc5-f74f-4bb6-8e87-943f7cce… ┆ row_0 │
│ 53072ab2-6d25-46ac-8ab1-41ad5191… ┆ row_1 │
│ 83e4b412-9dd8-4636-a72a-7ab2bd2e… ┆ row_2 │
│ 944a1832-2b9c-4036-b89e-eb962aca… ┆ row_3 │
│ … ┆ … │
│ 2fdfc508-fe42-40a3-95d9-6a89c449… ┆ row_6 │
│ a7cf4709-5280-47b2-b881-77bfcd8f… ┆ row_7 │
│ be70e823-fcc9-4f7a-bb24-d29cc29d… ┆ row_8 │
│ e0ea8906-ad0a-4a9b-a13c-4cacc4f9… ┆ row_9 │
└───────────────────────────────────┴──────────┘
...but yeah, I don't think we should perhaps make the warning original dtype-dependent.
The simplest solution is to use
str()
when generating your dataframe:
Wouldn't this also be the most efficient solution? I'm assuming that for large DataFrames, calling Python row-by-row would generate a large overhead.
There's of course workarounds possible (e.g. df.with_columns(pl.col("column_0").apply(lambda x: x if isinstance(x, str) else str(x), return_dtype=pl.Utf8))
), but the point is that the suggestion just isn't really correct here. It seems to be that it's really hard to ensure these warnings will never produce false positives, so personally I would like to simply have a way to opt out of PolarsInefficientApplyWarning
s altogether, either globally or for a specific apply.
The way I resolved our immediate issue (namely that our tests broke), was by ignoring this warning in pytest:
[tool.pytest.ini_options]
filterwarnings = [
"error",
"ignore::polars.exceptions.PolarsInefficientApplyWarning",
]
But IMO the usage of apply is simply valid here, and I think there should be a way to prevent triggering this warning.
The simplest solution is to use
str()
when generating your dataframe:Wouldn't this also be the most efficient solution? I'm assuming that for large DataFrames, calling Python row-by-row would generate a large overhead.
As to using str()
when creating the dataframe: this is indeed a solution for the simplified sample code, but is not practical for the code where we actually encountered this issue: we have a function convert_column_types(df)
which is called from different places on different dataframes, and ensures that the columns in df
have consistent types. If we have the convert the column before creation, this means we have to do that in all those places.
I imagine there are many more scenarios where it it impractical to convert the data before creating the dataframe.
the suggestion just isn't really correct here
In the sense that this is an inefficient use of polars it arguably is, but still... point taken ;)
I think we should continue with the warning here, but in conjunction with @orlp's suggestion to support .cast(pl.Utf8)
for Object
dtype data (likely via an internal call to str
against the underlying PyObject).
I'd expect that the vast majority of the time the warning is valid, but we should still support this case so that the suggested (and more idiomatic) syntax is always valid. (It may also be that handling this internally is slightly more efficient; at any rate, it won't be any worse).
@lau-aab you can filter out warnings by including this in your code:
import warnings
import polars as pl
warnings.filterwarnings(
action="ignore",
category=pl.exceptions.PolarsInefficientApplyWarning
)
Observed this issue for map_elements
as well; also for UUIDs. Should I file a separate issue? Guessing it's not a big surprise to anyone.
Running the code below:
transformed = (
df.lazy()
.with_columns(
register=pl.struct(
name=pl.col("register").map_elements(lambda r: r.name),
id=pl.col("register").map_elements(lambda r: r.id, return_dtype=pl.UInt8),
),
customer_id=pl.col("customer_id").map_elements(str), # <-- here's the problem
)
.collect()
)
I get:
Expr.map_elements is significantly slower than the native expressions API.
Only use if you absolutely CANNOT implement your logic otherwise.
In this case, you can replace your `map_elements` with the following:
- pl.col("customer_id").map_elements(str)
+ pl.col("customer_id").cast(pl.Utf8)
Should I file a separate issue?
apply has been renamed to map-elements, so this issue is fine :)
Checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
Issue description
When converting a column containing uuids to string using
apply(str)
, polars raises aPolarsInefficientApplyWarning
exception, suggesting to usecast
instead:When implementing this suggestion, the code fails with a
ComputeError
exception.Expected behavior
It should be possible run run
.apply(str)
without this triggering aPolarsInefficientApplyWarning
exception. For example by a parameter inapply
to disable checks on inefficient applies.Installed versions