pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.22k stars 1.95k forks source link

False positive of PolarsInefficientApplyWarning on apply(str) #10269

Open lau-aab opened 1 year ago

lau-aab commented 1 year ago

Checks

Reproducible example

import polars as pl
import uuid

data = [(uuid.uuid4(), f"row_{i}") for i in range(10)]

df = pl.DataFrame(data)
print(df)

apply_str_df = df.with_columns(
    pl.col("column_0").apply(str, return_dtype=pl.Utf8),  # Converts to uuids into strings
)
print(apply_str_df)

cast_str_df = df.with_columns(
    pl.col("column_0").cast(pl.Utf8),  # Raises exception: exceptions.ComputeError: cannot cast 'Object' type
)
print(cast_str_df)

Issue description

When converting a column containing uuids to string using apply(str), polars raises a PolarsInefficientApplyWarning exception, suggesting to use cast instead:

In this case, you can replace your `apply` with the following:
  - pl.col("column_0").apply(str)
  + pl.col("column_0").cast(pl.Utf8)

When implementing this suggestion, the code fails with a ComputeError exception.

exceptions.ComputeError: cannot cast 'Object' type

Expected behavior

It should be possible run run .apply(str) without this triggering a PolarsInefficientApplyWarning exception. For example by a parameter in apply to disable checks on inefficient applies.

Installed versions

``` --------Version info--------- Polars: 0.18.11 Index type: UInt32 Platform: Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.35 Python: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] ----Optional dependencies---- adbc_driver_sqlite: cloudpickle: connectorx: deltalake: fsspec: matplotlib: numpy: pandas: pyarrow: pydantic: sqlalchemy: xlsx2csv: xlsxwriter: ```
orlp commented 1 year ago

It's not 100% clear to me that the solution here is to silence the warning, as an alternative perhaps we can make casts to pl.Utf8 call the __str__ as a fallback if no more efficient method is available, which would make the code suggested by the warning work.

alexander-beedie commented 1 year ago

You can also use repr, as you are not casting, and repr on Objects (or anything else) won't trigger the "inefficient apply" warning, eg:

pl.col("column_0").apply( repr, return_dtype=pl.Utf8 )

Update: Aahhh... this won't quite result in the same return for UUID values.

mcrumiller commented 1 year ago

The simplest solution is to use str() when generating your dataframe:

df = pl.DataFrame(
    [(str(uuid.uuid4()), f"row_{i}") for i in range(10)],
    orient="row"
)
shape: (10, 2)
┌───────────────────────────────────┬──────────┐
│ column_0                          ┆ column_1 │
│ ---                               ┆ ---      │
│ str                               ┆ str      │
╞═══════════════════════════════════╪══════════╡
│ ce104fc5-f74f-4bb6-8e87-943f7cce… ┆ row_0    │
│ 53072ab2-6d25-46ac-8ab1-41ad5191… ┆ row_1    │
│ 83e4b412-9dd8-4636-a72a-7ab2bd2e… ┆ row_2    │
│ 944a1832-2b9c-4036-b89e-eb962aca… ┆ row_3    │
│ …                                 ┆ …        │
│ 2fdfc508-fe42-40a3-95d9-6a89c449… ┆ row_6    │
│ a7cf4709-5280-47b2-b881-77bfcd8f… ┆ row_7    │
│ be70e823-fcc9-4f7a-bb24-d29cc29d… ┆ row_8    │
│ e0ea8906-ad0a-4a9b-a13c-4cacc4f9… ┆ row_9    │
└───────────────────────────────────┴──────────┘

...but yeah, I don't think we should perhaps make the warning original dtype-dependent.

avimallu commented 1 year ago

The simplest solution is to use str() when generating your dataframe:

Wouldn't this also be the most efficient solution? I'm assuming that for large DataFrames, calling Python row-by-row would generate a large overhead.

lau-aab commented 1 year ago

There's of course workarounds possible (e.g. df.with_columns(pl.col("column_0").apply(lambda x: x if isinstance(x, str) else str(x), return_dtype=pl.Utf8))), but the point is that the suggestion just isn't really correct here. It seems to be that it's really hard to ensure these warnings will never produce false positives, so personally I would like to simply have a way to opt out of PolarsInefficientApplyWarnings altogether, either globally or for a specific apply.

The way I resolved our immediate issue (namely that our tests broke), was by ignoring this warning in pytest:

[tool.pytest.ini_options]
filterwarnings = [
    "error",
    "ignore::polars.exceptions.PolarsInefficientApplyWarning",
]

But IMO the usage of apply is simply valid here, and I think there should be a way to prevent triggering this warning.

lau-aab commented 1 year ago

The simplest solution is to use str() when generating your dataframe:

Wouldn't this also be the most efficient solution? I'm assuming that for large DataFrames, calling Python row-by-row would generate a large overhead.

As to using str() when creating the dataframe: this is indeed a solution for the simplified sample code, but is not practical for the code where we actually encountered this issue: we have a function convert_column_types(df) which is called from different places on different dataframes, and ensures that the columns in df have consistent types. If we have the convert the column before creation, this means we have to do that in all those places.

I imagine there are many more scenarios where it it impractical to convert the data before creating the dataframe.

alexander-beedie commented 1 year ago

the suggestion just isn't really correct here

In the sense that this is an inefficient use of polars it arguably is, but still... point taken ;)

I think we should continue with the warning here, but in conjunction with @orlp's suggestion to support .cast(pl.Utf8) for Object dtype data (likely via an internal call to str against the underlying PyObject).

I'd expect that the vast majority of the time the warning is valid, but we should still support this case so that the suggested (and more idiomatic) syntax is always valid. (It may also be that handling this internally is slightly more efficient; at any rate, it won't be any worse).

mcrumiller commented 1 year ago

@lau-aab you can filter out warnings by including this in your code:

import warnings
import polars as pl

warnings.filterwarnings(
    action="ignore",
    category=pl.exceptions.PolarsInefficientApplyWarning
)
sfrk commented 1 year ago

Observed this issue for map_elements as well; also for UUIDs. Should I file a separate issue? Guessing it's not a big surprise to anyone.

Running the code below:

transformed = (
    df.lazy()
    .with_columns(
        register=pl.struct(
            name=pl.col("register").map_elements(lambda r: r.name),
            id=pl.col("register").map_elements(lambda r: r.id, return_dtype=pl.UInt8),
        ),
        customer_id=pl.col("customer_id").map_elements(str), # <-- here's the problem
    )
    .collect()
)

I get:

Expr.map_elements is significantly slower than the native expressions API.
Only use if you absolutely CANNOT implement your logic otherwise.
In this case, you can replace your `map_elements` with the following:
  - pl.col("customer_id").map_elements(str)
  + pl.col("customer_id").cast(pl.Utf8)
MarcoGorelli commented 1 year ago

Should I file a separate issue?

apply has been renamed to map-elements, so this issue is fine :)