pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`from_pandas` incorrectly handles object columns in some cases #10968

Open Marmeladenbrot opened 1 year ago

Marmeladenbrot commented 1 year ago

Problem description

I use polars 0.19.1 and Python 3.11.5.

I have a script that takes a query, loads the result into a pandas dataframe, converts it to a polars dataframe, and then checks whether any column contains a specific value, returning those columns for data quality checks.

In this case I load data via SQL query into a pandas dataframe (library only supports an "export_to_pandas" function).

df = c.export_to_pandas('''SELECT * FROM t''')

This dataframe has a column A which contains numbers at the beginning but also values like "D - 12345" further down. In the pandas dataframe this column has type "object".

When I try to convert this dataframe to polars via df = pl.from_pandas(df)

this results in a

ArrowInvalid: Could not convert '58708' with type str: tried to convert to int64

This apparently happens because polars scans the first N rows, sees only numbers and some NULL values, and decides that this column is int64. It then fails because "D - 12345" is not an int64; the column should stay "object" / "Utf8".

I don't see any way in https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.from_pandas.html to use something like "infer_schema_length", as described in https://github.com/pola-rs/polars/issues/4489 (which apparently did not cover the from_pandas function), and set it to the maximum. I can't hardcode a schema because the query changes every time.
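Until the conversion is fixed, one possible workaround on the pandas side is to turn the non-null values of mixed-type object columns into strings before handing the frame over. The following is only a sketch: the helper name `stringify_mixed_object_columns` is made up for illustration, it assumes unique column names and scalar cell values, and it has not been checked against polars 0.19.1.

```python
import pandas as pd


def stringify_mixed_object_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which mixed-type object columns hold only str or null.

    Hypothetical helper, not part of pandas or polars.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if col.dtype != object:
            continue  # typed columns are left untouched
        # Collect the Python types of all non-null values in the column.
        kinds = {type(v) for v in col.dropna()}
        if len(kinds) > 1:
            # Keep nulls as they are, replace everything else by its str() form.
            out[name] = col.where(col.isna(), col.astype(str))
    return out


if __name__ == "__main__":
    df = pd.DataFrame({"A": [58708, None, "D - 12345"], "B": [1, 2, 3]})
    print(stringify_mixed_object_columns(df)["A"].tolist())
```

The returned frame would then be passed to `pl.from_pandas` as before. Note that this changes the data: an integer `58708` and a string `"58708"` become indistinguishable afterwards.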

ritchie46 commented 1 year ago

Pandas Series are typed and if the type is object we should respect that. We are not inferring object columns.

I don't understand why we try to read that column as integer. 🤔

So, I do want to fix the bug, but don't want to add schema inference on typed series.

stinodego commented 1 year ago

@Marmeladenbrot could you provide a reproducible example of this error? I tried the following and couldn't replicate it:

import pandas as pd
import polars as pl
from polars.testing import assert_frame_equal

data = {"a": ["1", "2", "3"]}
df = pd.DataFrame(data)

result = pl.from_pandas(df)

expected = pl.DataFrame(data)
assert_frame_equal(result, expected)  # passes

orlp commented 1 year ago

@stinodego

I can reproduce with the following

import pandas as pd
import polars as pl

df = pd.DataFrame({"a": [1, "x"]})
pl.from_pandas(df)

romanovacca commented 10 months ago

This error occurs in the _pandas_series_to_arrow function.

  dtype = getattr(values, "dtype", None)
  if dtype == "object":
      first_non_none = _get_first_non_none(values.values)  # type: ignore[arg-type]
      if isinstance(first_non_none, str):
          return pa.array(values, pa.large_utf8(), from_pandas=nan_to_null)
      elif first_non_none is None:
          return pa.nulls(length or len(values), pa.large_utf8())
      return pa.array(values, from_pandas=nan_to_null)

As you can see with the MRE from @orlp, we skip the if because the first non-None value is an int, and the elif is skipped as well.

The final return runs pa.array(values, from_pandas=nan_to_null), which fails with the error pyarrow.lib.ArrowInvalid: Could not convert 'x' with type str: tried to convert to int64.
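To make the branch selection concrete, here is a small pure-Python trace of the dispatch described above. It is a sketch, not the polars source: `first_non_none` and `chosen_branch` are stand-ins written for this illustration, and the real `_get_first_non_none` may treat values such as NaN differently.

```python
from typing import Any, Sequence


def first_non_none(values: Sequence[Any]) -> Any:
    """Return the first element that is not None, or None if there is none."""
    return next((v for v in values if v is not None), None)


def chosen_branch(values: Sequence[Any]) -> str:
    """Name the branch the quoted snippet would take for these values."""
    first = first_non_none(values)
    if isinstance(first, str):
        return "explicit large_utf8"
    if first is None:
        return "all-null large_utf8"
    # Only the first non-None value was inspected, so later values of a
    # different type are left for pyarrow's own inference to trip over.
    return "pyarrow type inference"


print(chosen_branch(["1", "2", "3"]))  # explicit large_utf8
print(chosen_branch([None, None]))     # all-null large_utf8
print(chosen_branch([1, "x"]))         # pyarrow type inference
```

The first call corresponds to @stinodego's example, which is why it did not reproduce the error; the last one corresponds to @orlp's.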

So we have to find a way to handle multiple datatypes within a series. We could cast the series to string before passing it to pyarrow as a final condition, but that will have some implications, since the default behaviour now is to assume all values share the same datatype.
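A check for mixed types could look roughly like the sketch below. It is plain Python with hypothetical names (`non_null_type_names`, `is_mixed`) and is not a proposal for the actual implementation. It also shows one of the implications of a string cast: distinct values can collapse into the same string.

```python
from typing import Any, Iterable, Set


def non_null_type_names(values: Iterable[Any]) -> Set[str]:
    """Names of the Python types found among the non-None values."""
    return {type(v).__name__ for v in values if v is not None}


def is_mixed(values: Iterable[Any]) -> bool:
    """True if the non-None values do not all share one exact type."""
    return len(non_null_type_names(values)) > 1


print(sorted(non_null_type_names([1, None, "x"])))  # ['int', 'str']
print(is_mixed([1, None, "x"]))                     # True
print(is_mixed([1, 2, None]))                       # False

# Implication of casting to string: 1 and "1" are no longer distinguishable.
print(str(1) == "1")                                # True
```

Unlike looking at the first non-None value only, this walks the whole series, so it costs a full pass over the data.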

Marmeladenbrot commented 10 months ago

"So we have to find a solution of handling multiple datatypes within a series."

But the column already has a datatype in pandas, which is what polars is reading from. In the pandas dataframe this column has type "object".

bionicles commented 3 weeks ago

hi all, here's a repro of a similar issue I'm facing now.

TLDR: if you make a polars dataframe with a column of lambdas, it works fine, but if you do a round trip to pandas and back, it crashes on the way back.

The reason I need to do a round trip to pandas and back, is the pandas DataFrame.sample method accepts a "weights" parameter whereas the polars equivalent does not. If not for the inability to perform weighted sampling of rows from Polars DataFrames, then I would not need to perform this conversion.

In this case, no inference would be needed, only "not crashing."

Please let me know if I ought to open a separate issue.

import pandas as pd
import polars as pl

def build_data() -> dict[str, tuple]:
    data = {"index": (1, 2), "fn": (lambda x: x + 1, lambda x: x / 2)}
    return data

def build_pd_df() -> pd.DataFrame:
    data = build_data()
    pd_df = pd.DataFrame(data)
    return pd_df

def pl_from_pd(pd_df: pd.DataFrame) -> pl.DataFrame:
    return pl.DataFrame(pd_df)

def build_pl_df() -> pl.DataFrame:
    data = build_data()
    pl_df = pl.DataFrame(data)
    return pl_df

def test_build_pl_df_directly():
    pl_df = build_pl_df()
    print(pl_df)
    print("test_build_pl_df_directly is Ok!")

def test_pl_from_pd():
    pd_df = build_pd_df()
    try:
        pl_df = pl_from_pd(pd_df)
    except Exception as e:
        print(f"Err({e})")
        raise e
    print(pl_df)
    print("Ok!")

print("=== TEST DIRECT BUILDING DF WITH COLUMN OF LAMBDAS ===")
test_build_pl_df_directly()
print("=== TEST CONVERSION OF SAME DATA FROM PANDAS ===")
test_pl_from_pd()

result:

=== TEST DIRECT BUILDING DF WITH COLUMN OF LAMBDAS ===
shape: (2, 2)
┌───────┬─────────────────────────────────┐
│ index ┆ fn                              │
│ ---   ┆ ---                             │
│ i64   ┆ object                          │
╞═══════╪═════════════════════════════════╡
│ 1     ┆ <function build_data.<locals>.… │
│ 2     ┆ <function build_data.<locals>.… │
└───────┴─────────────────────────────────┘
test_build_pl_df_directly is Ok!
=== TEST CONVERSION OF SAME DATA FROM PANDAS ===
Err(Could not convert <function build_data.<locals>.<lambda> at 0x7f95a567ba60> with type function: did not recognize Python value type when inferring an Arrow data type
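On the weighted-sampling motivation mentioned in this comment: the row indices could in principle be drawn without a pandas round trip. The sketch below uses only the standard library and the Efraimidis-Spirakis key method for weighted sampling without replacement. The function name `weighted_sample_indices` is made up for illustration, and the result is not guaranteed to match what pandas `DataFrame.sample(weights=...)` returns for the same seed.

```python
import random
from typing import List, Optional, Sequence


def weighted_sample_indices(
    weights: Sequence[float], k: int, seed: Optional[int] = None
) -> List[int]:
    """Draw k distinct row indices, favouring rows with larger weights.

    Each row with a positive weight w gets the key u ** (1 / w) for a
    uniform random u; the k rows with the largest keys are selected.
    Rows with weight 0 are never selected.
    """
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    rng = random.Random(seed)
    keyed = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights) if w > 0]
    if k > len(keyed):
        raise ValueError("k exceeds the number of rows with positive weight")
    keyed.sort(reverse=True)
    return [i for _, i in keyed[:k]]


indices = weighted_sample_indices([0.5, 0.0, 2.0, 1.5], k=2, seed=0)
print(indices)
```

The indices would then be used to select rows on the polars side. Whether that selection works for object columns such as the lambda column above is not something this thread establishes.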