Open Marmeladenbrot opened 1 year ago
Pandas Series are typed and if the type is object we should respect that. We are not inferring object columns.
I don't understand why we try to read that column as integer. 🤔
So, I do want to fix the bug, but don't want to add schema inference on typed series.
@Marmeladenbrot could you provide a reproducible example of this error? I tried the following and couldn't replicate it:
import pandas as pd
import polars as pl
from polars.testing import assert_frame_equal
data = {"a": ["1", "2", "3"]}
df = pd.DataFrame(data)
result = pl.from_pandas(df)
expected = pl.DataFrame(data)
assert_frame_equal(result, expected) # passes
@stinodego
I can reproduce with the following
import pandas as pd
import polars as pl
df = pd.DataFrame({"a": [1, "x"]})
pl.from_pandas(df)
This error occurs in the _pandas_series_to_arrow function.
dtype = getattr(values, "dtype", None)
if dtype == "object":
first_non_none = _get_first_non_none(values.values) # type: ignore[arg-type]
if isinstance(first_non_none, str):
return pa.array(values, pa.large_utf8(), from_pandas=nan_to_null)
elif first_non_none is None:
return pa.nulls(length or len(values), pa.large_utf8())
return pa.array(values, from_pandas=nan_to_null)
As you can see with the MRE from @orlp, the if branch is skipped because the first non-None value is an int, and the elif branch is skipped as well. The final return then runs pa.array(values, from_pandas=nan_to_null), which fails with the error pyarrow.lib.ArrowInvalid: Could not convert 'x' with type str: tried to convert to int64.
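To make the fallthrough concrete, here is a minimal stand-alone sketch of that dispatch. Note that get_first_non_none and classify are simplified, illustrative stand-ins for the internal helpers, not the actual polars code:

```python
# Simplified stand-in for the dispatch in _pandas_series_to_arrow
# (illustrative names, not the real polars internals).
def get_first_non_none(values):
    # Return the first element that is not None, or None if all are None.
    return next((v for v in values if v is not None), None)

def classify(values):
    first = get_first_non_none(values)
    if isinstance(first, str):
        return "large_utf8"  # string fast path taken
    elif first is None:
        return "all_null"    # column contains only Nones
    return "infer"           # falls through to pyarrow type inference

# [1, "x"]: the first non-None value is an int, so both branches are
# skipped and pyarrow tries (and fails) to build an int64 array.
print(classify([1, "x"]))   # infer
print(classify(["x", 1]))   # large_utf8 -- element order changes the outcome
```

Note that the dispatch only looks at the first non-None value, which is why the same data in a different order takes a different path.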
So we have to find a solution of handling multiple datatypes within a series. We could cast the series to string before passing it to pyarrow as a final condition, but that will have some implications, since the default behaviour now is to assume it's all the same datatype.
But the column already has a datatype in pandas, which is what polars is reading from?
In the pandas dataframe this column has type "object".
hi all, here's a repro of a similar issue I'm facing now.
TLDR: if you make a polars dataframe with a column of lambdas, it works fine, but if you do a round trip to pandas and back, it crashes on the way back.
The reason I need to do a round trip to pandas and back is that the pandas DataFrame.sample
method accepts a "weights" parameter, whereas the polars equivalent does not. If polars supported weighted sampling of rows, I would not need this conversion at all.
In this case, no inference would be needed, only "not crashing."
Please let me know if I ought to open a separate issue.
import pandas as pd
import polars as pl
def build_data() -> dict[str, tuple]:
data = {"index": (1, 2), "fn": (lambda x: x + 1, lambda x: x / 2)}
return data
def build_pd_df() -> pd.DataFrame:
data = build_data()
pd_df = pd.DataFrame(data)
return pd_df
def pl_from_pd(pd_df: pd.DataFrame) -> pl.DataFrame:
return pl.DataFrame(pd_df)
def build_pl_df() -> pl.DataFrame:
data = build_data()
pl_df = pl.DataFrame(data)
return pl_df
def test_build_pl_df_directly():
pl_df = build_pl_df()
print(pl_df)
print("test_build_pl_df_directly is Ok!")
def test_pl_from_pd():
pd_df = build_pd_df()
try:
pl_df = pl_from_pd(pd_df)
except Exception as e:
print(f"Err({e})")
raise e
print(pl_df)
print("Ok!")
print("=== TEST DIRECT BUILDING DF WITH COLUMN OF LAMBDAS ===")
test_build_pl_df_directly()
print("=== TEST CONVERSION OF SAME DATA FROM PANDAS ===")
test_pl_from_pd()
result:
=== TEST DIRECT BUILDING DF WITH COLUMN OF LAMBDAS ===
shape: (2, 2)
┌───────┬─────────────────────────────────┐
│ index ┆ fn │
│ --- ┆ --- │
│ i64 ┆ object │
╞═══════╪═════════════════════════════════╡
│ 1 ┆ <function build_data.<locals>.… │
│ 2 ┆ <function build_data.<locals>.… │
└───────┴─────────────────────────────────┘
test_build_pl_df_directly is Ok!
=== TEST CONVERSION OF SAME DATA FROM PANDAS ===
Err(Could not convert <function build_data.<locals>.<lambda> at 0x7f95a567ba60> with type function: did not recognize Python value type when inferring an Arrow data type
traceback:
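As an aside, the pandas round trip for weighted sampling can be sidestepped: weighted row indices can be drawn with the standard library and then used to take rows from the polars frame directly. A minimal sketch (weighted_sample_indices is a hypothetical helper, not a polars API):

```python
import random

def weighted_sample_indices(n_rows, weights, k, seed=None):
    # Draw k row indices (with replacement) proportional to the given weights.
    rng = random.Random(seed)
    return rng.choices(range(n_rows), weights=weights, k=k)

idx = weighted_sample_indices(n_rows=4, weights=[0.7, 0.1, 0.1, 0.1], k=2, seed=0)
print(all(0 <= i < 4 for i in idx), len(idx))  # True 2
```

The resulting indices can then be used to select rows from the polars frame, so the lambda column never has to survive a pandas conversion.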
Problem description
I use polars 0.19.1 and Python 3.11.5.
I have a script that takes a query, loads the result into a pandas dataframe, converts it to a polars dataframe, and then checks whether any column contains a specific value and returns those columns for data quality checks.
In this case I load data via SQL query into a pandas dataframe (library only supports an "export_to_pandas" function).
df = c.export_to_pandas('''SELECT * FROM t''')
This dataframe has a column A which contains numbers at the beginning but also values like "D - 12345". In the pandas dataframe this column has type "object". When I try to convert this dataframe to polars via
df = pl.from_pandas(df)
this results in an error, because apparently polars scans the first N rows, sees only numbers and some NULL values, and decides that this column is int64. It then fails because "D - 12345" is not an int64, and the column should stay "object" / "Utf8".
I don't see any way in https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.from_pandas.html to use something like "infer_schema_length" as mentioned in https://github.com/pola-rs/polars/issues/4489 (but it seems that did not cover the from_pandas function) and set it to max, because the query changes every time and I can't hardcode anything.