pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`ArrowInvalid` error message: add column name when conversion with `pl.from_pandas()` fails #8546

Open baggiponte opened 1 year ago

baggiponte commented 1 year ago

Problem description

Problem: I am trying to convert a pandas.DataFrame into a pl.DataFrame, but the conversion fails (this is expected, not a bug). The error message is the following: ArrowInvalid: Could not convert '<value>' with type <type>: tried to convert to <type>

With plain pyarrow, e.g. pa.Table.from_pandas(data), the error would look like this: ArrowInvalid: ("Could not convert '<value>' with type <type>: tried to convert to <type>", 'Conversion failed for column <COLUMN-NAME> with type <pandas-type>')

Request: The difference is that pyarrow reports the name of the column, which helps with debugging. Would you consider adding similar behaviour to polars?

Extra details: I looked at the codebase and found why the error messages differ, even though pyarrow is used under the hood in both cases.

The call stack when calling pl.from_pandas() is:

  1. polars.dataframe.frame._from_pandas()
  2. polars.utils._construction.pandas_to_pydf(), which iterates over every column in the pd.DataFrame and calls polars.utils._construction._pandas_series_to_arrow() on each.
  3. _pandas_series_to_arrow basically calls pa.array() on the pd.Series: this is why the error is uninformative, because pyarrow is working with a single array rather than a named table column.

Possible solution: The simplest fix would be to add a try/except block at this line in pandas_to_pydf and raise a polars.exceptions.ArrowError that includes the column name.

    for col in data.columns:
        try:
            arrow_dict[str(col)] = _pandas_series_to_arrow(
                data[col], nan_to_null=nan_to_null, length=length
            )
        except pa.ArrowInvalid as e:
            raise pl.ArrowError(f"Failed to convert column {col!r} with dtype '{data[col].dtype}'") from e
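
The same pattern can be tried outside polars. Below is a minimal, standard-library-only sketch: `to_doubles` is a hypothetical stand-in for `pa.array()` (it sees only values, never a column name), and `ColumnConversionError` and `convert_columns` are illustrative names, not polars or pyarrow APIs.

```python
class ColumnConversionError(Exception):
    """Hypothetical error type standing in for the polars exception proposed above."""


def to_doubles(values):
    """Stand-in for pa.array(): receives bare values, so it cannot name the column."""
    out = []
    for v in values:
        if isinstance(v, str):
            raise ValueError(
                f"Could not convert {v!r} with type str: tried to convert to double"
            )
        out.append(float(v))
    return out


def convert_columns(columns, convert=to_doubles):
    """Convert each column, re-raising failures with the column name attached."""
    converted = {}
    for name, values in columns.items():
        try:
            converted[str(name)] = convert(values)
        except (ValueError, TypeError) as e:
            # The caller is the only place that knows the name, so add it here
            # and chain the original error with "from e" to keep its details.
            raise ColumnConversionError(
                f"Failed to convert column {name!r}: {e}"
            ) from e
    return converted
```

Chaining with `from e` keeps the low-level message available on `__cause__`, so nothing from the original error is lost.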

Since it's a minor fix, I am available to contribute.

ritchie46 commented 1 year ago

Could you make a PR?

baggiponte commented 1 year ago

with great pleasure!

HarshilRami commented 4 months ago

Hey @baggiponte, I have a pandas DataFrame that I read from an xls file. When I convert it to a polars DataFrame, I get the error you mentioned in your first comment. Here is my code:

from io import BytesIO

import pandas as pd
import polars as pl

# `data` holds the raw bytes of the spreadsheet
df_pd = pd.read_excel(BytesIO(data))
dtype_mapping = {col: pl.String for col in df_pd.columns}
# Convert Pandas DataFrame to Polars DataFrame
df = pl.DataFrame(df_pd, schema=dtype_mapping)

Here is the error I am getting:
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[9], line 8
      7 try:
----> 8     data = getProfileSavePreview(getDataset(str(id)))
      9     print("-------------------------------------------------------------------------------------")

Cell In[8], line 77, in getProfileSavePreview(dataset, max_rows, max_preview)
     75     logger.info("here "+key)
---> 77     df, filesize = s3_to_df(key)
     79 if not filesize:

Cell In[4], line 89, in s3_to_df(file_url, enc)
     88     # Convert Pandas DataFrame to Polars DataFrame
---> 89     df = pl.DataFrame(df_pd, schema=dtype_mapping)
     92 elif file_url.lower().endswith('.json'):
     93     # Convert bytes to string and then create DataFrame

File ~/anaconda3/lib/python3.11/site-packages/polars/dataframe/frame.py:405, in DataFrame.__init__(self, data, schema, schema_overrides, orient, infer_schema_length, nan_to_null)
    404 elif _check_for_pandas(data) and isinstance(data, pd.DataFrame):
--> 405     self._df = pandas_to_pydf(
    406         data, schema=schema, schema_overrides=schema_overrides
    407     )
    409 elif not isinstance(data, Sized) and isinstance(data, (Generator, Iterable)):

File ~/anaconda3/lib/python3.11/site-packages/polars/utils/_construction.py:1829, in pandas_to_pydf(data, schema, schema_overrides, rechunk, nan_to_null, include_index)
   1828 for col in data.columns:
-> 1829     arrow_dict[str(col)] = _pandas_series_to_arrow(
   1830         data[col], nan_to_null=nan_to_null, length=length
   1831     )
   1833 arrow_table = pa.table(arrow_dict)

File ~/anaconda3/lib/python3.11/site-packages/polars/utils/_construction.py:662, in _pandas_series_to_arrow(values, length, nan_to_null)
    661         return pa.nulls(length or len(values), pa.large_utf8())
--> 662     return pa.array(values, from_pandas=nan_to_null)
    663 elif dtype:

File ~/anaconda3/lib/python3.11/site-packages/pyarrow/array.pxi:340, in pyarrow.lib.array()

File ~/anaconda3/lib/python3.11/site-packages/pyarrow/array.pxi:86, in pyarrow.lib._ndarray_to_array()

File ~/anaconda3/lib/python3.11/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowInvalid: Could not convert '...' with type str: tried to convert to double

I have tried converting both with and without a schema, but neither works.

If you can help me out, that would be great!

baggiponte commented 4 months ago

hey there, thanks for pinging me. It's high time I address this. Can you provide just enough rows from the spreadsheet to make the issue reproducible? I don't have the data that made me file the issue in the first place.
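
Until the error message names the column, one way to narrow things down is to look for columns that mix several Python types, since a message like "Could not convert '...' with type str: tried to convert to double" suggests strings sitting among floats. The sketch below is only a heuristic under that assumption; `mixed_type_columns` is a hypothetical helper, not part of polars, pandas, or pyarrow.

```python
def mixed_type_columns(columns):
    """Return {column name: sorted type names} for columns holding more than one type."""
    report = {}
    for name, values in columns.items():
        kinds = set()
        for v in values:
            # Skip missing values: None, and NaN (the only float not equal to itself).
            if v is None or (isinstance(v, float) and v != v):
                continue
            kinds.add(type(v).__name__)
        if len(kinds) > 1:
            report[str(name)] = sorted(kinds)
    return report
```

With a pandas frame, the input could be built as `{c: df_pd[c].tolist() for c in df_pd.columns}`. Flagged columns are candidates only: a mix of int and float is reported too, even though it usually converts without trouble.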