Open baggiponte opened 1 year ago
Could you make a PR?
With great pleasure!
Hey @baggiponte, I have a pandas DataFrame that I read from an xls file. When I convert it to a Polars DataFrame, I get the error you mentioned in your first comment. Here is my code:
df_pd = pd.read_excel(BytesIO(data))
dtype_mapping = {col: pl.String for col in df_pd.columns}
# Convert Pandas DataFrame to Polars DataFrame
df = pl.DataFrame(df_pd, schema=dtype_mapping)
Here is the error I am getting:
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
Cell In[9], line 8
7 try:
----> 8 data = getProfileSavePreview(getDataset(str(id)))
9 print("-------------------------------------------------------------------------------------")
Cell In[8], line 77, in getProfileSavePreview(dataset, max_rows, max_preview)
75 logger.info("here "+key)
---> 77 df, filesize = s3_to_df(key)
79 if not filesize:
Cell In[4], line 89, in s3_to_df(file_url, enc)
88 # Convert Pandas DataFrame to Polars DataFrame
---> 89 df = pl.DataFrame(df_pd, schema=dtype_mapping)
92 elif file_url.lower().endswith('.json'):
93 # Convert bytes to string and then create DataFrame
File ~/anaconda3/lib/python3.11/site-packages/polars/dataframe/frame.py:405, in DataFrame.__init__(self, data, schema, schema_overrides, orient, infer_schema_length, nan_to_null)
404 elif _check_for_pandas(data) and isinstance(data, pd.DataFrame):
--> 405 self._df = pandas_to_pydf(
406 data, schema=schema, schema_overrides=schema_overrides
407 )
409 elif not isinstance(data, Sized) and isinstance(data, (Generator, Iterable)):
File ~/anaconda3/lib/python3.11/site-packages/polars/utils/_construction.py:1829, in pandas_to_pydf(data, schema, schema_overrides, rechunk, nan_to_null, include_index)
1828 for col in data.columns:
-> 1829 arrow_dict[str(col)] = _pandas_series_to_arrow(
1830 data[col], nan_to_null=nan_to_null, length=length
1831 )
1833 arrow_table = pa.table(arrow_dict)
File ~/anaconda3/lib/python3.11/site-packages/polars/utils/_construction.py:662, in _pandas_series_to_arrow(values, length, nan_to_null)
661 return pa.nulls(length or len(values), pa.large_utf8())
--> 662 return pa.array(values, from_pandas=nan_to_null)
663 elif dtype:
File ~/anaconda3/lib/python3.11/site-packages/pyarrow/array.pxi:340, in pyarrow.lib.array()
File ~/anaconda3/lib/python3.11/site-packages/pyarrow/array.pxi:86, in pyarrow.lib._ndarray_to_array()
File ~/anaconda3/lib/python3.11/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()
ArrowInvalid: Could not convert '...' with type str: tried to convert to double
I have tried converting both with and without the schema, but it is not working.
If you could help me out, that would be great!
Hey there, thanks for pinging me. It's high time I addressed this. Can you provide just enough rows from the spreadsheet to make the issue reproducible? I don't have the data that made me file the issue in the first place.
Problem description

Problem

I am trying to cast a pandas.DataFrame into a pl.DataFrame, but the conversion fails (not a bug). The error message is the following:

ArrowInvalid: Could not convert '<value>' with type <type>: tried to convert to <type>

With plain pyarrow, e.g. pa.Table.from_pandas(data), the error would look like this:

ArrowInvalid: ("Could not convert '<value>' with type <type>: tried to convert to <type>", 'Conversion failed for column <COLUMN-NAME> with type <pandas-type>')
Request

The difference is that pyarrow reports the name of the column, which is helpful when debugging. I would like to ask whether you would consider adding similar behaviour to polars.

Extra details

I looked at the codebase and found out why the error messages differ, even though pyarrow is always used under the hood. The call stack when calling pl.from_pandas() is:

1. polars.dataframe.frame._from_pandas()
2. polars.utils._construction.pandas_to_pydf(), which iterates over every column in the pd.DataFrame and calls polars.utils._construction._pandas_series_to_arrow() on each.
3. _pandas_series_to_arrow basically calls pa.array() on the pd.Series: this is why the error is uninformative, because pyarrow is working with a bare array and has no column name to report.

Possible solution

I would say that the simplest solution would be to add a try/except statement at this line in pandas_to_pydf and raise a polars.exceptions.ArrowError that includes the column name.

Since it's a minor fix, I am available to contribute.