pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Object Series with string elements converts to pyarrow FixedSizeBinaryArray, not LargeStringArray #15085

Open Wainberg opened 8 months ago

Wainberg commented 8 months ago

Reproducible example

>>> pl.Series(['foo', 'bar']).to_arrow()  # ok
<pyarrow.lib.LargeStringArray object at 0x7f6bf91ec2e0>
[
  "foo",
  "bar"
]
>>> pl.Series([b'foo', b'bar'], dtype=pl.Object).to_arrow()  # ok
<pyarrow.lib.FixedSizeBinaryArray object at 0x7f6bf91ec2e0>
[
  805086F96B7F0000,
  0058C4EF6B7F0000
]
>>> pl.Series(['foo', 'bar'], dtype=pl.Object).to_arrow()  # ??
<pyarrow.lib.FixedSizeBinaryArray object at 0x7f6bf91ec7c0>
[
  106086F96B7F0000,
  104B86F96B7F0000
]

Log output

No response

Issue description

pl.Series(['foo', 'bar'], dtype=pl.Object).to_arrow() returns a FixedSizeBinaryArray.
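
One way to read the hex values in the output above is as little-endian 64-bit integers. The sketch below does that with the standard library only. The helper name `decode_pointer` is hypothetical, and the pointer interpretation is an assumption based on the printed values, not something confirmed in this thread.

```python
def decode_pointer(hex_value: str) -> int:
    """Interpret a 16-hex-digit FixedSizeBinary element as a little-endian 64-bit integer."""
    raw = bytes.fromhex(hex_value)
    return int.from_bytes(raw, byteorder="little")


# The two elements printed for pl.Series(['foo', 'bar'], dtype=pl.Object).to_arrow()
elements = ["106086F96B7F0000", "104B86F96B7F0000"]

for element in elements:
    # Print each element next to its decoded integer value in hex.
    print(element, "->", hex(decode_pointer(element)))
```

The decoded values start with `0x7f6bf9`, the same prefix as the array's own address in the repr (`0x7f6bf91ec7c0`). That suggests, without proving it, that the array holds memory addresses of the Python objects rather than the characters of `'foo'` and `'bar'`.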

Expected behavior

It should convert to LargeStringArray, not FixedSizeBinaryArray.

Installed versions

```
--------Version info---------
Polars:               0.20.9
Index type:           UInt32
Platform:             Linux-4.4.0-22621-Microsoft-x86_64-with-glibc2.35
Python:               3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fsspec:
gevent:
hvplot:
matplotlib:           3.8.3
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               2.2.0
pyarrow:              14.0.2
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:             0.8.1
xlsxwriter:           3.1.9
```

ritchie46 commented 8 months ago

They are objects. They are opaque, so converting them to Arrow should not succeed. Arrow doesn't support objects. We should raise an error here.