pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Pandas interoperability: Inconsistencies with lists #19809

Open raayu83 opened 4 hours ago

raayu83 commented 4 hours ago

Reproducible example

from io import BytesIO, StringIO

import pandas as pd
import polars as pl

my_json = StringIO('[{ "mylist": ["a", "b", "c"] }]')

df_pl = pl.read_json(my_json).to_pandas()
df_pd = pd.read_json(my_json)

output_pl = BytesIO()
output_pd = BytesIO()

df_pl.to_csv(output_pl)
df_pd.to_csv(output_pd)

print(output_pl.getvalue())
print(output_pd.getvalue())

Result:

```
b",mylist\r\n0,['a' 'b' 'c']\r\n"
b',mylist\r\n0,"[\'a\', \'b\', \'c\']"\r\n'
```

Log output

No response

Issue description

This issue can lead to errors if you switch from pandas to polars and pass the data on to other code using df.to_pandas(). The DataFrame returned by df.to_pandas() is in a different format than one created by pandas directly. When calling df.to_csv(), you can see that the commas separating the list elements are missing in the polars version.

I'm not 100% sure whether this is a bug or intentional by design. But switching from pandas to polars would be easier if the output after to_pandas was the same as the original pandas df.

In my case, I had to write some additional logic to replicate the behavior of pandas.

Expected behavior

df.to_pandas() returns a df identical to the one you would get by using pandas all along.

Installed versions

```
--------Version info---------
Polars: 1.13.1
Index type: UInt32
Platform: Windows-10-10.0.22631-SP0
Python: 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)]
LTS CPU: False

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle
connectorx
deltalake
fastexcel 0.12.0
fsspec
gevent
great_tables
matplotlib 3.9.2
nest_asyncio
numpy 1.26.4
openpyxl
pandas 2.2.3
pyarrow 18.0.0
pydantic
pyiceberg
sqlalchemy 2.0.36
torch
xlsx2csv
xlsxwriter 3.2.0
None
```
cmdlineluser commented 4 hours ago

It may be easier to see the difference with .to_dict() instead of csv.

my_json = StringIO('[{ "mylist": ["a", "b", "c"] }]')
pl.read_json(my_json).to_pandas().to_dict()

# {'mylist': {0: array(['a', 'b', 'c'], dtype=object)}}
my_json = StringIO('[{ "mylist": ["a", "b", "c"] }]')
pd.read_json(my_json).to_dict()

# {'mylist': {0: ['a', 'b', 'c']}}

The difference seems to be that you get a NumPy array from .to_pandas(), whereas pd.read_json gives you a plain Python list.
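
If that is the cause, one possible workaround is to turn the arrays back into Python lists after calling .to_pandas(). The snippet below is only a rough sketch: `arrays_to_lists` is a made-up helper name, not part of polars or pandas, and the input frame is built by hand as a stand-in for what `pl.read_json(...).to_pandas()` returned above, so it may not cover nested lists or other dtypes.

```python
import numpy as np
import pandas as pd


def arrays_to_lists(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy of df in which NumPy arrays
    stored in object columns are replaced by plain Python lists."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object:
            out[col] = out[col].map(
                lambda v: v.tolist() if isinstance(v, np.ndarray) else v
            )
    return out


# Stand-in (assumption) for the frame returned by .to_pandas():
# one object column whose single cell holds a NumPy array.
cells = np.empty(1, dtype=object)
cells[0] = np.array(["a", "b", "c"], dtype=object)
df_from_polars = pd.DataFrame({"mylist": cells})

fixed = arrays_to_lists(df_from_polars)
csv_text = fixed.to_csv()

print(fixed.to_dict())
print(csv_text)
```

With this, the list cell is written to CSV with comma separators again, as in the pure pandas case. Whether .to_pandas() should do this by default is a separate question for the maintainers.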