ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.24k stars 591 forks source link

bug: execution of struct literals in the pandas and dask backends returns the wrong value #8687

Closed NickCrews closed 7 months ago

NickCrews commented 7 months ago

What happened?

import pandas as pd
from ibis.common.collections import frozendict
pd.Series([frozendict({"a": 5, "b": 6})])
# 0    (a, b)
# dtype: object

This is causing casting bugs in the pandas and dask backends:

import ibis

con = ibis.pandas.connect()

s = ibis.struct({"a": 1, "b": 2})
s2 = s.cast("struct<a: float64, b: float64>")
con.execute(s2)
# {'a': 'a'}

found this in https://github.com/ibis-project/ibis/pull/8666

I think we want to make it so our builtin frozendict looks exactly like a dict to pandas. Then we get the right behavior here

What version of ibis are you using?

main

What backend(s) are you using, if any?

No response

Relevant log output

No response

Code of Conduct

NickCrews commented 7 months ago

hmmm, actually maybe this is just a repr thing?? pd.Series([frozendict({"a": 5, "b": 6})])[0] actually shows the right thing...

EDIT: while the repr is confusing, the actual result from .execute() is also wrong.

NickCrews commented 7 months ago

I added a workaround at https://github.com/ibis-project/ibis/pull/8666/files#diff-7760dac05d92d4bd73e3b729a90b7002cc3920f03abd1dac5fc7ec18154f9fb2R87-R93. Would love someone's help with debugging and then fixing this at the root of the problem, it's too deep in ibis for me.

jcrist commented 7 months ago

hmmm, actually maybe this is just a repr thing??

Yeah, this is just a repr thing - if you make a custom collections.abc.Mapping and use it like you have, you'll see the same thing. My guess is pandas has a custom formatter for any non-builtin sequence, and that's what you're seeing here.

NickCrews commented 7 months ago

Ah, not specific enough, in my first examples its just a repr thing. But in the 2nd example above with the .execute(), it is a real problem that needs to get fixed in ibis.

cpcloud commented 7 months ago

it is a real problem that needs to get fixed in ibis.

Indeed, and PRs welcome. The pandas and dask backends are very low priority right now. Are you using them for something in particular?

jcrist commented 7 months ago

I've pushed up a PR in #8693 that makes frozendict a subclass of dict. This fixes both the repr-ing issue and the struct execution issue (it's also simpler and fixed a bug in the existing code). Should be able to revert the convert_Struct changes you added in #8666 once that's merged.

NickCrews commented 7 months ago

@cpcloud I see you changed the issue title to "subset of the struct". But just to make sure you see the difference, the current output is {"a": "a"}, not {"a": 5}

NickCrews commented 7 months ago

The pandas and dask backends are very low priority right now. Are you using them for something in particular?

No not at all. I just thought it would be better to get the the tests passing instead of marking them as xfail. If we want to punt on this, then I would think it would be better to just error on this cast, instead of what we do now which is silently return the wrong result.