fix(pandas): make case work for non-RangeIndex dataframes

ibis-project / ibis

the portable Python dataframe library

https://ibis-project.org

Apache License 2.0


fix(pandas): make case work for non-RangeIndex dataframes #9083

Closed dlovell closed 2 weeks ago

dlovell commented 2 weeks ago

Description of changes

This PR makes PandasExecutor create the Series with an index that matches the incoming data. Currently, when the incoming data does not use a RangeIndex, the output index is a union of a RangeIndex and the incoming data index.
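To make the index union concrete, here is a minimal pandas-only sketch. It does not use the ibis internals: `positional` is a hypothetical stand-in for a result Series that was built without the source index, so it picks up a default RangeIndex, and `aligned` stands in for one built with the incoming index. The values are illustrative only.

```python
import pandas as pd

# Source column whose index labels are (0, 1, 2, 4), i.e. not a RangeIndex.
source = pd.Series([0, 1, 2, 1], index=[0, 1, 2, 4])

# Hypothetical result built without the source index: default RangeIndex 0..3.
positional = pd.Series(["unk", "one", "two", "one"])

# Hypothetical result built with the source index, as the description proposes.
aligned = pd.Series(["unk", "one", "two", "one"], index=source.index)

other = pd.Series(0, index=source.index)

# Building a frame from Series aligns on labels, so mismatched indexes are unioned.
combined = pd.DataFrame({"A": positional, "B": other})
fixed = pd.DataFrame({"A": aligned, "B": other})

print(len(combined))  # 5: labels 0, 1, 2, 3, 4
print(len(fixed))     # 4: labels 0, 1, 2, 4
```

In `combined`, label 3 exists only in `positional` and label 4 only in `other`, so each column gets one missing value and `B` is upcast to float.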

cpcloud commented 2 weeks ago

What's the use case that's enabled here that requires this change? What can you not do with the current codebase?

dlovell commented 2 weeks ago

> What's the use case that's enabled here that requires this change? What can you not do with the current codebase?

register a dataframe with non-range index and have cases work

dlovell commented 2 weeks ago

Alternatively, should a dataframe's index be sanitized on registration? Is there a specification of what must hold for dataframes that are registered?
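As a rough sketch of what sanitizing on registration could mean, the snippet below resets the index before the frame would be handed to the backend. `sanitize_index` is a hypothetical helper, not an ibis API, and this approach discards the original index labels, which may or may not be acceptable.

```python
import pandas as pd


def sanitize_index(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy of df with a fresh RangeIndex."""
    # drop=True discards the old labels instead of keeping them as a column.
    return df.reset_index(drop=True)


df = pd.DataFrame({
    "A": pd.Series({i: i % 3 for i in (0, 1, 2, 4)}),
    "B": 0,
})
clean = sanitize_index(df)

print(list(df.index))     # [0, 1, 2, 4]
print(list(clean.index))  # [0, 1, 2, 3]
```

The row order and values are unchanged; only the labels differ, so any positional result would line up with `clean` without a union.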

cpcloud commented 2 weeks ago

> register a dataframe with non-range index and have cases work

It would be good to have a less abstract example documented in this PR. Doesn't really even have to be code, just some description that helps justify why we should take on any additional code to the pandas backend.

dlovell commented 2 weeks ago

> > register a dataframe with non-range index and have cases work
>
> It would be good to have a less abstract example documented in this PR. Doesn't really even have to be code, just some description that helps justify why we should take on any additional code to the pandas backend.

Does this match what you're looking for?

When I run this code

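import ibis
import pandas as pd


def do_replace(col):
    # Map 1 -> "one" and 2 -> "two"; any other value falls back to "unk".
    return (
        col
        .cases(
            (
                (1, "one"),
                (2, "two"),
            ),
            default="unk",
        )
    )

# The index labels are (0, 1, 2, 4), so this frame does not use a RangeIndex.
df = pd.DataFrame({
    "A": pd.Series({i: i % 3 for i in (0, 1, 2, 4)}),
    "B": 0,
})
# Register the frame as table "t" in the pandas backend.
expr = ibis.pandas.connect({"t": df}).table("t")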

print("Input")
print(len(expr.execute()))
print(expr.execute())
print()

print("Current results")
x = expr.mutate(**{"A": lambda t: t["A"].pipe(do_replace)}).execute()
print(len(x))
print(x)
print()

I get these results

Input
4
   A  B
0  0  0
1  1  0
2  2  0
4  1  0

Current results
5
     A    B
0  unk  0.0
1  one  0.0
2  two  0.0
3  one  NaN
4  NaN  0.0

cpcloud commented 2 weeks ago

Heh, yeah, that does :)

I'll just keep suggesting that the pandas backend should probably be avoided.

I somewhat reluctantly will accept this PR, fully realizing that creating the pandas backend was probably a doomed idea from the start 😂