ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0

bug: pandas class SingleBlockManager not defined when declaring pandas scalar udf input type as pd.Series #9069

Closed ted0928 closed 2 weeks ago

ted0928 commented 2 weeks ago

What happened?

I'm not sure whether this is a bug or a usage issue. Here's my code:

import ibis
import pandas as pd
from pyspark.sql import SparkSession
ibis.options.interactive = True
session = SparkSession.builder.getOrCreate()
connect = ibis.pyspark.connect(session)
source = connect.create_view('source', ibis.memtable(dict(id=[1,2,3], name=['a', 'b', 'c'])))

@ibis.udf.scalar.pandas
def add_one(x:int)->int:
    return x + 1

df = source.mutate(id2=add_one(source.id))

This works. But at runtime, x is actually a pd.Series.

When I declare x as pd.Series, an exception is raised while the type hint is being resolved:

@ibis.udf.scalar.pandas
def add_one(x:pd.Series)->int:
    return x + 1

Traceback (most recent call last):
  File "/Users/ning.ln/Java/ibis/spark_test.py", line 8, in <module>
    @ibis.udf.scalar.pandas
     ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ning.ln/Java/ibis/ibis/expr/operations/udf.py", line 384, in pandas
    return _wrap(
           ^^^^^^
  File "/Users/ning.ln/Java/ibis/ibis/expr/operations/udf.py", line 83, in _wrap
    return wrap(fn) if fn is not None else wrap
           ^^^^^^^^
  File "/Users/ning.ln/Java/ibis/ibis/expr/operations/udf.py", line 80, in wrap
    deferrable(wrapper(input_type, fn, **kwargs)), fn
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ning.ln/Java/ibis/ibis/expr/operations/udf.py", line 154, in _make_wrapper
    node = cls._make_node(fn, input_type, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ning.ln/Java/ibis/ibis/expr/operations/udf.py", line 115, in _make_node
    fields = {
             ^
  File "/Users/ning.ln/Java/ibis/ibis/expr/operations/udf.py", line 117, in <dictcomp>
    pattern=rlz.ValueOf(annotations.get(arg_name)),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ning.ln/Java/ibis/ibis/common/bases.py", line 72, in __call__
    return cls.__create__(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ning.ln/Java/ibis/ibis/common/grounds.py", line 119, in __create__
    kwargs = cls.__signature__.validate(cls, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ning.ln/Java/ibis/ibis/common/annotations.py", line 490, in validate
    result = pattern.match(value, this)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ning.ln/Java/ibis/ibis/common/patterns.py", line 570, in match
    return self.pattern.match(value, context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ning.ln/Java/ibis/ibis/common/patterns.py", line 792, in match
    value = self.func(value)
            ^^^^^^^^^^^^^^^^
  File "/Users/ning.ln/Java/ibis/ibis/expr/datatypes/core.py", line 127, in __coerce__
    return dtype(value)
           ^^^^^^^^^^^^
  File "/Users/ning.ln/Java/ibis/ibis/common/dispatch.py", line 140, in call
    return impl(arg, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ning.ln/Java/ibis/ibis/expr/datatypes/core.py", line 62, in dtype
    return DataType.from_typehint(value)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ning.ln/Java/ibis/ibis/expr/datatypes/core.py", line 203, in from_typehint
    elif annots := get_type_hints(typ):
                   ^^^^^^^^^^^^^^^^^^^
  File "/Users/ning.ln/anaconda3/envs/ibis-dev-arm64/lib/python3.11/typing.py", line 2377, in get_type_hints
    value = _eval_type(value, base_globals, base_locals)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ning.ln/anaconda3/envs/ibis-dev-arm64/lib/python3.11/typing.py", line 395, in _eval_type
    return t._evaluate(globalns, localns, recursive_guard)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ning.ln/anaconda3/envs/ibis-dev-arm64/lib/python3.11/typing.py", line 910, in _evaluate
    self.__forward_value__ = _eval_type(
                             ^^^^^^^^^^^
  File "/Users/ning.ln/anaconda3/envs/ibis-dev-arm64/lib/python3.11/typing.py", line 409, in _eval_type
    ev_args = tuple(_eval_type(a, globalns, localns, recursive_guard) for a in t.__args__)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ning.ln/anaconda3/envs/ibis-dev-arm64/lib/python3.11/typing.py", line 409, in <genexpr>
    ev_args = tuple(_eval_type(a, globalns, localns, recursive_guard) for a in t.__args__)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ning.ln/anaconda3/envs/ibis-dev-arm64/lib/python3.11/typing.py", line 395, in _eval_type
    return t._evaluate(globalns, localns, recursive_guard)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ning.ln/anaconda3/envs/ibis-dev-arm64/lib/python3.11/typing.py", line 905, in _evaluate
    eval(self.__forward_code__, globalns, localns),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 1, in <module>
NameError: name 'SingleBlockManager' is not defined

This is really confusing ~~
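
A note on where `SingleBlockManager` comes from: the traceback shows `typing.get_type_hints` being called on the annotation itself (here the `pd.Series` class), which means that class's own string annotations get evaluated. A name that is only imported for static type checkers cannot be resolved at runtime, so the lookup fails with `NameError`. The sketch below reproduces that mechanism using only the standard library. `FakeSeries`, `Manager`, and `some_internal_module` are hypothetical stand-ins, and the claim that pandas declares its attribute this way is an inference from the traceback, not something confirmed in this thread.

```python
from typing import TYPE_CHECKING, get_type_hints

if TYPE_CHECKING:
    # Only seen by static type checkers; this import never runs.
    from some_internal_module import Manager  # hypothetical module and name


class FakeSeries:
    # A string (forward-reference) annotation naming a type that
    # does not exist in this module's namespace at runtime.
    _mgr: "Manager"


def resolve_hints(cls: type) -> str:
    """Try to evaluate a class's annotations and report what happens."""
    try:
        get_type_hints(cls)
    except NameError as exc:
        return f"NameError: {exc}"
    return "ok"


result = resolve_hints(FakeSeries)
print(result)
```

The failure happens while the decorator inspects the signature, before any data reaches the function, which is why the error appears at definition time rather than when the UDF is called.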

What version of ibis are you using?

main

What backend(s) are you using, if any?

pyspark

Relevant log output

No response

gforsyth commented 2 weeks ago

Hey @ted0928, thanks for raising this!

Ibis UDFs only accept Ibis expressions as their inputs, but those inputs can be converted to things like pyarrow tables or pandas Series when the UDF is executed on the backend.

This is a little unclear from the docstring examples: they mostly pass single integers, which are implicitly converted to Ibis literal expressions, so they are still Ibis expressions.

Here's an example showing that a column in an Ibis table can be transformed using a pandas Series method:

[ins] In [1]: import ibis

[ins] In [2]: con = ibis.pyspark.connect()

[ins] In [3]: ibis.set_backend(con)  # set backend to spark so memtable is created on con

[ins] In [4]: t = ibis.memtable(dict(int_col=[1, 2, 3], str_col=["a", "b", "c"]))

[ins] In [5]: @ibis.udf.scalar.pandas
         ...: def string_cap(x: str) -> str:
         ...:     return x.str.capitalize()

[ins] In [6]: ibis.options.interactive = True

[ins] In [7]: string_cap(t.str_col)
Out[7]: 
┏━━━━━━━━━━━━━━━━━━━━━━━┓
┃ string_cap_0(str_col) ┃
┡━━━━━━━━━━━━━━━━━━━━━━━┩
│ string                │
├───────────────────────┤
│ A                     │
│ B                     │
│ C                     │
└───────────────────────┘

Note that you still want to provide the column dtype in the UDF type signature, not pd.Series.
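
To make that point concrete, here is a toy model of the idea, not Ibis's actual implementation: a helper reads each parameter's annotation as the element type of the column and rejects anything else. The helper `column_dtypes` and the `_DTYPE_NAMES` table are hypothetical and exist only for this illustration; the dtype names simply mirror the ones printed in the outputs in this thread.

```python
import inspect

# Hypothetical mapping from Python scalar hints to column dtype names.
_DTYPE_NAMES = {int: "int64", float: "float64", str: "string", bool: "boolean"}


def column_dtypes(fn):
    """Return the column dtype name for each annotated parameter of fn."""
    dtypes = {}
    for name, param in inspect.signature(fn).parameters.items():
        hint = param.annotation
        if hint not in _DTYPE_NAMES:
            # A container type such as a Series class would end up here.
            raise TypeError(
                f"parameter {name!r}: annotate the element type "
                f"(e.g. int or str), got {hint!r}"
            )
        dtypes[name] = _DTYPE_NAMES[hint]
    return dtypes


def add_one(x: int) -> int:
    # At execution time x may be a whole column of int64 values,
    # but the annotation still describes a single element.
    return x + 1


dtypes = column_dtypes(add_one)
print(dtypes)
```

In this model the annotation answers "what is the dtype of the column?", while the container the function receives at runtime is decided by the UDF flavor (here, pandas).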

If you have an existing pandas Series, you can throw it into a memtable to make it work with a UDF:

[ins] In [8]: import pandas as pd

[ins] In [9]: intseries = pd.Series([4, 5, 6], dtype="int64")

[nav] In [10]: t = ibis.memtable(dict(intcol=intseries))

[ins] In [11]: t
Out[11]:

┏━━━━━━━━┓
┃ intcol ┃
┡━━━━━━━━┩
│ int64  │
├────────┤
│      4 │
│      5 │
│      6 │
└────────┘

[ins] In [12]: @ibis.udf.scalar.pandas
          ...: def add_one(x: int) -> int:
          ...:     return x + 1

[ins] In [13]: add_one(t.intcol)
Out[13]:

┏━━━━━━━━━━━━━━━━━━━┓
┃ add_one_0(intcol) ┃
┡━━━━━━━━━━━━━━━━━━━┩
│ int64             │
├───────────────────┤
│                 5 │
│                 6 │
│                 7 │
└───────────────────┘

Hopefully that helps clear things up a little, and thanks for drawing our attention to the confusing docstrings (it tripped me up, too).

If you have other usage questions, please feel free to open issues or you can come join our Zulip!

ted0928 commented 2 weeks ago

Thank you so much for the string literal example; it's much clearer to me now!