Closed: ted0928 closed this 2 weeks ago
Hey @ted0928, thanks for raising this!
Ibis UDFs only accept Ibis expressions as their inputs, but those inputs can be converted to things like pyarrow tables or pandas Series when the UDF is executed on the backend.
This was a little unclear from the docstring examples, since they mostly use single integers. Those are implicitly converted to Ibis integer literals, so they are Ibis expressions too.
Here's an example showing that a column in an Ibis table can be transformed using a pandas Series method:
[ins] In [1]: import ibis
[ins] In [2]: con = ibis.pyspark.connect()
[ins] In [3]: ibis.set_backend(con) # set backend to spark so memtable is created on con
[ins] In [4]: t = ibis.memtable(dict(int_col=[1, 2, 3], str_col=["a", "b", "c"]))
[ins] In [5]: @ibis.udf.scalar.pandas
         ...: def string_cap(x: str) -> str:
         ...:     return x.str.capitalize()
[ins] In [6]: ibis.options.interactive = True
[ins] In [7]: string_cap(t.str_col)
Out[7]:
┏━━━━━━━━━━━━━━━━━━━━━━━┓
┃ string_cap_0(str_col) ┃
┡━━━━━━━━━━━━━━━━━━━━━━━┩
│ string                │
├───────────────────────┤
│ A                     │
│ B                     │
│ C                     │
└───────────────────────┘
Note that you still want to provide the column dtype in the UDF type signature, not pd.Series.
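To make that distinction concrete, here is a toy sketch in plain Python. It is not how Ibis is implemented; `toy_vectorized_udf`, `Batch`, and `DTYPES` are hypothetical names made up for illustration. It only shows the general idea: a decorator can read the annotations to learn the element dtype, while the function body still receives a whole batch of values at call time.

```python
from typing import get_type_hints

# Hypothetical mapping from Python annotations to column dtype names.
DTYPES = {int: "int64", float: "float64", str: "string", bool: "boolean"}


def toy_vectorized_udf(fn):
    """Toy decorator (assumption, not the Ibis API): read dtypes from annotations."""
    hints = get_type_hints(fn)
    unsupported = {name: tp for name, tp in hints.items() if tp not in DTYPES}
    if unsupported:
        # A container type such as Batch is not an element dtype, so it is rejected.
        raise TypeError(f"annotations must be element dtypes, got {unsupported}")
    # Record the declared dtypes on the function for later inspection.
    fn.signature = {name: DTYPES[tp] for name, tp in hints.items()}
    return fn


class Batch(list):
    """Stand-in for a pandas Series: a container holding many column values."""

    def __add__(self, other):
        # Element-wise addition, mimicking vectorized behavior.
        return Batch(v + other for v in self)


@toy_vectorized_udf
def add_one(x: int) -> int:
    # x is annotated with the element dtype but arrives as a whole Batch.
    return x + 1


result = add_one(Batch([4, 5, 6]))

try:
    @toy_vectorized_udf
    def bad(x: Batch) -> Batch:  # annotating with the container type
        return x
except TypeError as exc:
    error_message = str(exc)
```

In this sketch the annotation describes a single value in the column and the runtime argument is the vectorized container, which is why annotating with the container type fails at decoration time.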
If you have an existing pandas Series, you can put it into a memtable to make it work with a UDF:
[ins] In [8]: import pandas as pd
[ins] In [9]: intseries = pd.Series([4, 5, 6], dtype="int64")
[nav] In [10]: t = ibis.memtable(dict(intcol=intseries))
[ins] In [11]: t
Out[11]:
┏━━━━━━━━┓
┃ intcol ┃
┡━━━━━━━━┩
│ int64  │
├────────┤
│      4 │
│      5 │
│      6 │
└────────┘
[ins] In [12]: @ibis.udf.scalar.pandas
          ...: def add_one(x: int) -> int:
          ...:     return x + 1
[ins] In [13]: add_one(t.intcol)
Out[13]:
┏━━━━━━━━━━━━━━━━━━━┓
┃ add_one_0(intcol) ┃
┡━━━━━━━━━━━━━━━━━━━┩
│ int64             │
├───────────────────┤
│                 5 │
│                 6 │
│                 7 │
└───────────────────┘
Hopefully that helps clear things up a little, and thanks for drawing our attention to the confusing docstrings (it tripped me up, too).
If you have other usage questions, please feel free to open issues or you can come join our Zulip!
Thank you so much for giving me the string example; it feels much clearer to me now!
What happened?
I'm not sure if this problem is due to a bug or a usage issue. Here's my code
This works, but at runtime x is actually a pd.Series.
When I declare x as pd.Series, an exception is raised while the type hint is being resolved.
This is really confusing.
What version of ibis are you using?
main
What backend(s) are you using, if any?
pyspark
Relevant log output
No response
Code of Conduct