ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0

consider: should UDF implementations be scoped to a backend? #8748

Open NickCrews opened 8 months ago

NickCrews commented 8 months ago

Is your feature request related to a problem?

I have this UDF:

@ibis.udf.scalar.builtin
def damerau_levenshtein(left: str, right: str) -> int:
    ...

This only works in DuckDB (or any backend with a builtin function called damerau_levenshtein).

I have some library function like def address_similarity(a1: ir.StringValue, a2: ir.StringValue) -> ir.FloatingValue. Internally it wants to use Damerau-Levenshtein string edit distance to calculate the score. But when a user hands me an abstract expression, I don't know which backend they are hoping to execute it on. If they are going to execute it on DuckDB, then using the builtin UDF would work fine. If they are going to execute it on a different backend, then I would want to fall back to some python/pyarrow UDF. But I don't know which to do at expression creation time!
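For concreteness, here is a minimal sketch of what a pure-Python fallback might compute. It is an assumption, not anything from ibis: the name osa_distance is hypothetical, and it implements the optimal string alignment variant of the distance (adjacent transpositions count as one edit, but no substring is edited twice). A backend's builtin may implement a different variant, so results could differ on some inputs.

```python
def osa_distance(left: str, right: str) -> int:
    """Optimal string alignment distance (a restricted Damerau-Levenshtein).

    Hypothetical helper for illustration only; not part of ibis.
    """
    rows, cols = len(left) + 1, len(right) + 1
    # dist[i][j] = distance between left[:i] and right[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if left[i - 1] == right[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition, e.g. "ab" -> "ba"
            if (
                i > 1
                and j > 1
                and left[i - 1] == right[j - 2]
                and left[i - 2] == right[j - 1]
            ):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]
```

A function like this could be the body of the Python-side fallback, while backends that ship a builtin would skip it entirely.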

Describe the solution you'd like

Spitballing here:

# Other args like name, database, etc. aren't allowed here. This just creates the contract on the ibis side.
@ibis.udf.scalar(signature=...)
def damerau_levenshtein(left: str, right: str) -> int: ...

# Now we plug in implementations...
@damerau_levenshtein.builtin(backends=["duckdb", ...], name="damerau_levenshtein", database=...)
def _damerau_levenshtein_duckdb(): ...

# backends=None means use this as the fallback
@damerau_levenshtein.python(backends=None, database=...)
def _damerau_levenshtein_udf(s1: str, s2: str) -> int:
    return somelib.damlev(s1, s2)

def address_similarity(a1, a2):
    return damerau_levenshtein(a1, a2)

The old APIs should keep working as they did. I don't think they need to change?

What version of ibis are you running?

main

What backend(s) are you using, if any?

No response

cpcloud commented 7 months ago

It seems like many folks want this or are asking about it.

I think we should try to include this in the 10.0.0 release.