ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.26k stars 594 forks source link

feat: Sequence comparisons eg Array.Intersection(), Array.jaccard(), String.jaccard(), String.Levenshtein(), etc #6719

Closed NickCrews closed 1 year ago

NickCrews commented 1 year ago

Is your feature request related to a problem?

I am writing a record linkage framework on top of ibis. Part of this is the need to compare two records to determine if they are the same. Examples of this:

I can look up other backends if the number of supporting ones is important for you to decide if its worth it.

Describe the solution you'd like

Order of desire:

What version of ibis are you running?

main

What backend(s) are you using, if any?

duckdb, some other backends support this but not all.

Code of Conduct

jcrist commented 1 year ago

Hmmm, I wonder if these would be a good use case for our new UDF support. This isn't finished yet, but there's a future where the following would work:

con  = ibis.duckdb.connect()

# This would query the duckdb engine to automatically determine the signature of `jaccard`,
# so we can error nicely on misuse at expression build time rather than at run time.
# This works in some backends, but not in duckdb yet.
jaccard = con.function("jaccard")

t = con.table(...)
res = jaccard(t.x, t.y)

# We're also thinking about making a general way to define udf functions as expressions alone,
# which would make the expression-level versions backend agnostic. Something like (don't focus
# too much on the spelling).

@ibis.udf.builtin
def jaccard(x: str, y: str) -> float:
    ...  # this isn't eliding details, the function body is intended to be empty here

# If you're familiar with ibis's unbound tables, you might think of these as unbound functions.
# We just assume that they exist in the backend and that the specified schema is correct.

res = jaccard(t.x, t.y)

This might be a nice way to let you expose this functionality without adding a bunch of new methods to the various ibis types. Thoughts?

NickCrews commented 1 year ago

jaccard = con.function("jaccard")

This requires me to have access to the connection, which is tricky in a framework: The user just gives me inputs as Tables/Expressions, and I don't have access to the backend directly. So I would have to use the private method expr._find_backend(), which feels dirty. If you made this an official API, then I think this would work for me. Could we make it so you register on the entire class of backend, eg jaccard = ibis.duckdb.function("jaccard")? I guess you actually need a physical connection to do the lookup of the signature...

I like the second API, I think that would work! I'm not sure about calling it a UDF though, that gives me the impression that this python function is actually going to get called, like with other UDFs. What about registering it with ibis more directly (even if under the hood it uses the UDF machinery)?

@ibis.register_builtin
def jaccard(x: str, y: str) -> float:
    ...
jcrist commented 1 year ago

Agreed that for your use case the second option would be a better fit. I wouldn't focus too much on the naming here (there's lots of ways we could spell this), just wanted to see if conceptually this would solve your use case.

NickCrews commented 1 year ago

Correction: jaccard() and levenshtein(), while they appeared to work on arrays, actually don't: https://github.com/duckdb/duckdb/issues/8414

So this request should be dialed back to Array.intersection() and String.jaccard() and String.levenshtein() for now

cpcloud commented 1 year ago

Levenshtein might be a good candidate for a top level API, it's supported in at least 6 backends:

cpcloud commented 1 year ago

Ok, we've added levenshtein and array intersection.

I think Jaccard similarity is uncommon enough across our backends that it is a good candidate for testing the builtin function API as @jcrist suggests.

cpcloud commented 1 year ago

Closed by #7119.

Docs here: https://ibis-project.org/how-to/extending/builtin