pola-rs / pyo3-polars

Plugins/extension for Polars
MIT License
238 stars 39 forks source link

` #[polars_expr(type_func=...)]` always fails #30

Closed andysham closed 11 months ago

andysham commented 11 months ago

I haven't been able to get this feature to work locally, based on the example - note that the haversine expression isn't used in the test run.py file.

When I attempt to use the haversine expression, I get the following error

Traceback (most recent call last):
  File "/home/andy/git/pyo3-polars/example/derive_expression/run.py", line 13, in <module>
    out = df.with_columns(
  File "/home/andy/git/pyo3-polars/example/derive_expression/venv/lib/python3.10/site-packages/polars/dataframe/frame.py", line 7903, in with_columns
    return self.lazy().with_columns(*exprs, **named_exprs).collect(eager=True)
  File "/home/andy/git/pyo3-polars/example/derive_expression/venv/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 3809, in with_columns
    return self._from_pyldf(self._ldf.with_columns(pyexprs))
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: DlSym { desc: "/home/andy/git/pyo3-polars/example/derive_expression/expression_lib/expression_lib/expression_lib.cpython-310-x86_64-linux-gnu.so: undefined symbol: __polars_field_haversine" }
make: *** [Makefile:22: run] Error 1

Minimal reproduction

This modification to ./example/derive_expression/run.py returns the above error, everything else unchanged.

import polars as pl
from expression_lib import Language, Distance

df = pl.DataFrame({
    "names": ["Richard", "Alice", "Bob"],
    "moons": ["full", "half", "red"],
    "dist_a": [[12, 32, 1], [], [1, -2]],
    "dist_b": [[-12, 1], [43], [876, -45, 9]],
    "floats": [5.6, -1245.8, 242.224]
})

out = df.with_columns(
   pig_latin = pl.col("names").language.pig_latinnify(),
).with_columns(
    hamming_dist = pl.col("names").dist.hamming_distance("pig_latin"),
    jaccard_sim = pl.col("dist_a").dist.jaccard_similarity("dist_b"),
    haversine = pl.col("floats").dist.haversine("floats", "floats", "floats", "floats"),
)

print(out)

Potential fix

It seems that whenever the polars-plan crate resolves the type information of an FFI call, it looks for the symbol __polars_field_{}, where {} is the name of the Rust function under #[polars_expr(...)], see this LoC.

The function that is actually exposed for this field resolution is defined here for expressions of the pattern #[polars_expr(type_func=...)], and here for expressions of the pattern #[polars_expr(output_type=...)].

In the latter case, as fn_name is the function name itself (i.e. haversine in this case), get_field_name returns the correct symbol - in the former, fn_name is instead the name of the type resolution function haversine_output, meaning the symbol exposed in the FFI is instead __polars_field_haversine_output, which is unknown to polars-plan.

We can verify this by inspecting the .so file created by pyo3:

$ nm -D ./expression_lib/expression_lib/expression_lib.cpython-310-x86_64-linux-gnu.so | grep " T"
00000000001d0eb0 T __polars_field_hamming_distance
00000000001d14f0 T __polars_field_haversine_output
00000000001d0890 T __polars_field_jaccard_similarity
00000000001d0340 T __polars_field_pig_latinnify
00000000001d1110 T hamming_distance
00000000001d1710 T haversine
00000000001d0af0 T jaccard_similarity
00000000001d05a0 T pig_latinnify

If the fix is simply changing the symbol here, I have implemented this in this PR.