pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Auto infer output type coming from `np.vectorize` #12607

Open gab23r opened 1 year ago

gab23r commented 1 year ago

Description

When I use `np.vectorize`, I can specify the output type using `otypes`. It would be nice if polars could use this information (if given) to cast to the right type.

Here I give `otypes=[int]` but I get a `Utf8` dtype:

import Levenshtein  # third-party package that provides Levenshtein.distance
import numpy as np
import polars as pl

levenshtein = pl.reduce(np.vectorize(Levenshtein.distance, otypes=[int]), [pl.col('A'), pl.col('B')]).explode()
pl.LazyFrame({'A': list('abc'), 'B': list('edc')}).with_columns(levenshtein=levenshtein).schema
# OrderedDict([('A', Utf8), ('B', Utf8), ('levenshtein', Utf8)])
alexander-beedie commented 1 year ago

I can see how to improve this. It looks like we should probably add a new (optional) `return_dtype` argument to `reduce` (and probably also to `fold` et al.) so that the caller can specify the output dtype precisely when it is known. If the caller doesn't set it and we can infer it on the Python side, as we could here, we can do some additional inference 🤔
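
As an illustration of what inference on the Python side could look like, here is a rough sketch and not the polars implementation. It assumes the `otypes` attribute that `np.vectorize` instances carry can be read back; that attribute is an implementation detail of NumPy, and its form (a string of type codes or a list of dtypes) differs between NumPy versions. The helper name `declared_output_dtype` is hypothetical.

```python
import numpy as np


def declared_output_dtype(func):
    """Return the single NumPy dtype declared via `otypes`, or None if unknown.

    Assumption: np.vectorize keeps the declared output types in `.otypes`.
    """
    if not isinstance(func, np.vectorize):
        return None
    otypes = getattr(func, "otypes", None)
    if not otypes or len(otypes) != 1:
        # Nothing declared, or several outputs: no single dtype to report.
        return None
    # np.dtype accepts both a type-code character and a dtype object.
    return np.dtype(otypes[0])


typed = np.vectorize(len, otypes=[int])
untyped = np.vectorize(len)

print(declared_output_dtype(typed))    # an integer dtype
print(declared_output_dtype(untyped))  # None
```

A caller-supplied `return_dtype` would still take precedence; a lookup like this would only apply when none is given.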

deanm0000 commented 1 year ago

On a semi-related note, @ion-elgreco made this extension, which includes Levenshtein distance.

deanm0000 commented 1 year ago

@alexander-beedie as long as you're contemplating changes to reduce, is there anything to be done for this issue?

gab23r commented 1 year ago

Yes, I was using @ion-elgreco's plugin (and got a 2x speedup), but this extension is not yet available for the new polars version.

ion-elgreco commented 1 year ago

@gab23r I'll push a new release tomorrow!