machow / siuba

Python library for using dplyr like syntax with pandas and SQL
https://siuba.org
MIT License

feat: symbolic support numpy ufuncs #415

Closed (machow closed 2 years ago)

machow commented 2 years ago

Addresses #399 by adding a dispatcher named array_ufunc, which is used in Symbolic.__array_ufunc__.

TODO:

Example:

from siuba.data import mtcars
from siuba import _, mutate, head
import numpy as np

res1 = mtcars >> mutate(res = _.hp - np.mean(_.hp))
res2 = mtcars >> mutate(res = _.hp - _.hp.mean())

# note that they are only equal because there are no missing values
res1.equals(res2)      # True

Here's what the symbolic looks like:

np.sqrt(_)
█─'__call__'
├─█─'__custom_func__'
│ └─<function array_ufunc at 0x106959310>
├─_
├─<ufunc 'sqrt'>
├─'__call__'
└─_
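For readers unfamiliar with the hook, here is a minimal sketch of how NumPy's `__array_ufunc__` protocol hands a ufunc call to a non-array object. The `Recorder` class is hypothetical and is not siuba's `Symbolic`. It only shows that the hook receives the ufunc, the method name, and the inputs, which correspond to the arguments in the tree above.

```python
import numpy as np


class Recorder:
    """Hypothetical stand-in for a symbolic object (not siuba's Symbolic)."""

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # NumPy calls this hook instead of computing a result. It passes the
        # ufunc object, the method name (e.g. "__call__"), and the inputs.
        return ("recorded", ufunc.__name__, method, len(inputs))


result = np.sqrt(Recorder())
print(result)
```

A real symbolic would presumably build a deferred call from these pieces rather than return a tuple. That is an assumption about the design, not a description of the PR's code.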
machow commented 2 years ago

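Quick note: it seems that np.sum is not itself a ufunc, so it does not go through __array_ufunc__. Instead, the call ends up as a plain method call on the symbol, _.sum(axis=None, out=None):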

from siuba import _
import numpy as np

np.sum(_)
█─'__call__'
├─█─.
│ ├─_
│ └─'sum'
├─axis = None
└─out = None

edit:

This also breaks with polars, and seems to be more of a numpy issue:

import polars as pl

df = pl.read_csv("https://j.mp/iriscsv")

# okay
np.sqrt(df.sepal_length)
# error
np.sum(df.sepal_length)
machow commented 2 years ago

See https://github.com/numpy/numpy/issues/21387 for added context