summarize() can't use np.sqrt

machow / siuba

Python library for using dplyr like syntax with pandas and SQL

https://siuba.org

MIT License

summarize() can't use np.sqrt #399

Closed HuangHam closed 2 years ago

HuangHam commented 2 years ago

Hi! So glad to find a tidyverse equivalent in python. I encountered the following issue:

data = pd.concat([df_human, df_sim]) >> \
    group_by(_.subj, _.trial, _.split, _.agent, _.inequality) >> \
    summarize(reward = np.sqrt(np.mean(_.reward)))

Note that I want the square root of the mean of the variable named reward, but this gives me an error: invalid __array_struct__. The error doesn't show up for other np functions such as np.size, np.mean, or np.std, so I'm really confused...

machow commented 2 years ago

Hey--

siuba uses _ to represent a lazy expression on the data, rather than the data itself.
You can use pandas methods with it, e.g. _.x.mean().
You can replace function calls with .pipe(), like this:
- bad: np.sqrt(some_pandas_series)
- good: some_pandas_series.pipe(np.sqrt)
- siuba: _.some_pandas_series.pipe(np.sqrt)

from pandas import Series
import numpy as np

ser = Series([1,2,3])

# works on a real Series, but doesn't work when translated to siuba (i.e. with _ in place of ser)
np.sqrt(ser)

# use this
ser.pipe(np.sqrt)

machow commented 2 years ago

I'll work on supporting calls like np.sqrt(_.some_col) using numpy's dispatch mechanisms. (But it might not be possible).
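For readers unfamiliar with the dispatch mechanisms mentioned here: NumPy lets non-array objects intercept ufuncs such as np.sqrt through `__array_ufunc__`, and other functions such as np.mean through `__array_function__`. The sketch below shows how a lazy expression could use both hooks to record a call instead of failing. The `Lazy` class and the `col` and `_resolve` helpers are hypothetical names invented for this illustration; they are not siuba's classes, and the thread does not say which hook siuba ended up using.

```python
import numpy as np


class Lazy:
    """Hypothetical lazy expression: wraps a function from data to a result."""

    def __init__(self, func):
        self._func = func

    def __call__(self, data):
        # Evaluate the recorded expression against concrete data.
        return self._func(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # NumPy calls this for ufuncs such as np.sqrt(lazy_obj).
        if method != "__call__":
            return NotImplemented
        return Lazy(
            lambda data: ufunc(*[_resolve(x, data) for x in inputs], **kwargs)
        )

    def __array_function__(self, func, types, args, kwargs):
        # NumPy calls this for non-ufunc functions such as np.mean(lazy_obj).
        return Lazy(
            lambda data: func(*[_resolve(x, data) for x in args], **kwargs)
        )


def _resolve(x, data):
    # Evaluate nested lazy expressions; pass plain values through unchanged.
    return x(data) if isinstance(x, Lazy) else x


def col(name):
    # Hypothetical stand-in for siuba's _.name column access.
    return Lazy(lambda data: data[name])


# Nothing is computed here: both calls only build a new Lazy object.
expr = np.sqrt(np.mean(col("hp")))

# Evaluation happens once real data is supplied.
result = expr({"hp": np.array([9.0, 16.0, 23.0])})
print(result)  # mean is 16.0, so the square root is 4.0
```

Because each hook returns a new `Lazy` rather than a number, nested calls like np.sqrt(np.mean(...)) compose naturally, and the real NumPy functions only run when the expression is applied to data.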

machow commented 2 years ago

Fixed in version 0.2.3!

from siuba.data import mtcars
from siuba import _, mutate, group_by
import numpy as np

mtcars >> group_by(_.cyl) >> mutate(res = np.sqrt(np.mean(_.hp)))