markfairbanks / tidypolars

Tidy interface to polars
http://tidypolars.readthedocs.io
MIT License
321 stars, 10 forks

Is it possible to have dplyr's `group_by` + `mutate` behavior? #201

Closed tomicapretto closed 2 years ago

tomicapretto commented 2 years ago

First of all, I really like this package and I've started to use it a lot in my work. As a Pythonista whose first language is R, I really enjoy tidypolars.

In R, we can do something like the following:

library(dplyr)
data(iris)

iris %>%
  group_by(Species) %>%
  mutate(
    result = Petal.Width - mean(Petal.Width)
  )

Since we have a group_by(Species) call, dplyr will subtract the mean that corresponds to each group in the mutate() operation (not the mean across all observations from all species).
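To make the grouped semantics concrete, here is a minimal plain-Python sketch of what a grouped mutate computes: each row's value minus the mean of its own group. The helper name `demean_by_group` and the toy rows are made up for illustration; this only mimics the behavior and is not how dplyr or tidypolars implement it.

```python
from collections import defaultdict
from statistics import mean

def demean_by_group(rows, value_key, group_key):
    """Return copies of rows with a 'result' field: value minus its group's mean."""
    # First pass: collect the values belonging to each group.
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    # One mean per group.
    group_means = {g: mean(vals) for g, vals in groups.items()}
    # Second pass: subtract the row's own group mean, keeping the original row order.
    return [
        {**row, "result": row[value_key] - group_means[row[group_key]]}
        for row in rows
    ]

# Toy data (hypothetical values, not the real iris measurements).
rows = [
    {"species": "a", "width": 1.0},
    {"species": "a", "width": 3.0},
    {"species": "b", "width": 10.0},
    {"species": "b", "width": 20.0},
]
out = demean_by_group(rows, "width", "species")
print([r["result"] for r in out])  # [-1.0, 1.0, -5.0, 5.0]
```

Note that the output has one row per input row, which is what distinguishes a grouped mutate from a grouped summary.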

As far as I understand, this is still not possible with tidypolars since we don't have a group_by function that behaves in a similar way to the one in dplyr. So my questions are

Again, thanks for the fantastic library!

markfairbanks commented 2 years ago

In tidypolars I decided to implement .group_by() slightly differently than in the tidyverse: if a function can operate "by group", you pass the by arg. Here is how you would do it in your example.

import tidypolars as tp
from tidypolars import col

path = (
    "https://gist.githubusercontent.com/netj/8836201/" +
    "raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv"
)

iris = tp.read_csv(path).rename(species = 'variety')

(
    iris
    .mutate(
        result = col("petal.width") - tp.mean(col("petal.width")),
        by = "species"
    )
)

shape: (150, 6)
┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┬────────┐
│ sepal.length ┆ sepal.width ┆ petal.length ┆ petal.width ┆ species   ┆ result │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       ┆ ---    │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str       ┆ f64    │
╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╪════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ Setosa    ┆ -0.046 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ Setosa    ┆ -0.046 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ Setosa    ┆ -0.046 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ Setosa    ┆ -0.046 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ ...          ┆ ...         ┆ ...          ┆ ...         ┆ ...       ┆ ...    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 6.3          ┆ 2.5         ┆ 5.0          ┆ 1.9         ┆ Virginica ┆ -0.126 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 6.5          ┆ 3.0         ┆ 5.2          ┆ 2.0         ┆ Virginica ┆ -0.026 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 6.2          ┆ 3.4         ┆ 5.4          ┆ 2.3         ┆ Virginica ┆ 0.274  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ Virginica ┆ -0.226 │
└──────────────┴─────────────┴──────────────┴─────────────┴───────────┴────────┘

markfairbanks commented 2 years ago

Lots of functions have the by arg so they can operate by group: mutate, filter, slice, summarize, etc.

Basically - if a function can operate "by group" in the tidyverse you'll be able to use the by arg in tidypolars.
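As a rough illustration of what "by group" means for verbs other than mutate, the plain-Python sketch below mimics a grouped summarize (one row per group) and a grouped filter (a condition evaluated against each row's own group). The helper names and toy data are hypothetical, and this is not tidypolars code; in tidypolars you would pass the by arg to the corresponding method instead.

```python
from collections import defaultdict
from statistics import mean

def split_by(rows, group_key):
    """Group rows by a key, preserving first-seen group order."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    return groups

def summarize_by(rows, value_key, group_key):
    """Grouped summarize: one output row per group, holding the group mean."""
    return [
        {group_key: g, "avg": mean(r[value_key] for r in members)}
        for g, members in split_by(rows, group_key).items()
    ]

def filter_above_group_mean(rows, value_key, group_key):
    """Grouped filter: keep rows whose value exceeds their own group's mean."""
    means = {s[group_key]: s["avg"] for s in summarize_by(rows, value_key, group_key)}
    return [r for r in rows if r[value_key] > means[r[group_key]]]

# Toy data (hypothetical values).
rows = [
    {"g": "a", "x": 1.0},
    {"g": "a", "x": 3.0},
    {"g": "b", "x": 10.0},
    {"g": "b", "x": 20.0},
]
summary = summarize_by(rows, "x", "g")        # one row per group
kept = filter_above_group_mean(rows, "x", "g")  # rows above their group's mean
print(summary)
print(kept)
```

Without grouping, the filter would compare every row against the overall mean (8.5 here) and keep both rows of group "b"; with grouping, it keeps one row from each group.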

Hope this helps! If you have any other questions let me know.

tomicapretto commented 2 years ago

Excellent! Thanks a lot for the prompt and awesome response!

markfairbanks commented 2 years ago

Saw your blog post and I'm glad tidypolars is working out for you!

Figured I would mention that tidypolars has a .drop_null() method. It works like the tidyverse's drop_na() or pandas .dropna() - though the .filter() approach you used works as well.

You can also use it to drop nulls from specific columns if you want.

# drop nulls from all columns
df.drop_null()

# drop nulls from "x" and "y"
df.drop_null('x', 'y')

tomicapretto commented 2 years ago

Awesome! I'll update the post!