moodymudskipper / nakedpipe

Pipe Into a Sequence of Calls Without Repeating the Pipe Symbol.

compact syntax for summarizing #26

Open moodymudskipper opened 3 years ago

moodymudskipper commented 3 years ago

Note: largely outdated; keeping this around until all issues have moved to better places.

I feel both dplyr and data.table are too verbose for most summarizing operations.

Moreover, dplyr became very sophisticated with across, but it leads to summarize(across(where(...))) situations that I find awkward.

What about making these two equivalent:

starwars %.% {
  {
    max({"mass"; "birth_year"}, na.rm = TRUE)
    mean({is.integer}, na.rm = TRUE)
  } ~ sex + gender
}

starwars %>%
  group_by(sex, gender) %>%
  summarize(
    across(one_of("mass", "birth_year"), max, na.rm = TRUE),
    across(where(is.integer), mean, na.rm = TRUE)) %>%
  ungroup()

{} means "all", but it can be filled with symbols or literals, so that it means "if" when it evaluates to a function, or "at" when it evaluates to a numeric or character vector.
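A minimal base R sketch of how these three readings could be resolved to column names. `resolve_selector` is a hypothetical helper, not part of nakedpipe, and the empty `{}` is modelled here as a `NULL` selector, which is an assumption.

```r
# Hypothetical helper: turn a selector into column names.
resolve_selector <- function(data, selector = NULL) {
  if (is.null(selector)) {
    # empty selector: "all" columns
    return(names(data))
  }
  if (is.function(selector)) {
    # "if" semantics: keep columns for which the predicate returns TRUE
    keep <- vapply(data, function(col) isTRUE(selector(col)), logical(1))
    return(names(data)[keep])
  }
  if (is.character(selector)) {
    # "at" semantics by name: fail early on unknown columns
    unknown <- setdiff(selector, names(data))
    if (length(unknown) > 0) {
      stop("Unknown column(s): ", paste(unknown, collapse = ", "))
    }
    return(selector)
  }
  if (is.numeric(selector)) {
    # "at" semantics by position
    return(names(data)[selector])
  }
  stop("Unsupported selector type")
}

df <- data.frame(a = 1:3, b = c(1.5, 2.5, 3.5), c = letters[1:3],
                 stringsAsFactors = FALSE)
resolve_selector(df)               # all columns
resolve_selector(df, is.integer)   # columns satisfying the predicate
resolve_selector(df, c("a", "b"))  # columns by name
resolve_selector(df, 2)            # columns by position
```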

{ already has several meanings, and that is the flaw.

Also, the use of ; is really unorthodox.

But all in all I think it might be worth it.

The last post in https://github.com/moodymudskipper/nakedpipe/issues/21 proposes some simpler summarize behaviors to accompany it

moodymudskipper commented 3 years ago

Rather than {}, better to use ?:

starwars %.% {
  {
    max(?c("mass", "birth_year"), na.rm = TRUE)
    mean(?is.integer, na.rm = TRUE)
  } ~ sex + gender
}

To say "all" we can keep the {} syntax or we do ?names(.), which isn't horrible to type.

Leveraging tidyselect if it's installed would be nice
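For reference, R's parser already accepts a unary `?` in front of an argument, so a step written this way can be captured and inspected without evaluating it. The sketch below only shows that detection; `is_selector` is a hypothetical name, not an existing nakedpipe function.

```r
# Hypothetical predicate: is this expression a unary `?selector` call?
is_selector <- function(x) {
  is.call(x) && identical(x[[1]], as.name("?")) && length(x) == 2
}

# Capture one summarizing step without evaluating it.
step <- quote(max(?c("mass", "birth_year"), na.rm = TRUE))

# The first argument is the `?` call, the second is the plain `na.rm = TRUE`.
is_selector(step[[2]])
is_selector(step[[3]])

# The selector itself is the operand of `?` and can be evaluated separately.
selector <- eval(step[[2]][[2]])
selector
```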

moodymudskipper commented 3 years ago

A problem with the above syntax arises when my input is used several times in my function.

In these cases we could use ?. to avoid repeating the ?c("mass", "birth_year") expression.
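One way this could work, sketched under the assumption that a step carries a single selection: every `?` call in the step, whether the full selector or the `?.` shorthand, is replaced by the same column values before evaluation. `substitute_selector` is a hypothetical helper.

```r
# Hypothetical predicate: is this expression a unary `?selector` call?
is_selector <- function(x) {
  is.call(x) && identical(x[[1]], as.name("?")) && length(x) == 2
}

# Walk the expression and replace every `?...` call by `value`.
# Assumption: all `?` calls in one step refer to the same selection.
substitute_selector <- function(expr, value) {
  if (is_selector(expr)) {
    return(value)
  }
  if (is.call(expr)) {
    return(as.call(lapply(as.list(expr), substitute_selector, value = value)))
  }
  expr
}

# The selection is spelled out once and reused through `?.`.
step <- quote(max(?c("mass", "birth_year"), na.rm = TRUE) - min(?., na.rm = TRUE))

# Evaluate the step for one column's values.
column <- c(50, NA, 80)
range_width <- eval(substitute_selector(step, column))
range_width
```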

moodymudskipper commented 3 years ago

The other problem is how to deal with this with the debugging pipe.

I think we should change the following:

{
  max(?c("mass", "birth_year"), na.rm = TRUE)
  mean(?is.integer, na.rm = TRUE)
} ~ sex + gender

to :

. <- nakedpipe::np_summarize(
  data = .,
  expr = { max(?c("mass", "birth_year"), na.rm = TRUE); mean(?is.integer, na.rm = TRUE) },
  by = sex + gender)

And we do our development inside of compute_by_group, so what the debugging pipe shows is what is really happening, and the standard syntax maps to it.

We can also have a function np_step, which takes the data as its first argument and a nakedpipe step expression as its second. It would work as a placeholder to help with the technical debt caused by the debugging and translating features.
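To make the proposed translation target concrete, here is a rough base R sketch of what an np_summarize with this signature might do. It is not the nakedpipe implementation: the body, the helper `is_selector`, and the assumption of exactly one direct `?` argument per step are all illustrative, and rows with missing grouping values are dropped by `split()`.

```r
# Hypothetical sketch of np_summarize(); not the actual nakedpipe code.
is_selector <- function(x) {
  is.call(x) && identical(x[[1]], as.name("?")) && length(x) == 2
}

np_summarize <- function(data, expr, by) {
  env <- parent.frame()
  steps <- as.list(substitute(expr))[-1]  # drop the leading `{`
  by_cols <- all.vars(substitute(by))     # `sex + gender` gives c("sex", "gender")

  # Resolve each step's selector once, against the full data.
  plans <- lapply(steps, function(step) {
    pos <- which(vapply(as.list(step), is_selector, logical(1)))
    stopifnot(length(pos) == 1)           # assumption: one direct `?` argument
    sel <- eval(step[[pos]][[2]], env)
    cols <- if (is.function(sel)) {
      names(data)[vapply(data, sel, logical(1))]  # "if" semantics
    } else {
      as.character(sel)                           # "at" semantics, by name
    }
    list(call = step, pos = pos, cols = cols)
  })

  groups <- split(data, data[by_cols], drop = TRUE)
  rows <- lapply(groups, function(g) {
    out <- g[1, by_cols, drop = FALSE]    # one row holding the group keys
    for (plan in plans) {
      for (col in plan$cols) {
        call <- plan$call
        call[[plan$pos]] <- g[[col]]      # replace `?selector` by the column values
        out[[col]] <- eval(call, env)
      }
    }
    out
  })
  res <- do.call(rbind, unname(rows))
  rownames(res) <- NULL
  res
}

# Small stand-in for starwars, so the example needs no extra package.
df <- data.frame(
  sex = c("female", "female", "male", "male"),
  gender = c("feminine", "feminine", "masculine", "masculine"),
  mass = c(50, NA, 80, 90),
  birth_year = c(19, 30, 41, NA),
  n_films = c(1L, 3L, 2L, 4L),
  stringsAsFactors = FALSE
)

res <- np_summarize(
  data = df,
  expr = { max(?c("mass", "birth_year"), na.rm = TRUE); mean(?is.integer, na.rm = TRUE) },
  by = sex + gender)
res
```

Resolving the selectors before splitting keeps the set of output columns identical across groups, which is what lets the per-group rows be bound together at the end.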

moodymudskipper commented 3 years ago

? can also be used on the rhs, and we get the features of group_by_at, group_by_if
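A sketch of how the right-hand side could be turned into grouping columns, covering both the plain `sex + gender` form and a `?` selector. `resolve_by` is a hypothetical helper that works on an already captured expression; how the expression is extracted from the step is left out.

```r
# Hypothetical helper: turn a captured right-hand side into grouping column names.
resolve_by <- function(data, by_expr, env = parent.frame()) {
  is_sel <- is.call(by_expr) &&
    identical(by_expr[[1]], as.name("?")) &&
    length(by_expr) == 2
  if (!is_sel) {
    # plain form: `sex + gender` gives c("sex", "gender")
    return(all.vars(by_expr))
  }
  sel <- eval(by_expr[[2]], env)
  if (is.function(sel)) {
    # like group_by_if: columns satisfying the predicate
    return(names(data)[vapply(data, sel, logical(1))])
  }
  if (is.character(sel)) {
    # like group_by_at, by name
    return(sel)
  }
  if (is.numeric(sel)) {
    # like group_by_at, by position
    return(names(data)[sel])
  }
  stop("Unsupported selector type")
}

df <- data.frame(
  sex = c("female", "male"),
  gender = c("feminine", "masculine"),
  mass = c(50, 80),
  stringsAsFactors = FALSE
)

resolve_by(df, quote(sex + gender))
resolve_by(df, quote(?is.character))
resolve_by(df, quote(?c("sex", "mass")))
```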

moodymudskipper commented 3 years ago

Some random ideas about the rhs :

moodymudskipper commented 3 years ago

Maybe ?"*" is reasonable to say "all", it's very unlikely to have a column named "*" and we can fail explicitly early if there is one
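A small sketch of that rule, assuming the selector has already been evaluated to the string "*". `resolve_all` is a hypothetical name; it expands "*" to every column and stops early when the data really has a column called "*".

```r
# Hypothetical helper: expand the "*" selector, failing early on ambiguity.
resolve_all <- function(data, selector) {
  if (identical(selector, "*")) {
    if ("*" %in% names(data)) {
      stop("`?\"*\"` is ambiguous: the data has a column named \"*\"")
    }
    return(names(data))
  }
  selector
}

df <- data.frame(a = 1:2, b = c("x", "y"), stringsAsFactors = FALSE)
resolve_all(df, "*")   # every column
resolve_all(df, "a")   # other selectors pass through unchanged

# A data frame that really has a "*" column triggers the early failure.
odd <- data.frame(`*` = 1:2, check.names = FALSE)
failure <- tryCatch(resolve_all(odd, "*"), error = function(e) e)
```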