moodymudskipper / nakedpipe

Pipe Into a Sequence of Calls Without Repeating the Pipe Symbol.
data manipulation brainstorm #29

Open moodymudskipper opened 3 years ago

moodymudskipper commented 3 years ago

QUESTION 1 : General syntax

We use the ? operator to select columns from our data. It :

The former will be used for the equivalent of dplyr::across() operations.

starwars %.% {
  {
    ?c("mass", "birth_year") = ~max(., na.rm = TRUE)
    ?is.integer = ~mean(., na.rm = TRUE)
  } ~ sex + gender
}

Note that we don't need := unlike in tidyverse because these are not named arguments.

We can support custom names, because ? can be binary.

"max_{col}"?c("mass", "birth_year") = ~max(., na.rm = TRUE)
"mean_{col}"?is.integer = ~mean(., na.rm = TRUE)

In general the lhs of ? is used to rename the selection.
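For comparison, here is a sketch of what the grouped example and its renamed variant would presumably look like in dplyr (1.0.0 or later). It assumes that a character selection maps to all_of(), a predicate selection maps to where(), and that "{col}" plays the role of dplyr's "{.col}" placeholder. That mapping is my reading of the proposal, not something the package defines.

```r
library(dplyr)

res <- starwars %>%
  group_by(sex, gender) %>%
  summarize(
    # "max_{col}"?c("mass", "birth_year") = ~max(., na.rm = TRUE)
    across(all_of(c("mass", "birth_year")), ~ max(., na.rm = TRUE),
           .names = "max_{.col}"),
    # "mean_{col}"?is.integer = ~mean(., na.rm = TRUE)
    across(where(is.integer), ~ mean(., na.rm = TRUE),
           .names = "mean_{.col}"),
    .groups = "drop"
  )

names(res)
```

Groups where a column is entirely NA will make max() warn, which is the same behavior we would get from the proposed syntax.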

We could provide a vignette per dplyr help page, comparing all examples.

QUESTION 2: Should we support functions as rhs ?

so ?is.integer = ~max(.) could be written ?is.integer = max ?

It's unambiguous, but increases the chances of making mistakes.

Given that formulas are not much more verbose, let's skip this for now.

QUESTION 3: Should we support these for regular mutate/summarize calls ?

so we'd do for instance

data %.% {
  foo = ~toupper(.)
}

It seems harmless, and in fact more consistent. It's just that if we say yes to question 2, this would be risky:

data %.% {
  foo = toupper
}

I think it's better to hold back on 2 but say yes to 3.
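The trade-off between questions 2 and 3 can be illustrated with a small base R sketch. apply_rhs() is a hypothetical helper, not part of nakedpipe: it evaluates a one-sided formula with `.` bound to the column, and takes anything else at face value, which is the behavior we'd get by saying yes to 3 and no to 2.

```r
# Hypothetical helper, for illustration only
apply_rhs <- function(col, rhs) {
  if (inherits(rhs, "formula") && length(rhs) == 2L) {
    # one-sided formula: evaluate its body with `.` bound to the column
    eval(rhs[[2L]], list(. = col), environment(rhs))
  } else {
    # anything else is taken as the new value, as in a regular mutate
    rhs
  }
}

foo <- c("a", "b")
apply_rhs(foo, ~ toupper(.))          # "A" "B"
is.function(apply_rhs(foo, toupper))  # TRUE: the function itself, not its result
```

If question 2 were accepted, the second call would have to mean "apply toupper" instead, so the same line would change meaning depending on the type of the rhs.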

QUESTION 4 : How to select by regex

By combining a regex with (), {}, ! or ~, or with a prefix inside the string, we can do what the tidyselect helpers do. Candidate syntaxes:

? ("^Petal") = ~toupper(.)
? {"^Petal"} = ~toupper(.)
?! "^Petal" = ~toupper(.)
?~ "^Petal" = ~toupper(.)
? "~^Petal" = ~toupper(.)
? "/^Petal" = ~toupper(.)

An alternative solution would be to treat the rhs of ? differently if it is a string obeying a given format that we wouldn't expect for a column name, maybe starting with "~".
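A minimal sketch of that last alternative, in base R. resolve_cols() is a hypothetical name, and the rule that a single string starting with "~" is a regex is the assumption being tested here, not a settled design.

```r
# Hypothetical resolver for the rhs of `?`
resolve_cols <- function(data, sel) {
  if (is.function(sel)) {
    # predicate: keep the columns for which it returns TRUE
    names(data)[vapply(data, sel, logical(1))]
  } else if (is.character(sel) && length(sel) == 1L && startsWith(sel, "~")) {
    # "~" prefix: the rest of the string is a regex matched against names
    grep(substring(sel, 2L), names(data), value = TRUE)
  } else {
    # plain character vector: literal column names
    stopifnot(is.character(sel), all(sel %in% names(data)))
    sel
  }
}

resolve_cols(iris, "~^Petal")   # "Petal.Length" "Petal.Width"
resolve_cols(iris, is.numeric)  # the four measurement columns
resolve_cols(iris, "Species")   # "Species"
```

The cost of this design is that a column whose name really starts with "~" can no longer be selected by a single string.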

QUESTION 5 : Should we expand this to select and with which features ?

I didn't really intend to propose shorthands for selecting, but with the above it seems to come for free. Instead of select_if() or select(starts_with(...), ...) we can do :

df %.% {
  ?is.numeric
  ?~"^S"
}

I think this is also a very good way to introduce the fancy features: first selection, then how to rename, then mutate using ? to select the input and rename the output.
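For reference, a base R sketch of what the two selection lines would do when applied one after the other. It uses iris as a stand-in for df and assumes ?~"^S" means "columns whose name matches the regex ^S".

```r
df <- iris

# ?is.numeric : keep the columns satisfying a predicate
step1 <- Filter(is.numeric, df)

# ?~"^S" : then keep the columns whose name matches a regex
step2 <- step1[grepl("^S", names(step1))]

names(step2)  # "Sepal.Length" "Sepal.Width"
```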

QUESTION 6 : how to combine selections ?

? selections on consecutive lines mean we keep the intersection of those.

How to select this OR that :

This can also be used for mutating and summarizing, though in that case we would miss a handy syntax for AND.
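Whatever syntax we end up with, the semantics boil down to combining logical masks over the column names. A base R sketch of AND versus OR on iris, with the two conditions chosen only for illustration:

```r
is_num   <- vapply(iris, is.numeric, logical(1))
starts_s <- grepl("^S", names(iris))

and_cols <- names(iris)[is_num & starts_s]  # "Sepal.Length" "Sepal.Width"
or_cols  <- names(iris)[is_num | starts_s]  # all five columns
```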

QUESTION 7 : Should we support renaming and how ?

I think it's non essential, but from what we have it follows almost naturally that we can do :

data %.% {
  "new_name" ?"old_name"
  "updated_{col}" ?c("old1", "old2")
}

Once we're comfortable with the fact that ? is used to select on the right side and rename on the left side it becomes intuitive enough.
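A sketch of the renaming semantics in base R. rename_cols() is a hypothetical helper, and the "{col}" placeholder is substituted literally here rather than through glue.

```r
# Hypothetical helper: rename `old` columns using a template.
# "{col}" in the template is replaced by each old column name.
rename_cols <- function(data, old, template) {
  idx <- match(old, names(data))
  stopifnot(!anyNA(idx))
  names(data)[idx] <- vapply(
    old,
    function(col) sub("{col}", col, template, fixed = TRUE),
    character(1),
    USE.NAMES = FALSE
  )
  data
}

# "kind" ?"Species"
renamed1 <- rename_cols(iris, "Species", "kind")
# "updated_{col}" ?c("Sepal.Length", "Sepal.Width")
renamed2 <- rename_cols(iris, c("Sepal.Length", "Sepal.Width"), "updated_{col}")

names(renamed2)[1:2]  # "updated_Sepal.Length" "updated_Sepal.Width"
```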

QUESTION 8 : Should we force summarizing operations to return only one row per group ?

Our current way of doing it is just to keep the grouping columns and apply any transformation; if the transformation keeps the length, the result is like a grouped transmute call.

summarize used to impose one row per group and fail otherwise; now it allows several. A third option is not to fail but to nest if the output is longer, but that is likely to behave unpredictably.

We can use ~1~ instead of ~ to force the summary to be one row by group.

If we force it to be one row, we'll sometimes need to unnest in the following step, and we cannot pivot longer using the aggregation syntax.

This is related to the 2 next questions.
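To make the strict option concrete, here is a base R sketch of what ~1~ could enforce. summarize_one() is a hypothetical function for a single grouping column and a single summarized column.

```r
# Hypothetical strict summary: fails unless `fun` returns one value per group
summarize_one <- function(data, by, col, fun) {
  res <- lapply(split(data[[col]], data[[by]]), fun)
  if (any(lengths(res) != 1L)) {
    stop("`fun` must return exactly one value per group")
  }
  out <- data.frame(names(res), unlist(res, use.names = FALSE))
  names(out) <- c(by, col)
  out
}

ok <- summarize_one(mtcars, "cyl", "mpg", mean)
nrow(ok)  # 3, one row per value of cyl

# range() returns two values per group, so the strict version refuses it
failed <- inherits(
  try(summarize_one(mtcars, "cyl", "mpg", range), silent = TRUE),
  "try-error"
)
```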

QUESTION 9 : How do we do grouped mutate calls ?

Simple ones can be handled with transform() :

starwars %.% {
  ?c("name", "mass", "homeworld")
  transform(rank = min_rank(desc(mass))) ~ homeworld
}

For grouped mutate_at etc. I think we'd need another syntax :

starwars %.% {
  ?c("name", "mass", "homeworld")
  {rank = min_rank(desc(mass))} ~keep_all~ homeworld
}

This means the default is not to keep all columns. Here we do keep them, so summarized values are recycled alongside the unchanged columns, which makes sense. We can add ~keep_unused~ and ~keep_used~ to our special cases, and they might be used with ~ (none) if we need those without groups. The latter are scarcely useful though.

An issue with this naming is that we might keep all columns but still aggregate all. A better and shorter way might be ~w~ for "window".
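The "window" behavior is what ave() already does in base R: compute by group, keep every row. A sketch on mtcars, where rank(-x, ties.method = "min") stands in for min_rank(desc(x)) in the absence of missing values:

```r
cars <- mtcars[c("mpg", "cyl")]

# rank of mpg within each cyl group, highest mpg first; all rows are kept
cars$rank <- ave(-cars$mpg, cars$cyl,
                 FUN = function(x) rank(x, ties.method = "min"))

nrow(cars) == nrow(mtcars)  # TRUE
```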

We talked above about having ~1~; we could also have ~m~ to add margins to our summaries (see ?reshape::melt), and ~n~ might keep unaggregated columns and nest them.

QUESTION 10 : Should we unnest and how ?

I'm really not sure we should, but tidyr::unnest has become a bit verbose and strict, and unnest_legacy() is long to type.

Maybe we can use ++ ?

++ ?c("col1", "col2")

The idea is that "+" is often used in UIs to mean "develop", so we'd have "+-" for unnest, "++" for unnest_longer, and "--" for unnest_wider, and we can use as many as we want on the same line. We can use ? notation or naked column names.
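For a sense of what the "longer" variant has to do, here is a base R sketch on a toy data frame with one list column. The data and column names are made up for illustration.

```r
df <- data.frame(id = 1:2)
df$vals <- list(1:2, 3:5)  # a list column

# repeat each id once per element of its list entry, then flatten the list
n <- lengths(df$vals)
long <- data.frame(id = rep(df$id, times = n), vals = unlist(df$vals))

nrow(long)  # 5
```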

QUESTION 11 : Should we implement spread ?

see https://github.com/moodymudskipper/nakedpipe/issues/28

QUESTION 12 : then how to gather

see https://github.com/moodymudskipper/nakedpipe/issues/28

QUESTION 13 : how to group by "the other columns"

This can be handy and it's what spreading functions do implicitly.

In base R we often see the dot for this : vars1 ~ .

We use the dot a lot already, so I think vars1 ~ (unused) is better; the () signals that it's a special syntax. We could also have other special values, see 2 questions below.
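The base R dot convention, and the explicit "all the other columns" computation that (unused) would stand for, can be compared on a small subset of mtcars:

```r
df <- mtcars[c("mpg", "cyl", "am")]

# base R formula interface: the dot means "every column not on the lhs"
by_dot <- aggregate(mpg ~ ., data = df, FUN = mean)

# the same grouping computed explicitly, as (unused) would have to
others <- setdiff(names(df), "mpg")
by_unused <- aggregate(df["mpg"], by = df[others], FUN = mean)

nrow(by_dot)  # 6, one row per combination of cyl and am
```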

QUESTION 14: rowwise operations?

What about:

foo(...) ~ (row) 

QUESTION 15: summarize without group ?

What about:

foo(...) ~ (none) 

Or :

foo(...) ~ NULL

I think I prefer (none), more consistent.

It will be useful for summarizing with one big group, or to transmute.

QUESTION 16: Rethinking filtering

arises from a few observations :

If we use ? for column selection, we could use ?? for row selection, so :

We see the same pattern as in our select vs rename, the unary call subsets while the binary call does not.

How about filter_if and filter_at?

?all? (?is.numeric) > 0 # or all?? (?is.numeric) > 0
?all? 0 < ?is.numeric # equivalent avoiding parens

The rhs of ?? should be a logical of the same length as nrow (or recyclable to it), or a numeric. ?any? and ?all? can be used if we have a logical data frame or matrix with a recyclable number of rows.
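A base R sketch of the semantics I have in mind for ?all? and ?any?, on a made-up data frame: build one logical vector per selected column, then reduce them row-wise.

```r
df <- data.frame(a = c(1, -1, 2), b = c(3, 4, -5), g = c("x", "y", "z"))

# (?is.numeric) > 0 : one logical vector per numeric column
num  <- Filter(is.numeric, df)
test <- lapply(num, function(col) col > 0)

all_pos <- Reduce(`&`, test)  # ?all? : every numeric column is positive
any_pos <- Reduce(`|`, test)  # ?any? : at least one is positive

df[all_pos, ]  # first row only
df[any_pos, ]  # all three rows
```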

We'll still support current behavior, but discourage its use for complex conditions.

moodymudskipper commented 3 years ago

All questions are pretty well answered now, still hesitant about Q4 but we can pick one and change later.