data manipulation brainstorm

QUESTION 1 : General syntax

We use the ? operator to select columns from our data. It :

describes a selection of columns to modify if on the lhs of =
returns a data frame with selected columns if used somewhere on the rhs of =
describes a selection of columns to use as groups, if on the rhs of ~

The first of these will be used for the equivalent of dplyr::across() operations.

starwars %.% {
  {
    ?c("mass", "birth_year") = ~max(., na.rm = TRUE)
    ?is.integer = ~mean(., na.rm = TRUE)
  } ~ sex + gender
}
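For reference, here is roughly what the first line of that block is meant to compute, sketched in base R. The toy data frame and its values are made up for illustration, and this is only an assumption about the intended semantics, not how nakedpipe implements it.

```r
# Toy stand-in for starwars (values are illustrative only)
df <- data.frame(
  sex        = c("male", "male", "female", "female"),
  mass       = c(77, 136, 49, NA),
  birth_year = c(19, 41.9, 19, 47),
  stringsAsFactors = FALSE
)

# Assumed meaning of: { ?c("mass", "birth_year") = ~max(., na.rm = TRUE) } ~ sex
res <- aggregate(
  df[c("mass", "birth_year")],          # the selected columns
  by  = df["sex"],                      # the grouping column
  FUN = function(x) max(x, na.rm = TRUE)
)
res
```

The result has one row per value of sex, and the selected columns keep their names because no renaming template was given.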

Note that we don't need := unlike in tidyverse because these are not named arguments.

We can support custom names, because ? can also be used as a binary operator.

"max_{col}"?c("mass", "birth_year") = ~max(., na.rm = TRUE)
"mean_{col}"?is.integer = ~mean(., na.rm = TRUE)

In general the lhs of ? is used to rename the selection.
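As a sketch of how such a name template might be expanded, here is a small base R helper. The name expand_names() is hypothetical and only the "{col}" placeholder is handled; a real implementation might rely on a glue-like interpolation instead.

```r
# Hypothetical helper: expand a "{col}" template once per selected column
expand_names <- function(template, cols) {
  vapply(
    cols,
    function(col) gsub("{col}", col, template, fixed = TRUE),
    character(1),
    USE.NAMES = FALSE
  )
}

new_names <- expand_names("max_{col}", c("mass", "birth_year"))
new_names
```

With the inputs above this gives "max_mass" and "max_birth_year", which would become the names of the output columns.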

We could provide a vignette per dplyr help page, comparing all examples.

QUESTION 2: Should we support functions as rhs ?

so ?is.integer = ~max(.) could be written ?is.integer = max ?

It's unambiguous, but increases the chances of making mistakes.

Given that formulas are not much more verbose, let's skip this for now.

QUESTION 3: Should we support these for regular mutate/summarize calls ?

so we'd do for instance

data %.% {
  foo = ~toupper(.)
}

It seems harmless, and in fact more consistent. It's just that if we say yes to question 2, the following would be risky:

data %.% {
  foo = toupper
}

I think it's better to hold back on 2 but say yes to 3.
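To make the formula option concrete, here is a minimal sketch of how a one-sided formula could be turned into a function of the dot. as_lambda() is a hypothetical name, and the real package may well do this differently (purrr-style lambdas being one obvious model).

```r
# Hypothetical helper: turn a one-sided formula such as ~toupper(.)
# into a function whose single argument is bound to `.`
as_lambda <- function(f) {
  stopifnot(inherits(f, "formula"), length(f) == 2)
  body <- f[[2]]          # the expression on the rhs of ~
  env  <- environment(f)  # where the formula was written
  function(.) eval(body, list(. = .), env)
}

data <- data.frame(foo = c("a", "b"), stringsAsFactors = FALSE)

# Assumed meaning of: data %.% { foo = ~toupper(.) }
data$foo <- as_lambda(~toupper(.))(data$foo)
data$foo
```

Because the rhs is a formula, there is no doubt that the existing column is being transformed, which is the ambiguity a bare function on the rhs would reintroduce.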

QUESTION 4 : How to select by regex

By using a regex together with (), {}, ! or ~ we can do what the tidyselect helpers do. Candidate syntaxes:

? ("^Petal") = ~toupper(.)
? {"^Petal"} = ~toupper(.)
?! "^Petal" = ~toupper(.)
?~ "^Petal" = ~toupper(.)
? "~^Petal" = ~toupper(.)
? "/^Petal" = ~toupper(.)

"~" is often associated with regex, but we already use it to aggregate, for lambdas, and for side effects (unary ~~). Another issue is that we then cannot use ~ for lambdas after ?, or the lambdas will have to be wrapped in ().
we already have a lot of {} too
other unary ops are + and -, and I don't think they look good here.
?! ".*" prevents us from using ! to negate, and "foo" ?! ".*" is not very readable.
? (".*") doesn't look good when the regex contains "(" (with named captures the regex will contain both ? and ()).

An alternative solution would be to treat the rhs of ? differently if it is a string obeying a given format that we wouldn't expect for a column name, maybe starting with "~".
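A rough sketch of that last idea, in base R: the rhs of ? is dispatched on its type, and a single string starting with "~" is read as a regex. resolve_selection() is a hypothetical name, and the "~" prefix is just the convention floated above, not a settled design.

```r
# Hypothetical resolver for the rhs of `?`
resolve_selection <- function(data, spec) {
  if (is.function(spec)) {
    # predicate, e.g. ?is.numeric
    names(data)[vapply(data, spec, logical(1))]
  } else if (is.character(spec) && length(spec) == 1 && startsWith(spec, "~")) {
    # "~regex": strip the marker and match against the column names
    grep(substring(spec, 2), names(data), value = TRUE)
  } else {
    # plain column names
    spec
  }
}

by_regex <- resolve_selection(iris, "~^Petal")
by_pred  <- resolve_selection(iris, is.numeric)
by_name  <- resolve_selection(iris, c("Species", "Sepal.Width"))
by_regex
```

The obvious cost is that a column whose name really starts with "~" can no longer be selected by name alone.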

QUESTION 5 : Should we expand this to select and with which features ?

I didn't really intend to propose shorthands for selecting, but with the above it seems to come for free. Instead of select_if or select(starts_with(...), ...) we can do:

df %.% {
  ?is.numeric
  ?~"^S"
}

I think this is also a very good way to introduce the fancy features: first selection, then renaming, then mutating using ? to select the input and rename the output.

QUESTION 6 : how to combine selections ?

? selections on consecutive lines mean we keep the intersection of those.

How to select this OR that :

anonymous functions : ?(~sapply(., is.numeric) | grepl("^S", names(.)))
pile them up : ?is.numeric ?~"^S" : surprising but unambiguous and really compact.
use | or & with adjusted behaviors : ? is.numeric | {"^S"}

This can also be used for mutating and summarizing, though in that case we would miss a handy syntax for AND.
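In base R terms, the AND and OR cases boil down to combining two logical vectors over the column names. This is only meant to pin down the expected results on iris, not to suggest an implementation.

```r
is_num   <- vapply(iris, is.numeric, logical(1))
starts_s <- grepl("^S", names(iris))

# AND: what two consecutive `?` lines would keep
and_cols <- names(iris)[is_num & starts_s]

# OR: what something like `? is.numeric | {"^S"}` would keep
or_cols <- names(iris)[is_num | starts_s]

and_cols
or_cols
```

On iris the intersection keeps the two Sepal columns, while the union keeps all five columns.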

QUESTION 7 : Should we support renaming and how ?

I think it's non essential, but from what we have it follows almost naturally that we can do :

data %.% {
  "new_name" ?"old_name"
  "updated_{col}" ?c("old1", "old2")
}

Once we're comfortable with the fact that ? is used to select on the right side and rename on the left side it becomes intuitive enough.

QUESTION 8 : Should we force summarizing operations to return only one row per group ?

Our current way of doing it is to keep only the grouping columns and apply any transformation; if the transformation preserves the length, it behaves like a grouped transmute call.

summarize used to impose one row per group and fail otherwise; now it allows longer outputs. A third option is not to fail but to nest if the output is longer, but that is likely to behave unpredictably.

We can use ~1~ instead of ~ to force the summary to be one row by group.

If we force it to be one row, we'll sometimes need to unnest in the following step, and we cannot pivot longer using the aggregation syntax.

This is related to the next two questions.
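To illustrate what a strict mode in the spirit of ~1~ might check, here is a small base R sketch. summarize_strict() is a hypothetical helper working on a single vector, so it leaves out everything about data frames and multiple columns.

```r
# Hypothetical strict summary: error unless `fun` returns exactly one value per group
summarize_strict <- function(x, groups, fun) {
  pieces <- lapply(split(x, groups), fun)
  lens   <- lengths(pieces)
  if (any(lens != 1)) {
    stop("groups not returning exactly one value: ",
         paste(names(lens)[lens != 1], collapse = ", "))
  }
  unlist(pieces)
}

mass <- c(77, 136, 49, 55)
sex  <- c("male", "male", "female", "female")

ok  <- summarize_strict(mass, sex, max)                         # one value per group
bad <- try(summarize_strict(mass, sex, range), silent = TRUE)   # two values per group
ok
```

The second call fails because range() returns two values per group, which is exactly the case where the lenient behavior would silently produce a longer result.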

QUESTION 9 : How do we do grouped mutate calls ?

Simple ones can be handled with transform() :

starwars %.% {
  ?c("name", "mass", "homeworld")
  transform(rank = min_rank(desc(mass))) ~ homeworld
}

For grouped mutate_at and the like, I think we'd need another syntax:

starwars %.% {
  ?c("name", "mass", "homeworld")
  {rank = min_rank(desc(mass))} ~keep_all~ homeworld
}

This means the default is not to keep all columns; here we do, so summarized values will be recycled alongside the unchanged columns, which makes sense. We can add ~keep_unused~ and ~keep_used~ to our special cases, and they might be used with ~ (none) if we need those without groups. The latter are scarcely useful though.

An issue with this naming is that we might keep all columns but still aggregate all. A better and shorter way might be ~w~ for "window".

We talked above about having ~1~; we could also have ~m~ to add margins to our summaries (see ?reshape::melt), and ~n~ might keep unaggregated columns and nest them.
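For comparison, a "window" operation of this kind can be written in base R with ave(), which applies a function per group and returns a vector of the original length. The toy data below is made up, and rank(-x, ties.method = "min") is used as a stand-in for min_rank(desc(x)) on data without missing values.

```r
df <- data.frame(
  name      = c("a", "b", "c", "d"),
  mass      = c(80, 120, 50, 60),
  homeworld = c("X", "X", "Y", "Y"),
  stringsAsFactors = FALSE
)

# Rank by decreasing mass within each homeworld; all rows and columns are kept
df$rank <- ave(
  df$mass, df$homeworld,
  FUN = function(x) rank(-x, ties.method = "min")
)
df
```

All four rows survive and the heaviest entry of each homeworld gets rank 1, which is the behavior ~keep_all~ (or ~w~) is meant to express.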

QUESTION 10 : Should we unnest and how ?

I'm really not sure we should, but tidyr::unnest has become a bit verbose and strict, and unnest_legacy() is long to type.

Maybe we can use ++ ?

++ ?c("col1", "col2")

The idea is that "+" is often used in UIs to mean "expand", so we'd have "+-" for unnest, "++" for unnest_longer, and "--" for unnest_wider, and we can use as many as we want on the same line. We can use the ? notation or naked column names.
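As a reminder of what the "longer" flavor amounts to, here is a base R sketch on a toy list column. It only covers a single list column of atomic vectors; the column names are illustrative.

```r
df <- data.frame(id = c("a", "b"), stringsAsFactors = FALSE)
df$vals <- list(1:2, 3:5)   # list column with 2 and 3 elements

# Repeat each row once per element, then flatten the list column
n <- lengths(df$vals)
long <- data.frame(
  id   = rep(df$id, times = n),
  vals = unlist(df$vals),
  stringsAsFactors = FALSE
)
long
```

The two input rows become five output rows, one per element of the list column.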

QUESTION 11 : Should we implement spread ?

see https://github.com/moodymudskipper/nakedpipe/issues/28

QUESTION 12 : then how to gather

see https://github.com/moodymudskipper/nakedpipe/issues/28

QUESTION 13 : how to group by "the other columns"

This can be handy and it's what spreading functions do implicitly.

In base R we often see the dot for this : vars1 ~ .

We use the dot a lot already, so I think vars1 ~ (unused) is better; the () signals that it's a special syntax. We could also have other special values, see the next two questions.
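For illustration, here is what (unused) would presumably resolve to, next to the base R dot notation mentioned above. The data frame is a toy example and the variable names used and unused are just labels for this sketch.

```r
df <- data.frame(
  g1 = c("a", "a", "b", "b"),
  g2 = c("x", "x", "x", "y"),
  v  = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

used   <- "v"
unused <- setdiff(names(df), used)   # what `(unused)` would stand for

# Group by every column that is not being summarized
res_explicit <- aggregate(df[used], by = df[unused], FUN = sum)

# Base R formula equivalent, with the dot meaning "all other columns"
res_dot <- aggregate(v ~ ., data = df, FUN = sum)

res_explicit
```

Both calls group by g1 and g2 and return one row per observed combination.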

QUESTION 14: rowwise operations?

What about:

foo(...) ~ (row)

QUESTION 15: summarize without group ?

What about:

foo(...) ~ (none)

Or :

foo(...) ~ NULL

I think I prefer (none), more consistent.

It will be useful for summarizing with one big group, or to transmute.

QUESTION 16: Rethinking filtering

This arises from a few observations:

Filtering as it is now works only if we use a given set of operators; for example, is.na(foo) won't work.
If we have a long conditional expression it's confusing, because we have to read on to see whether it is a condition. If we lose in reading what we save in typing, this is not good.

If we use ? for column selection, we could use ?? for row selection, so:

we could use ?? is.na(foo), and of course things like ?? foo == 0
no matter how long the expressions are, it's obvious that we're subsetting

We see the same pattern as in our select vs rename, the unary call subsets while the binary call does not.

How about filter_if and filter_at?

?all? (?is.numeric) > 0 # or all?? (?is.numeric) > 0
?all? 0 < ?is.numeric # equivalent avoiding parens

The rhs of ?? should be a logical of the same length as nrow (or recyclable), or a numeric. ?any? and ?all? can be used if we have a logical data frame or matrix with a recyclable number of rows.
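In base R, the ?all? and ?any? variants would correspond to reducing a logical matrix row by row. The sketch below uses a made-up data frame and is only meant to show the expected result, not a proposed implementation.

```r
df <- data.frame(
  a     = c(1, -1, 2),
  b     = c(3, 4, 0),
  label = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

num  <- vapply(df, is.numeric, logical(1))
cond <- df[num] > 0                      # logical matrix, one column per numeric column

all_rows <- df[apply(cond, 1, all), ]    # ?all?: every numeric column is > 0
any_rows <- df[apply(cond, 1, any), ]    # ?any?: at least one numeric column is > 0

all_rows
```

Only the first row passes the "all" condition, while every row passes the "any" condition.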

We'll still support current behavior, but discourage its use for complex conditions.
