tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/
Other
4.78k stars 2.12k forks source link

morph() to automatically remove columns "used up" by a mutate() #3721

Closed ArtemSokolov closed 4 years ago

ArtemSokolov commented 6 years ago

Dear dplyr developers,

A recent Stack Overflow question raised an interesting use case of having columns fed to a mutate() call automatically removed from the result. To do this, the mutator would need to parse the input expressions to determine what symbols were used, and I made the first pass at designing such a function. The question author liked my answer and suggested that I contribute it to dplyr.

I am happy to work on a PR with a more robust implementation, but I wanted to check with you if such a feature would align with your design principles and the spirit of the package.

Thanks. Big fan of your work. -Artem

krlmlr commented 6 years ago
library(tidyverse)

#iris %>% transmutate(Petal.Area = Petal.Width * Petal.Length)
iris %>%
  as_tibble() %>% 
  mutate(Petal.Area = Petal.Width * Petal.Length) %>% 
  select(-Petal.Width, -Petal.Length)
#> # A tibble: 150 x 4
#>    Sepal.Length Sepal.Width Species Petal.Area
#>           <dbl>       <dbl> <fct>        <dbl>
#>  1          5.1         3.5 setosa       0.280
#>  2          4.9         3   setosa       0.280
#>  3          4.7         3.2 setosa       0.26 
#>  4          4.6         3.1 setosa       0.3  
#>  5          5           3.6 setosa       0.280
#>  6          5.4         3.9 setosa       0.68 
#>  7          4.6         3.4 setosa       0.42 
#>  8          5           3.4 setosa       0.3  
#>  9          4.4         2.9 setosa       0.280
#> 10          4.9         3.1 setosa       0.15 
#> # ... with 140 more rows

Created on 2018-07-21 by the reprex package (v0.2.0).

Thanks, I'm missing this functionality myself occasionally. Instead of parsing the expression, we could detect and record column access in the C++ code.

What should the verb do in a grouped scenario, if some groups access a different set of columns than other groups?

ArtemSokolov commented 6 years ago

I think there are two natural options: union - remove all columns that are accessed by at least one group, or intersection - remove only the columns that are accessed by all groups. It might be nice to be able to specify which of the two should be used, but I'm not sure where such an option would go...

krlmlr commented 6 years ago

I'd rather support only intersection, but that might be more difficult to implement than union.

The biggest problem I see with both options is that type stability is compromised -- the resulting data frame might end up with different columns, depending on the data. Perhaps the safest thing to do would be to raise an error if different columns are accessed for different groups from this verb.

What naming alternatives do we have? It might be difficult to remember the differences between mutate(), transmute() and transmutate().

ArtemSokolov commented 6 years ago

I agree that perhaps the appropriate action for accessing different columns across groups is to raise an error. I'm not sure I follow the type stability concerns; if the intersection of all accessed columns is removed from each group, is that not a consistent transformation of each group?

For naming alternatives, perhaps we can turn to thesaurus: https://www.thesaurus.com/browse/transmute If it doesn't become too annoying to type, metamorphose() might be a viable option.

EDIT: A nicer alternative might be alter(), as taken from https://www.thesaurus.com/browse/mutate

krlmlr commented 6 years ago

Suppose we have two group types: X and Y, the mutator code for group X accesses column a, for Y column b. If only groups of type X are present in the data, column a is accessed and removed; if both types are present, both columns a and b are accessed -- which to remove? Both intersect and union produce results inconsistent with the first scenario.

We create something, but also take something else away. How about trade() ?

ArtemSokolov commented 6 years ago

Thanks for the example, Kirill. That makes sense, and raising an error seems like the best approach to maintain consistency.

I think I have a slight preference towards alter(), because it is semantically similar to mutate() and transmute(). But trade() is a good choice as well!

krlmlr commented 6 years ago

morph() ?

ArtemSokolov commented 6 years ago

Yes!! morph() is perfect.

vlepori commented 6 years ago

Hi, just chipping in to say I would find this feature very useful too. I like the transmutate() name, as it is indeed a mixture of the two existing functions. Otherwise, I could see it being implemented as an additional argument to transmute(), as in drop.all = FALSE.

Thanks !

dekaufman commented 6 years ago

Hi! As the originator of the request on StackOverflow, great thanks to Artem for making it work and to all of you for picking up the banner.

Making this an optional argument to transmute() as Vasco suggests would perfectly represent my initial wish for the function simply to work a little differently on request. And then there's no difficulty in picking a new word.

If you make it a separate function, I like either morph() or transmutate().

moodymudskipper commented 6 years ago

I'd also like it better if it was integrated into transmute rather than adding a new verb (which would again be almost a synonym of merge, transform, gather... in English), with a unique additional parameter that would support the intersect and union options mentioned above.

It could also be a parameter of mutate where we'd have keep either "all" (default), "new" (default transmute behavior), "intersection" or "union", and then transmute would just be a shorthand for mutate with keep = "new".

mkoohafkan commented 6 years ago

tidygraph already uses morph for temporary graph topology modification, and it would be a shame to add this conflict.

I think that a whole new verb for a slight modification to mutate/transmute is overkill. An argument to transmute seems pretty reasonable, it just makes the decision of what to do with grouping variables explicit.

krlmlr commented 6 years ago

I like the idea of the .keep = "all" argument to mutate(), or perhaps .remove = "none" (with options "other", "used" and "used_once"). But it's rather low priority now.

mkoohafkan commented 6 years ago

Based on #3760 it might be nice to allow dropping grouping columns via this function as well.

romainfrancois commented 5 years ago

I totally missed this issue before. I really don't like how this would potentially complexity the C++ code when it's not that difficult to call a select() afterwards. @hadley ?

hadley commented 5 years ago

I'm not terribly enthusiastic about the idea as it seems complex and potentially brittle. I think it would be best prototyped in another package, and if it was successful there we might consider adding to dplyr.

krlmlr commented 5 years ago

Such a feature could be provided by an external package, but we'd have to rebuild a lot of logic that's already available in dplyr. We already know which columns have been accessed by a verb (=the materialized promises), we just need to remove these columns.

It's not difficult to add a select(), but more often than not it's redundant and duplicates code (=column names).

romainfrancois commented 5 years ago

Right.

Although the materialized environment is reset after each column has been treated, and is not dealt with in the hybrid path.

It’s not as much work as i thought but I still prefer explicit select().

lionel- commented 5 years ago

Querying which variables were accessed could be a feature of all eval_tidy() data masks once we have an ALTENV implementation in R.

hadley commented 5 years ago

I initially felt very negatively about this idea, but I've since realised that this is basically the semantics that tidyr uses in functions like nest(), separate(), extract(), unite() etc.

lionel- commented 5 years ago

Might morph() be a good name for this?

mkoohafkan commented 5 years ago

@lionel- Again, morph is used by tidygraph and it would be really disappointing to see this feature interfere with that package. Also, I'd argue that morph is hardly descriptive of the proposed feature. I'd suggest transmutate or @krlmlr's suggestion trade.

mkoohafkan commented 5 years ago

Also, I still think this functionality could be achieved with an argument to transmute or mutate, rather than requiring a whole new verb.

lionel- commented 5 years ago

I'd suggest transmutate or @krlmlr's suggestion trade.

I find those less descriptive of the proposed feature.

this functionality could be achieved with an argument to transmute or mutate

We're thinking about a new workflow based on tibble return values where morph() might play an important part. In that case, an argument would get in the way of important patterns. See https://github.com/tidyverse/tidyr/issues/523#issuecomment-461871357 for an example.

lionel- commented 5 years ago

Perhaps narrow() would work?

lionel- commented 5 years ago

narrow() is not correct because this operation can also widen. For instance, packing and unpacking a df-col:

mtcars %>% morph(df_col = tibble(cyl, am)) %>% morph(!!!df_col)

Edit: oops splicing wouldn't work since that's quoting time and the columns are not in scope then. We'd need to use auto-splicing of unnamed tibbles:

mtcars %>% morph(df_col = tibble(cyl, am)) %>% morph(df_col)
mkoohafkan commented 5 years ago

@lionel- thanks for referencing the other discussion. Sounds like the original feature request in this issue has been co-opted into a broader discussion about tidyverse workflows. I admit I'm struggling to see the relationship between the relatively simple feature request described in this issue and the packing/unpacking workflow described in #523.

lionel- commented 5 years ago

Just realised morph() + registered variables might have undesirable effects. For instance, using cols_if() within morph() will trigger column access for the whole tibble. And if we somehow distinguish lexical access and dynamic access, this means that this won't work properly:

data %>% morph(df_col = sels(starts_with("d"))

Unless we find a way to suspend the detection of column access? Then we'd disable it during discovery (mapping predicates over a column) and reenable it while constructing the tibble. In fact it seems that such a mechanism will be generally useful for creating tools compatible with morph.

hadley commented 5 years ago

@mkoohafkan This has never been a simple feature request because it is challenging to make precise existing what "using" a variable means. I don't think the use of morph in tidygraph necessarily prevents it's use here.

@lionel- yeah, we need to distinguish between inspecting a variable to determine whether or not it should be used, and actually "using" it.

krlmlr commented 5 years ago

Inspecting variables in morph_at() or morph_if() (or the new-style equivalents) would/could happen before the actual morph(), so we can distinguish inspection and access? Why would inspection of names trigger an access?

lionel- commented 5 years ago

The new style equivalent will run inside the morph().

Why would inspection of names trigger an access?

We also need to inspect data for predicate-based selections.

hadley commented 4 years ago

A few thoughts on implementation given the recent changes to dplyr internals: Implementing morph() looks to be relatively straightforward — in the binding functions in the data mask, we'd just track record whenever a variable was materialised. (The only trick is to do this in such a way that it doesn't affect ordinary performance, but that's not too hard). @lionel-'s concern about materialising columns doesn't apply to across() since it already avoids materialising all variables by using the full data set in eval_select().

My primary concern is that this will be harder to implement if we switch to using ALTREP slices, but we could probably still have a fallback for morph() that used the old approach.