Closed ArtemSokolov closed 4 years ago
library(tidyverse)
#iris %>% transmutate(Petal.Area = Petal.Width * Petal.Length)
iris %>%
as_tibble() %>%
mutate(Petal.Area = Petal.Width * Petal.Length) %>%
select(-Petal.Width, -Petal.Length)
#> # A tibble: 150 x 4
#> Sepal.Length Sepal.Width Species Petal.Area
#> <dbl> <dbl> <fct> <dbl>
#> 1 5.1 3.5 setosa 0.280
#> 2 4.9 3 setosa 0.280
#> 3 4.7 3.2 setosa 0.26
#> 4 4.6 3.1 setosa 0.3
#> 5 5 3.6 setosa 0.280
#> 6 5.4 3.9 setosa 0.68
#> 7 4.6 3.4 setosa 0.42
#> 8 5 3.4 setosa 0.3
#> 9 4.4 2.9 setosa 0.280
#> 10 4.9 3.1 setosa 0.15
#> # ... with 140 more rows
Created on 2018-07-21 by the reprex package (v0.2.0).
Thanks, I'm missing this functionality myself occasionally. Instead of parsing the expression, we could detect and record column access in the C++ code.
What should the verb do in a grouped scenario, if some groups access a different set of columns than other groups?
I think there are two natural options: union
- remove all columns that are accessed by at least one group, or intersection
- remove only the columns that are accessed by all groups. It might be nice to be able to specify which of the two should be used, but I'm not sure where such an option would go...
I'd rather support only intersection
, but that might be more difficult to implement than union
.
The biggest problem I see with both options is that type stability is compromised -- the resulting data frame might end up with different columns, depending on the data. Perhaps the safest thing to do would be to raise an error if different columns are accessed for different groups from this verb.
What naming alternatives do we have? It might be difficult to remember the differences between mutate()
, transmute()
and transmutate()
.
I agree that perhaps the appropriate action for accessing different columns across groups is to raise an error. I'm not sure I follow the type stability concerns; if the intersection
of all accessed columns is removed from each group, is that not a consistent transformation of each group?
For naming alternatives, perhaps we can turn to thesaurus: https://www.thesaurus.com/browse/transmute
If it doesn't become too annoying to type, metamorphose()
might be a viable option.
EDIT: A nicer alternative might be alter()
, as taken from https://www.thesaurus.com/browse/mutate
Suppose we have two group types: X and Y, the mutator code for group X accesses column a
, for Y column b
. If only groups of type X are present in the data, column a
is accessed and removed; if both types are present, both columns a
and b
are accessed -- which to remove? Both intersect and union produce results inconsistent with the first scenario.
We create something, but also take something else away. How about trade()
?
Thanks for the example, Kirill. That makes sense, and raising an error seems like the best approach to maintain consistency.
I think I have a slight preference towards alter()
, because it is semantically similar to mutate()
and transmute()
. But trade()
is a good choice as well!
morph()
?
Yes!! morph()
is perfect.
Hi, just chipping in to say I would find this feature very useful too. I like the transmutate()
name, as it is indeed a mixture of the two existing functions. Otherwise, I could see it being implemented as an additional argument to transmute()
, as in drop.all = FALSE
.
Thanks !
Hi! As the originator of the request on StackOverflow, great thanks to Artem for making it work and to all of you for picking up the banner.
Making this an optional argument to transmute()
as Vasco suggests would perfectly represent my initial wish for the function simply to work a little differently on request. And then there's no difficulty in picking a new word.
If you make it a separate function, I like either morph()
or transmutate()
.
I'd also like it better if it was integrated into transmute rather than adding a new verb (which would again be almost a synonym of merge, transform, gather... in English), with a unique additional parameter that would support the intersect
and union
options mentioned above.
It could also be a parameter of mutate
where we'd have keep
either "all"
(default), "new"
(default transmute behavior), "intersection"
or "union"
, and then transmute
would just be a shorthand for mutate with keep = "new"
.
tidygraph
already uses morph
for temporary graph topology modification, and it would be a shame to add this conflict.
I think that a whole new verb for a slight modification to mutate/transmute
is overkill. An argument to transmute
seems pretty reasonable, it just makes the decision of what to do with grouping variables explicit.
I like the idea of the .keep = "all"
argument to mutate()
, or perhaps .remove = "none"
(with options "other"
, "used"
and "used_once"
). But it's rather low priority now.
Based on #3760 it might be nice to allow dropping grouping columns via this function as well.
I totally missed this issue before. I really don't like how this would potentially complexity the C++ code when it's not that difficult to call a select()
afterwards. @hadley ?
I'm not terribly enthusiastic about the idea as it seems complex and potentially brittle. I think it would be best prototyped in another package, and if it was successful there we might consider adding to dplyr.
Such a feature could be provided by an external package, but we'd have to rebuild a lot of logic that's already available in dplyr. We already know which columns have been accessed by a verb (=the materialized promises), we just need to remove these columns.
It's not difficult to add a select()
, but more often than not it's redundant and duplicates code (=column names).
Right.
Although the materialized environment is reset after each column has been treated, and is not dealt with in the hybrid path.
It’s not as much work as i thought but I still prefer explicit select().
Querying which variables were accessed could be a feature of all eval_tidy()
data masks once we have an ALTENV implementation in R.
I initially felt very negatively about this idea, but I've since realised that this is basically the semantics that tidyr uses in functions like nest()
, separate()
, extract()
, unite()
etc.
Might morph()
be a good name for this?
@lionel- Again, morph
is used by tidygraph
and it would be really disappointing to see this feature interfere with that package. Also, I'd argue that morph
is hardly descriptive of the proposed feature. I'd suggest transmutate
or @krlmlr's suggestion trade
.
Also, I still think this functionality could be achieved with an argument to transmute
or mutate
, rather than requiring a whole new verb.
I'd suggest
transmutate
or @krlmlr's suggestiontrade
.
I find those less descriptive of the proposed feature.
this functionality could be achieved with an argument to
transmute
ormutate
We're thinking about a new workflow based on tibble return values where morph()
might play an important part. In that case, an argument would get in the way of important patterns. See https://github.com/tidyverse/tidyr/issues/523#issuecomment-461871357 for an example.
Perhaps narrow()
would work?
narrow()
is not correct because this operation can also widen. For instance, packing and unpacking a df-col:
mtcars %>% morph(df_col = tibble(cyl, am)) %>% morph(!!!df_col)
Edit: oops splicing wouldn't work since that's quoting time and the columns are not in scope then. We'd need to use auto-splicing of unnamed tibbles:
mtcars %>% morph(df_col = tibble(cyl, am)) %>% morph(df_col)
@lionel- thanks for referencing the other discussion. Sounds like the original feature request in this issue has been co-opted into a broader discussion about tidyverse workflows. I admit I'm struggling to see the relationship between the relatively simple feature request described in this issue and the packing/unpacking workflow described in #523.
Just realised morph()
+ registered variables might have undesirable effects. For instance, using cols_if()
within morph()
will trigger column access for the whole tibble. And if we somehow distinguish lexical access and dynamic access, this means that this won't work properly:
data %>% morph(df_col = sels(starts_with("d"))
Unless we find a way to suspend the detection of column access? Then we'd disable it during discovery (mapping predicates over a column) and reenable it while constructing the tibble. In fact it seems that such a mechanism will be generally useful for creating tools compatible with morph.
@mkoohafkan This has never been a simple feature request because it is challenging to make precise existing what "using" a variable means. I don't think the use of morph in tidygraph necessarily prevents it's use here.
@lionel- yeah, we need to distinguish between inspecting a variable to determine whether or not it should be used, and actually "using" it.
Inspecting variables in morph_at()
or morph_if()
(or the new-style equivalents) would/could happen before the actual morph()
, so we can distinguish inspection and access? Why would inspection of names trigger an access?
The new style equivalent will run inside the morph()
.
Why would inspection of names trigger an access?
We also need to inspect data for predicate-based selections.
A few thoughts on implementation given the recent changes to dplyr internals: Implementing morph()
looks to be relatively straightforward — in the binding functions in the data mask, we'd just track record whenever a variable was materialised. (The only trick is to do this in such a way that it doesn't affect ordinary performance, but that's not too hard). @lionel-'s concern about materialising columns doesn't apply to across()
since it already avoids materialising all variables by using the full data set in eval_select()
.
My primary concern is that this will be harder to implement if we switch to using ALTREP slices, but we could probably still have a fallback for morph()
that used the old approach.
Dear
dplyr
developers,A recent Stack Overflow question raised an interesting use case of having columns fed to a
mutate()
call automatically removed from the result. To do this, the mutator would need to parse the input expressions to determine what symbols were used, and I made the first pass at designing such a function. The question author liked my answer and suggested that I contribute it todplyr
.I am happy to work on a PR with a more robust implementation, but I wanted to check with you if such a feature would align with your design principles and the spirit of the package.
Thanks. Big fan of your work. -Artem