tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/

`summarise()` breaks when using `across()` with name collisions #7000

Closed cobac closed 6 months ago

cobac commented 6 months ago

I ran into this behavior the other day, and I have to assume it is a bug.

Running on dplyr 1.1.4, R 4.3.2.

When performing some aggregations with summarise(), if the names of the new columns contain the name of one of the original columns, the function breaks.

  library(tidyverse)
  table <- tibble(word = runif(100), x = runif(100))

The new columns, e.g. max_word, contain the name of the original column word.

  table |>
    summarise(word_median = median(word),
              max_word = max(word),
              min_word = min(word),
              across(contains("ord"), n_distinct))
# A tibble: 1 × 4
  word_median max_word min_word  word
        <int>    <int>    <int> <int>
1           1        1        1   100

If they don't, the function works as expected.

  table |>
    summarise(wird_median = median(word),
              max_wird = max(word),
              min_wird = min(word),
              across(contains("ord"), n_distinct))
# A tibble: 1 × 4
  wird_median max_wird min_wird  word
        <dbl>    <dbl>    <dbl> <int>
1       0.493    0.996   0.0103   100

This is only the case when using across() within summarise().

  table |>
    summarise(word_median = median(word),
              max_word = max(word),
              min_word = min(word))
# A tibble: 1 × 3
  word_median max_word min_word
        <dbl>    <dbl>    <dbl>
1       0.493    0.996   0.0103

psychelzh commented 6 months ago

Maybe this is not an issue on the dplyr side? It is a feature that summarise() can use variables generated by previous expressions. For example, this will work:

  table |>
    summarise(word_median = median(word), n = length(word_median))

A workaround might be to call across() first, and then your other three expressions.
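
A minimal sketch of that workaround, under some assumptions that are not in the original post. Because summarise() lets later expressions see earlier results, calling across() first without renaming would overwrite word with its distinct count before median(), max() and min() read it. The sketch therefore assumes the `.names` argument of across() to write the count to a new column (the name `word_n_distinct` is only illustrative). A second option keeps across() last but selects the column by exact name with all_of(), so the newly created columns are not matched. The `set.seed()` call and the result names `res_first` and `res_exact` are also illustrative additions.

```r
library(dplyr)

set.seed(1)  # fixed seed for reproducibility (not in the original post)
table <- tibble(word = runif(100), x = runif(100))

# Option 1: call across() first. `.names` writes the count to a new column
# (illustrative name), so `word` itself is not overwritten before
# median(), max() and min() use it.
res_first <- table |>
  summarise(across(contains("ord"), n_distinct, .names = "{.col}_n_distinct"),
            word_median = median(word),
            max_word = max(word),
            min_word = min(word))

# Option 2: keep across() last, but select the column by its exact name,
# so the new word_median, max_word and min_word columns are not matched.
res_exact <- table |>
  summarise(word_median = median(word),
            max_word = max(word),
            min_word = min(word),
            across(all_of("word"), n_distinct))
```

In both results the three summary columns should stay `<dbl>`, and only the distinct count is an `<int>`.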

cobac commented 6 months ago

> is a feature that summarise() can use variables generated by previous expressions

Oh, I was not aware of that. Thanks.