tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/

dplyr functions change variables for no reason since dplyr_0.8.0 #4221

Closed mariodejung closed 5 years ago

mariodejung commented 5 years ago

I couldn't think of a better title, but I have run into problems since the new dplyr release.

I narrowed it down to a single dplyr call, but it changes other variables in my script, even though they are not "touched" by the dplyr call.

First of all, there is a group_by_at call, because 'if there is a column "species", I want to group by it'. If the column does not exist, I get a warning, which was fine for me, but I don't understand the class changes for the other variables. This causes problems downstream in my script because older functions can't handle tibbles yet.

library(dplyr)

set.seed(1)
# df <- data.frame(species=c('a','b'),
#                        Intensity=rnorm(1000, 25, 3))
df <- data.frame(Intensity=rnorm(1000, 25, 3))
class(df)
df_backup <- df

df_test <- 
  df %>% 
  dplyr::group_by_at(vars(matches('^species$'))) %>%
  dplyr::summarise(`5%`=stats::quantile(log10(Intensity),.05),
                   `50%`=stats::quantile(log10(Intensity),.50),
                   `95%`=stats::quantile(log10(Intensity),.95)) 
class(df)
class(df_test)
class(df_backup)

batpigandme commented 5 years ago

Hmm, that is strange. For future reference, if you can run your example through reprex, it's helpful to see the input and the output right in the issue.

library(dplyr)

set.seed(1)
df <- data.frame(Intensity=rnorm(1000, 25, 3))
class(df)
#> [1] "data.frame"
df_backup <- df
class(df_backup)
#> [1] "data.frame"
df_test <- df %>% 
  dplyr::group_by_at(vars(matches('^species$'))) %>%
  dplyr::summarise(`5%`=stats::quantile(log10(Intensity),.05),
                   `50%`=stats::quantile(log10(Intensity),.50),
                   `95%`=stats::quantile(log10(Intensity),.95)) 
class(df)
#> [1] "tbl_df"     "tbl"        "data.frame"
class(df_test)
#> [1] "tbl_df"     "tbl"        "data.frame"
class(df_backup)
#> [1] "tbl_df"     "tbl"        "data.frame"

library(lobstr)
obj_addr(df)
#> [1] "0x7f86a3fb95d8"
obj_addr(df_backup)
#> [1] "0x7f86a3fb95d8"
obj_addr(df_test)
#> [1] "0x7f86a48584a8"

Created on 2019-02-25 by the reprex package (v0.2.1)

I added the object address code from Binding basics in Advanced R. df and df_backup are just two names bound to the same value, but it is surprising that copy-on-modify doesn't preserve df_backup as it was.
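
For comparison, here is a minimal base R sketch of what copy-on-modify would normally be expected to do when one of two names bound to the same data frame is modified. The variable names mirror the reprex above; the class assignment is only a stand-in for illustration and is not what dplyr does internally.

```r
# Base R only: two names bound to the same data frame.
df <- data.frame(Intensity = c(1, 10, 100))
df_backup <- df

# Modifying the object through one name at the R level copies it first,
# so the object seen through the other name keeps its original class.
class(df) <- c("tbl_df", "tbl", "data.frame")

class(df)
#> [1] "tbl_df"     "tbl"        "data.frame"
class(df_backup)
#> [1] "data.frame"
```

In the reprex above, df_backup changes class as well, which suggests the modification happens in place rather than through the usual R-level replacement.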


mariodejung commented 5 years ago

I even use this call within a function and it changes the objects outside the function!

library(dplyr)

set.seed(1)
df<- data.frame(Intensity=rnorm(1000, 25, 3))
class(df)
#> [1] "data.frame"
df_backup <- df
class(df_backup)
#> [1] "data.frame"
my_plotAbundanceRank <- function(data_set) {
    quantile_df <- 
        data_set %>% 
        dplyr::group_by_at(vars(matches('^species$'))) %>%
        dplyr::summarise(`5%`=stats::quantile(log10(Intensity),.05),
                         `50%`=stats::quantile(log10(Intensity),.50),
                         `95%`=stats::quantile(log10(Intensity),.95)) 
}
print(my_plotAbundanceRank(df))
#> # A tibble: 1 x 3
#>    `5%` `50%` `95%`
#>   <dbl> <dbl> <dbl>
#> 1  1.30  1.40  1.48
class(df)
#> [1] "tbl_df"     "tbl"        "data.frame"
class(df_backup)
#> [1] "tbl_df"     "tbl"        "data.frame"

DavisVaughan commented 5 years ago

I think we just need a shallow copy in the case when no groups are specified. https://github.com/tidyverse/dplyr/blob/16125d12d809286ff2f18be8187b036e9ddbbc0e/src/group_indices.cpp#L645

Basically it should do what ungroup_grouped_df() does: https://github.com/tidyverse/dplyr/blob/16125d12d809286ff2f18be8187b036e9ddbbc0e/src/group_indices.cpp#L669

suppressPackageStartupMessages(library(dplyr))
#> Warning: package 'dplyr' was built under R version 3.5.2

x <- data.frame(y = 1)

x
#>   y
#> 1 1

dplyr::group_by(x)
#> # A tibble: 1 x 1
#>       y
#>   <dbl>
#> 1     1

x
#> # A tibble: 1 x 1
#>       y
#>   <dbl>
#> 1     1

suppressPackageStartupMessages(library(dplyr))
#> Warning: package 'dplyr' was built under R version 3.5.2

x <- data.frame(y = 1)

x
#>   y
#> 1 1

dplyr::group_by(x, y)
#> # A tibble: 1 x 1
#> # Groups:   y [1]
#>       y
#>   <dbl>
#> 1     1

x
#>   y
#> 1 1
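
A possible workaround sketch, assuming the in-place change only happens when no grouping variables are selected (as the two examples above suggest): skip the group_by_at call entirely when the column is absent. The helper group_by_if_present is hypothetical and not part of dplyr, and this has not been checked against dplyr 0.8.0 itself.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): group by `col` only if it exists,
# so group_by_at() is never called with an empty selection.
group_by_if_present <- function(data, col) {
  if (col %in% names(data)) {
    dplyr::group_by_at(data, col)
  } else {
    data
  }
}

set.seed(1)
df <- data.frame(Intensity = rnorm(1000, 25, 3))
df_backup <- df

df_test <- df %>%
  group_by_if_present("species") %>%
  dplyr::summarise(`5%`  = stats::quantile(log10(Intensity), .05),
                   `50%` = stats::quantile(log10(Intensity), .50),
                   `95%` = stats::quantile(log10(Intensity), .95))

# Check whether the inputs are still plain data frames.
class(df)
class(df_backup)
```

When the column does exist, the helper takes the ordinary grouped path, which in the second example above leaves the input unchanged.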

lock[bot] commented 5 years ago

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/