tidyverse / tidyr

Tidy Messy Data
https://tidyr.tidyverse.org/
Other
1.39k stars 420 forks source link

Use `tidyselect::eval_select(allow_positional = FALSE)` for `id_cols` when possible #1541

Open bpcolman opened 9 months ago

bpcolman commented 9 months ago

In going through some old code that I hadn't touched in a while, I found that pivot_wider was not working and was giving me the following error:

#> Error in `rethrow_id_cols_oob()`:
#> ! `i` must be a single string, not an integer vector.
#> ℹ This is an internal error that was detected in the tidyr package.
#>   Please report it at <https://github.com/tidyverse/tidyr/issues> with a reprex
#>   (<https://tidyverse.org/help/>) and the full backtrace.

Digging deeper, I think I've sorted out the overarching timing of the issue, and it is due to changes made between tidyr 1.1.4 and 1.2.0 that changes the behavior, and between versions 1.2.1 and 1.3.0 where the error message changed to the above. I've included a reprex below that shows behavior in ver 1.1.4 and 1.3.1:

# Install `dplyr` and `devtools` if they aren't already:
list.of.packages <- c("dplyr", "devtools")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

# Load dplyr and devtools packages
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(devtools)
#> Loading required package: usethis

# Record installed version of tidyr if there is one, to allow reinstallation of
# the current version, or provide an error if there is not one:
tidy_ver <- packageVersion("tidyr") # record the version number of existing tidyr

detach("package:tidyr", unload = TRUE)
#> Error in detach("package:tidyr", unload = TRUE): invalid 'name' argument

remove.packages("tidyr")
#> Removing package from '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
#> (as 'lib' is unspecified)

install_version("tidyr", 
                version = "1.1.4", 
                repos = "http://cran.us.r-project.org",
                quiet = TRUE)

library(tidyr)

# Set a seed for consistency
set_seed = 312

# Make some data (albeit with not the tidiest structure)

(df <- data.frame(site = c(rep((1:3), each = 2)),
                 time = rep(c("day", "night"), 3),
                 temp = rnorm(6, rep(c(20, 10), each = 1), 4),
                 day_of_exp = c(rep(1, 2), 
                                rep(11, 2),
                                rep(21, 2))))
#>   site  time      temp day_of_exp
#> 1    1   day 24.681337          1
#> 2    1 night  2.646830          1
#> 3    2   day 19.473059         11
#> 4    2 night 10.124520         11
#> 5    3   day 23.304546         21
#> 6    3 night  7.414085         21

# Pivot data based on site (column 1) and day_of_exp (column 4) using column
# numbers:

df %>% pivot_wider(id_cols = c(1, 4),
                   names_from = time,
                   values_from = temp)
#> # A tibble: 3 × 4
#>    site day_of_exp   day night
#>   <int>      <dbl> <dbl> <dbl>
#> 1     1          1  24.7  2.65
#> 2     2         11  19.5 10.1 
#> 3     3         21  23.3  7.41

# Works well. We can also Pivot data based on site and day_of_exp using
# column names:

df %>% pivot_wider(id_cols = c(site, day_of_exp),
                   names_from = time,
                   values_from = temp)
#> # A tibble: 3 × 4
#>    site day_of_exp   day night
#>   <int>      <dbl> <dbl> <dbl>
#> 1     1          1  24.7  2.65
#> 2     2         11  19.5 10.1 
#> 3     3         21  23.3  7.41

# If for some reason, we include a variable that doesn't exist in the dataset,
# we get a simple error message:

df %>%
  pivot_wider(id_cols = c(1, 5),
              names_from = time,
              values_from = temp)
#> Error in `pivot_wider_spec()`:
#> ! Can't subset columns past the end.
#> ℹ Location 5 doesn't exist.
#> ℹ There are only 4 columns.

# So that gives an error message, and the error message makes it clear that the
# error is caused by choosing a column that doesn't exist. 

# Working with `tidyr` v. 1.3.1, things don't work in the same way.

# Detach and uninstall `tidyr`1.1.3 and install and load 1.3.1:

detach("package:tidyr", unload = TRUE)
remove.packages("tidyr")
#> Removing package from '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
#> (as 'lib' is unspecified)

install_version("tidyr", 
                version = "1.3.1", 
                repos = "http://cran.us.r-project.org", 
                quiet = TRUE)

library("tidyr")

# In ver 1.3.1, se can try to pivot data wider based on site (column 1) and
# day_of_exp (column 4) using column numbers:

df %>% pivot_wider(id_cols = c(1, 4),
                   names_from = time,
                   values_from = temp)
#> Error in `rethrow_id_cols_oob()`:
#> ! `i` must be a single string, not an integer vector.
#> ℹ This is an internal error that was detected in the tidyr package.
#>   Please report it at <https://github.com/tidyverse/tidyr/issues> with a reprex
#>   (<https://tidyverse.org/help/>) and the full backtrace.

# It does not work, and we get a new error message. We can still use names:

df %>% pivot_wider(id_cols = c(site, day_of_exp),
                   names_from = time,
                   values_from = temp)
#> # A tibble: 3 × 4
#>    site day_of_exp   day night
#>   <int>      <dbl> <dbl> <dbl>
#> 1     1          1  24.7  2.65
#> 2     2         11  19.5 10.1 
#> 3     3         21  23.3  7.41

# If we include a variable that doesn't exist in the dataset, we get the same
# error message as when we tried to call the 4th column:

df %>%
  pivot_wider(id_cols = c(1, 5),
              names_from = time,
              values_from = temp)
#> Error in `rethrow_id_cols_oob()`:
#> ! `i` must be a single string, not an integer vector.
#> ℹ This is an internal error that was detected in the tidyr package.
#>   Please report it at <https://github.com/tidyverse/tidyr/issues> with a reprex
#>   (<https://tidyverse.org/help/>) and the full backtrace.

# It appears that the cause of this error when calling "id_cols" by number is
# that pivot_wider calls the columns from an object where the columns that are
# called in the "names_from" or "values_from" arguments have already been
# removed, leaving a object in this case that that has dimensions 6 x 2 instead
# of 6 x 4. This is corroborated if we call columns 1 and 2, which should call
# our site and day_of_exp variables if time and temp are removed:

df %>%
  pivot_wider(id_cols = c(1, 2),
              names_from = time,
              values_from = temp)
#> # A tibble: 3 × 4
#>    site day_of_exp   day night
#>   <int>      <dbl> <dbl> <dbl>
#> 1     1          1  24.7  2.65
#> 2     2         11  19.5 10.1 
#> 3     3         21  23.3  7.41

# This works, suggesting that this is where the error lies. 

# Detach library, remove tidyr, and reinstall original version if it was already
# installed:

detach("package:tidyr", unload = TRUE)

remove.packages("tidyr")
#> Removing package from '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
#> (as 'lib' is unspecified)

install_version("tidyr", 
                version = tidy_ver, 
                repos = "http://cran.us.r-project.org",
                quiet = TRUE)

Created on 2024-02-15 with reprex v2.1.0

A workaround is to use the column names in the id_cols argument; however, given older versions of the pivot_wider function allowed the use of column numbers and there is an increased potential for error when entering names, it would be nice if pivot_wider functioned like it did in tidyr 1.1.4 and before. I would also advocate for a return to the more descriptive error message when choosing variables that don't exist if that is possible, and including a descriptive warning/error if one attempts to choose a variable that is included in the names_from or values_from arguments.

If the removal of straightforward selection by column number is a feature and not a bug, I would request an update for the help and vignette to make this even clearer.

rion-saeon commented 7 months ago

In my case, I could specify id_cols like this and it worked. All indexes (col numbers) did not (i.e., when month_n is specified as 8)

pivot_wider(id_cols = 
                c(1:4, month_n), 
              names_from = "records_or_cards", 
              values_from = value)
DavisVaughan commented 1 month ago

Minimal reprex

library(tidyr)

# Set a seed for consistency
set_seed = 312

df <- data.frame(
  site = c(rep((1:3), each = 2)),
  time = rep(c("day", "night"), 3),
  temp = rnorm(6, rep(c(20, 10), each = 1), 4),
  day_of_exp = c(rep(1, 2), rep(11, 2), rep(21, 2))
)

# Columns 2 (time) and 3 (temp) are removed before evaluating the tidyselection
# for `id_cols`
df %>% pivot_wider(
  id_cols = c(1, 4),
  names_from = time,
  values_from = temp
)
#> Error in `rethrow_id_cols_oob()`:
#> ! `i` must be a single string, not an integer vector.
#> ℹ This is an internal error that was detected in the tidyr package.
#>   Please report it at <https://github.com/tidyverse/tidyr/issues> with a reprex
#>   (<https://tidyverse.org/help/>) and the full backtrace.

Comes from:

~I think it is technically possible that we could evaluate the id_cols tidyselection in the context of the full data frame, and then do a post-hoc check that it doesn't collide with any names specified by names_from or values_from (which is why we implement it this way in the first place).~ Actually this would not work, because then everything() would also cause the error to get thrown, and we don't want that.

But I'd argue that using positions to perform the selection here is actually not a great thing to do. You should really prefer selecting with names. Ideally I think we'd keep the current behavior but prevent selecting by position through eval_select()


We've decided to wait for eval_select(allow_positional = FALSE) and use that for id_cols when we can to throw an informative error in this case and force the user to use names.

rion-saeon commented 1 month ago

@DavisVaughan the reason I want to use positions is that it seems more parsimonious and for the lazy coder like me. However, using indices are more risky than specifying names.