r-lib / vctrs

Generic programming with typed R vectors
https://vctrs.r-lib.org
Other
289 stars 66 forks source link

vec_rbind() enforces row names when input is a named list #966

Closed thomasp85 closed 4 years ago

thomasp85 commented 4 years ago

When passing in a named list of data frames without row names to vec_rbind() the return value is a data frame with row names. This change can potentially trigger a hard-to-debug renaming message, e.g.

dfs <- list(a = data.frame(x = 1), a = data.frame(x = 1))
vctrs::vec_rbind(!!!dfs)
#> New names:
#> * a -> a...1
#> * a -> a...2
#>       x
#> a...1 1
#> a...2 1
lionel- commented 4 years ago

Both vec_cbind() and vec_rbind() are sensitive to input names, so you need to unname the list.

Would with_name_repair_errors() help finding the source of the repair messages?

thomasp85 commented 4 years ago

I understand the reason, but this spills over into dplyr::bind_rows() where you suddenly get overwhelmed with name repair messages even though you pass in tibbles (that doesn’t have row names). I think this will end up confusing a huge amount of users (it confused both me and @Hadley for a reasonable amount of time).

I think a lot of people creates tibbles in an lapply call and binds them together. lapply might decide to name the elements and this will sometimes trigger a name repair

DavisVaughan commented 4 years ago

CRAN dplyr didn't set row names using the input names, and only used them when .id was set.

I imagine we just need to tweak bind_rows() to something like

if (is_null(.id)) {
  names(dots) <- NULL
} else {
  if (!is_string(.id)) {
    bad_args(".id", "must be a scalar string, ", "not {friendly_type_of(.id)} of length {length(.id)}")
  }
  if (!is_named(dots)) {
    names(dots) <- seq_along(dots)
  }
}
thomasp85 commented 4 years ago

If this is fixed in dplyr instead I’m good with it. I opened here at the request of Hadley

lionel- commented 4 years ago

Maybe we could support a sentinel for easily zapping the names, that bind_rows() could use.

vec_rbind(.names_to = "") or vec_rbind(.names_to = zap())

hadley commented 4 years ago

Alternatively, maybe this behaviour could occur only if the data frame already had row names?

This is a case where I worry that applying the vector naming rules to row names might lead us down a painful path.

lionel- commented 4 years ago

I think this would be confusing, I'd prefer disabling the rows-naming altogether.

Applying this behaviour in an existing function like bind_cols() can definitely cause problems. In a new function this is less clear. It might actually be positive. Since vec_cbind() is also sensitive to input names, it is good practice to think about names when splicing a list in one of the bind functions.

hadley commented 4 years ago

Why aren't these the same case?

vctrs::vec_c(a = c(1, 2, 3))
#> Error: Can't merge the outer name `a` with a vector of length > 1.
#> Please supply a `.name_spec` specification.

vctrs::vec_rbind(a = data.frame(1:3))
#>    X1.3
#> a1    1
#> a2    2
#> a3    3

Created on 2020-03-30 by the reprex package (v0.3.0)

lionel- commented 4 years ago

It's a good idea to make these consistent. This wouldn't help with Thomas' example with 1-row data frames though.

hadley commented 4 years ago

For @thomasp85's original problem, just improving the name repair would help:

library(vctrs)

vec_rbind(
  `1` = data.frame(x = 1:11),
  `11` = data.frame(x = 1)
)
#> New names:
#> * `11` -> `11...1`
#> * `11` -> `11...12`
#>          x
#> 11...1   1
#> 12       2
#> 13       3
#> 14       4
#> 15       5
#> 16       6
#> 17       7
#> 18       8
#> 19       9
#> 110     10
#> 111     11
#> 11...12  1

Created on 2020-03-30 by the reprex package (v0.3.0)

But I really do think it should be harder to opt into row names

lionel- commented 4 years ago

Making the sentinel zap() the default of .names_to and adding a .name_spec parameter would solve the issue.