nathaneastwood / poorman

A poor man's dependency free grammar of data manipulation
https://nathaneastwood.github.io/poorman/

`select` is not consistent with `dplyr::select` when used on data.frame with duplicate column names #92

Open TimTeaFan opened 3 years ago

TimTeaFan commented 3 years ago

I was playing around with data.frames with duplicate column names and stumbled upon this inconsistency with {dplyr}:

library(dplyr)
dat <- data.frame(a = 1, b = 2, a = 3, check.names = FALSE) 

dat %>% poorman::select(a)
#>   a
#> 1 1

dat %>% dplyr::select(a)
#> Error: Names must be unique.
#> x These names are duplicated:
#>   * "a" at locations 1 and 2.

Created on 2021-05-24 by the reprex package (v0.3.0)

The question is: is {poorman} supposed to be 100% consistent with {dplyr}?

If yes then poorman::select should throw an error as well.

On the other hand, {poorman}, unlike {dplyr}, might not be bound in the same way to the concept of tidy data, and it would be nice to have a go-to package when dealing with untidy data.frames. In this case, both `a` columns should be selected.
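
To make the "select both `a` columns" option concrete, here is a minimal base R sketch. `select_all_named()` is a hypothetical helper written for illustration only; it is not part of {poorman} or {dplyr}. It assumes that subsetting by position and then restoring the names afterwards is acceptable, since `[.data.frame` may make duplicated names unique.

```r
dat <- data.frame(a = 1, b = 2, a = 3, check.names = FALSE)

# Hypothetical helper (not part of {poorman} or {dplyr}):
# return every column whose name equals `name`, keeping duplicated names.
select_all_named <- function(.data, name) {
  idx <- which(names(.data) == name)
  out <- .data[idx]                # `[.data.frame` may rename duplicates here
  names(out) <- names(.data)[idx]  # restore the original (duplicated) names
  out
}

res <- select_all_named(dat, "a")
names(res)
#> [1] "a" "a"
```

Whether returning a data.frame that still has duplicated names is desirable is exactly the design question raised above; the sketch only shows that it is possible in base R.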

Regarding mutate, the behavior differs as well:

dat %>% poorman::mutate(c = 4)
#>   a b a.1 c
#> 1 1 2   3 4

dat %>% dplyr::mutate(c = 4)
#> Error: Can't transform a data frame with duplicate names.

It seems like mutate automatically applies check.names = TRUE and silently renames the duplicated column. In this case, an error might be preferable (or, alternatively, the column names could be left untouched).
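
For reference, the same renaming can be reproduced in base R alone. This is only a sketch of one mechanism that yields `a.1`; it assumes nothing about how poorman::mutate is actually implemented. `data.frame()` defaults to `check.names = TRUE`, which runs the names through `make.names(unique = TRUE)`.

```r
dat <- data.frame(a = 1, b = 2, a = 3, check.names = FALSE)

# Default check.names = TRUE: the duplicated name is silently made unique
renamed <- data.frame(dat, c = 4)
names(renamed)
#> [1] "a"   "b"   "a.1" "c"

# check.names = FALSE: the original names are left untouched
kept <- data.frame(dat, c = 4, check.names = FALSE)
names(kept)
#> [1] "a" "b" "a" "c"
```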

I didn't consider this to be a "bug", so I opened a blank issue.

nathaneastwood commented 3 years ago

Hi @TimTeaFan, thanks for submitting this issue - it's an interesting one. I would say that given {dplyr} fails in these instances, {poorman} should also fail. My initial curiosity lies in wondering where this fails within {dplyr}. Is it an issue from {dplyr} itself, {tibble} or maybe {tidyselect}? Once I know that, I will be better placed to understand where {poorman} should capture and handle this type of issue. I will do some digging and get back to you!
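
As a rough idea of what failing could look like without any dependencies, here is a minimal sketch of a guard. `check_unique_names()` is a hypothetical name and the error wording is an assumption loosely modelled on the {dplyr} message quoted above; it is not existing {poorman} code.

```r
# Hypothetical guard (not existing {poorman} code): stop if any column
# name is duplicated, reporting the positions of each duplicated name.
check_unique_names <- function(.data) {
  nms <- names(.data)
  dupes <- unique(nms[duplicated(nms)])
  if (length(dupes) > 0L) {
    locs <- vapply(
      dupes,
      function(nm) paste(which(nms == nm), collapse = ", "),
      character(1)
    )
    stop(
      "Names must be unique. Duplicated: ",
      paste0('"', dupes, '" at locations ', locs, collapse = "; "),
      call. = FALSE
    )
  }
  invisible(.data)
}

dat <- data.frame(a = 1, b = 2, a = 3, check.names = FALSE)

# Capture the error message instead of aborting the script
msg <- tryCatch(check_unique_names(dat), error = function(e) conditionMessage(e))
msg
```

A guard like this could be called at the top of each verb, which would sidestep the question of where exactly {dplyr} raises its error.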

TimTeaFan commented 3 years ago

Regarding dplyr::select, the error is raised by tidyselect::eval_select. I dug into this a little in this SO answer. Regarding dplyr::mutate, I'm not sure whether the error is caused by {tidyselect}.