tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/
Other
4.78k stars 2.12k forks source link

Feature/Function Request: `is_equal` #7040

Closed awpsoras closed 5 months ago

awpsoras commented 5 months ago

When performing a series of joins and unexpectedly returning more rows than one started with, an is_equal function would be a very useful predicate function within select( where() ).

If you have an alleged primary key but you end up with >1 rows of a primary key, you can use is_equal to find which columns were forcing the duplication of a primary key row.

This will help especially in dbplyr cases during sequential joins where the data have too many columns to visually inspect.

I have written a mini blog post showing some potential versions (in R) and application of this function and published it on Rpubs: https://rpubs.com/seadoo/is_equal.

Below is the use of is_equal in a small table (copied from the post) showing the desired output.

bigtable <- tibble(
  id = c(1, 1, 2, 3, 3, 3),
  val1 = c(12, 12, 13, 14, 15, 15),
  val2 = c("orange", "orange", "carrot", "grape", "grape", "grape"),
  val3 = c("a", "b", "c", "f", "g", "d")
)

## # A tibble: 6 × 4
##      id  val1 val2   val3 
##   <dbl> <dbl> <chr>  <chr>
## 1     1    12 orange a    
## 2     1    12 orange b    
## 3     2    13 carrot c    
## 4     3    14 grape  f    
## 5     3    15 grape  g    
## 6     3    15 grape  d

bigtable %>% 
  group_by(id) %>% 
  filter( n() > 1 ) %>% 
  group_modify(function(dat, key) {
    dat %>% select( !where(is_equal) )
  }) %>% 
  ungroup()

## # A tibble: 5 × 3
##      id val3   val1
##   <dbl> <chr> <dbl>
## 1     1 a        NA
## 2     1 b        NA
## 3     3 f        14
## 4     3 g        15
## 5     3 d        15

If this is supported as an eligible feature, I'd be happy to work on it and write up documentation!

DavisVaughan commented 5 months ago

Have you seen the new relationship argument of joins? I think those may be helpful to you. It sounds like you want one-to-one or one-to-many https://dplyr.tidyverse.org/reference/mutate-joins.html. See also, multiple and unmatched.

DavisVaughan commented 5 months ago

I think this is a bit too specific for dplyr, but it sounds useful for an extension package!

I do think the best way to avoid this is to catch it during the join rather than with post hoc analysis of the result (like with the new join args)

awpsoras commented 5 months ago

Sad, I was hoping it would fit right in with other helper functions like between. Looking further, I suppose n_distinct may also work indirectly. You guys have thought of everything! ... select( !where(~ n_distinct(.x) == 1))

I also definitely agree post hoc is always better!