tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/

Feature request: A function to check if a set of variables form a unique ID in a dataframe #7098

Closed bholtemeyer closed 1 week ago

bholtemeyer commented 1 month ago

I'd like to have a function to check if a set of variables form a unique ID in a dataframe, like this: https://search.r-project.org/CRAN/refmans/eeptools/html/isid.html

I think this would make code more readable as pipes would not need to be involved.

The function would return TRUE or FALSE: TRUE indicates that the variables uniquely identify the rows, and FALSE indicates that they do not.

ggrothendieck commented 1 week ago

If the reason for wanting this is to check, before using mutate(..., .by = ...), that the grouping gives the effect of rowwise, then perhaps it would be better to support something like .by = .ROWID.
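
For context, here is a minimal sketch of how that per-row effect can be approximated with existing tools. `.ROWID` is only a suggestion in this thread, not a dplyr feature; the temporary `row_id` column and the `x_max` output column are hypothetical names chosen for illustration, and `.by` assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Stand-in for the proposed `.by = .ROWID`: add an explicit row number,
# group by it for a single mutate() call, then drop it again.
res <- anscombe |>
  mutate(row_id = row_number()) |>
  mutate(x_max = max(x1, x4), .by = row_id) |>
  select(-row_id)

# Each group holds exactly one row, so max() acts row by row.
head(res[c("x1", "x4", "x_max")], 3)
```

Because `row_id` is unique by construction, no uniqueness check is needed beforehand. `rowwise()` is the existing dplyr alternative for the same effect.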

A one-liner that calculates isid would be:

isid <- function(data, ...) ! anyDuplicated(data[c(...)])

isid(anscombe) # TRUE
isid(anscombe, "x1", "x2") # TRUE
isid(anscombe, c("x1", "x2")) # TRUE
isid(anscombe, "x4") # FALSE

anscombe
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89
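
The one-liner above selects only the columns named in `...`. A slightly longer sketch, assuming that a call with no column names is meant to check every column, makes that default explicit. `isid_all` is a hypothetical name used only for this illustration.

```r
# Hypothetical variant of the one-liner; `isid_all` is an illustrative name.
isid_all <- function(data, ...) {
  cols <- c(...)
  # Assumption: no column names supplied means "use every column".
  if (is.null(cols)) cols <- names(data)
  !anyDuplicated(data[cols])
}

isid_all(anscombe)              # TRUE: all columns together identify the rows
isid_all(anscombe, "x1", "x2")  # TRUE
isid_all(anscombe, "x4")        # FALSE
```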

DavisVaughan commented 1 week ago

There are many ways to do this with existing tools, so I think it is a little too niche to warrant a dedicated helper in dplyr.

library(dplyr)
library(vctrs)

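# Returns TRUE when the vectors passed in `...`, taken together, contain no duplicated rows.
uniquely <- function(...) {
  args <- rlang::list2(...)
  # Give the columns placeholder names so they can be assembled into a data frame.
  names(args) <- paste0("..", seq_along(args))
  args <- vctrs::new_data_frame(args)
  # vec_duplicate_any() is TRUE if any row repeats, so negate it.
  !vctrs::vec_duplicate_any(args)
}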

anscombe |>
  summarise(
    res = !vec_duplicate_any(pick(x1, x2)),
    res2 = uniquely(x1, x2),
    res3 = n_distinct(x1, x2) == nrow(anscombe)
  )
#>    res res2 res3
#> 1 TRUE TRUE TRUE

anscombe |>
  summarise(
    res = !vec_duplicate_any(pick(x4)),
    res2 = uniquely(x4),
    res3 = n_distinct(x4) == nrow(anscombe)
  )
#>     res  res2  res3
#> 1 FALSE FALSE FALSE
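
When a check like the ones above comes back FALSE, the next question is usually which key values repeat. A minimal sketch with existing dplyr verbs follows; `dup_keys`, `no_dups`, and `n_rows` are hypothetical names chosen for illustration.

```r
library(dplyr)

# Key values of x4 that occur more than once, with their row counts.
dup_keys <- anscombe |>
  count(x4, name = "n_rows") |>
  filter(n_rows > 1)

dup_keys
#>   x4 n_rows
#> 1  8     10

# The same query on a key that is unique returns zero rows.
no_dups <- anscombe |>
  count(x1, x2, name = "n_rows") |>
  filter(n_rows > 1)

nrow(no_dups)
#> [1] 0
```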

See https://github.com/tidyverse/dplyr/issues/6660 for .by = row ideas