r-lib / vctrs

Generic programming with typed R vectors
https://vctrs.r-lib.org

`vec_unique` produces named output #1442

Open eutwt opened 3 years ago

eutwt commented 3 years ago

I think it would be helpful to document that vec_unique produces named output if the arguments are named. The help(vec_unique) page says that vec_unique is "Equivalent to unique()" and does not mention names. However, in unique() "No attributes are copied (so the result has no names)" (quoted from help(unique)).

Examples:

library(vctrs)
inputs <- list(
  named_first = vec_c(a = 1, 1),
  named_last = vec_c(1, a = 1)
)

str(lapply(inputs, vec_unique))
#> List of 2
#>  $ named_first: Named num 1
#>   ..- attr(*, "names")= chr "a"
#>  $ named_last : Named num 1
#>   ..- attr(*, "names")= chr ""
str(lapply(inputs, unique))
#> List of 2
#>  $ named_first: num 1
#>  $ named_last : num 1

Created on 2021-08-29 by the reprex package (v2.0.1)

lionel- commented 3 years ago

We should either strip them or document them as order-unstable. Probably best to strip them to avoid programming mistakes? Also should be a little faster.
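
Until this is decided, a caller who wants output shaped like unique() can drop the names explicitly. A minimal sketch, assuming only the behaviour shown in the reprex above:

```r
library(vctrs)

x <- vec_c(1, a = 1)

# vec_unique() keeps the name attached to the first occurrence of each value;
# unname() drops the names so the result has the same shape as base unique().
stripped <- unname(vec_unique(x))

identical(stripped, unique(x))
```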

DavisVaughan commented 2 years ago

Reopening because dplyr relies on this behavior right now in relocate(), see https://github.com/r-lib/vctrs/pull/1545

DavisVaughan commented 2 years ago

The other idea is to document vec_unique_loc() and vec_unique() as order stable, and just say that the unique value is the first occurrence of that value (similar to the way vec_match() works).

I now feel like it might be useful to be able to program with a guarantee that you get the first occurrence back, along with any names attached to it (i.e. the way it currently works). It has historically been useful with vec_match() to have this guarantee. I can see a case sort of like relocate() where you accept a tidyselection through ... that might allow for renaming, and you want to keep the first occurrence (or last, by reversing first), like verb(df, foo = bar, foo2 = bar), keeping only the foo = bar.

It also seems inconsistent that we could just do vec_slice(x, vec_unique_loc(x)) as a "work around" if we make it drop names. Like, if that works and has a guarantee that it returns the first occurrence, then we might as well document vec_unique() as doing the same thing.
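
To make the "work around" concrete, here is a small sketch of the slicing approach on a named vector. The input vector is made up for illustration; the two functions are the ones named above.

```r
library(vctrs)

x <- c(a = 1, b = 2, c = 1)

# Locations of the first occurrence of each distinct value
loc <- vec_unique_loc(x)

# Slicing by those locations carries along the names of the first occurrences
first <- vec_slice(x, loc)
names(first)
```

Because the names come along with the slice, the name of the duplicate at position 3 never appears in the result.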

DavisVaughan commented 2 years ago

And we could go a step further by adding a which = c("first", "last") argument to vec_unique_loc(), which would complement https://github.com/r-lib/vctrs/issues/1239 nicely

# I think:
vec_unique_loc(x, which = "last") == which(!vec_duplicate_detect(x, ignore = "last"))

That would be useful for the verb(df, foo = bar, foo2 = bar) idea if you want to retain only the last duplicate the user provided, i.e. foo2 = bar, which might be more common

vec_unique(x, which = "last") would be useful here: https://github.com/tidyverse/dplyr/pull/6340/files#diff-50c26f2dac897f93822568b679c701271eb374752efb575ce2ded61a30527ea7R62
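
Since the which argument does not exist yet, the "last, by reversing first" idea can be sketched with the current API. unique_loc_last() below is a hypothetical helper written for this illustration, not a vctrs function, and it returns locations in increasing order on the assumption that this is the ordering a which = "last" argument would use.

```r
library(vctrs)

# Hypothetical helper, not part of vctrs: locations of the last occurrence
# of each distinct value, returned in increasing order.
unique_loc_last <- function(x) {
  n <- vec_size(x)
  reversed <- vec_slice(x, rev(seq_len(n)))
  # First occurrences in the reversed vector are last occurrences in `x`
  loc <- vec_unique_loc(reversed)
  # Map the reversed positions back to positions in `x`
  sort(n + 1L - loc)
}

x <- c(foo = 1, bar = 2, foo2 = 1)
loc <- unique_loc_last(x)

# Keeps `bar` and `foo2`, dropping the earlier duplicate `foo`
vec_slice(x, loc)
```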

lionel- commented 2 years ago

Good idea. It'd be nice to have a way to attach other types of metadata than character vectors so that the stability and the which argument may be used more generally. For this we'd have to:

Just an idea though.

Edit: Just realised we could probably do without df_unique() and just handle metadata columns specially in vec_unique().

DavisVaughan commented 2 years ago

I'm not sure I understand this metadata thing, can you give me an example?

lionel- commented 2 years ago

Names are a parallel data structure. They are processed separately and don't enter into account to determine uniqueness.

extra <- c("a", "b", "c")
x <- c(1, 2, 1)

vec <- rlang::set_names(x, extra)
vctrs::vec_unique(vec)
#> a b
#> 1 2

However names can only be a character vector. To attach other types of auxiliary data we need a data frame. But then the extra data is taken into account by vec_unique():

# Current behaviour
df <- data.frame(x = x, .extra = extra)
vctrs::vec_unique(df)
#>   x .extra
#> 1 1      a
#> 2 2      b
#> 3 1      c

So we could instead treat a specially named column as data to ignore, just like if they were names:

# Suggested behaviour
vctrs::vec_unique(df)
#>   x .extra
#> 1 1      a
#> 2 2      b

I don't think we should implement this now though. I'm just noting the pattern in case it appears elsewhere so that we could derive something more systematically useful.

DavisVaughan commented 2 years ago

I would store the metadata in an attribute and make the vec_proxy() method promote it to a data frame column (so it gets sliced). And the vec_proxy_equal() method would just use the "core" data object.

library(vctrs)

x <- c(1, 2, 1)
x <- new_vctr(x, metadata = data_frame(foo = c("a", "b", "c")), class = "meta")

vec_proxy.meta <- function(x, ...) {
  # Promote metadata to a df-column for slicing purposes
  metadata <- attr(x, "metadata")
  x <- rlang:::unstructure(x)
  data_frame(x = x, metadata = metadata)
}

vec_restore.meta <- function(x, to, ...) {
  # Undo the proxy
  metadata <- x$metadata
  x <- x$x
  new_vctr(x, metadata = metadata, class = "meta")
}

vec_proxy_equal.meta <- function(x, ...) {
  # Defaults to `vec_proxy()` method so we override that to only use the core data
  rlang:::unstructure(x)
}

# Print method is weird because `format.vctrs_vctr` calls `vec_data()` which
# uses `vec_proxy()` and I'm not sure that is right for printing purposes.
x
#> <meta[3]>
#>      
#> 1 1 a
#> 2 2 b
#> 3 1 c

vec_proxy(x)
#>   x foo
#> 1 1   a
#> 2 2   b
#> 3 1   c

vec_restore(vec_proxy(x), x)
#> <meta[3]>
#>      
#> 1 1 a
#> 2 2 b
#> 3 1 c

vec_proxy_equal(x)
#> [1] 1 2 1

# Using `vec_proxy_equal()`
vec_unique(x)
#> <meta[2]>
#>      
#> 1 1 a
#> 2 2 b

# Using `vec_proxy()`
vec_slice(x, c(1, 3, 1, 2))
#> <meta[4]>
#>      
#> 1 1 a
#> 2 1 c
#> 3 1 a
#> 4 2 b

Created on 2022-07-20 by the reprex package (v2.0.1)

eutwt commented 2 years ago

For reference data.table has a by argument to the unique.data.table function that does something like what's described by @lionel- (if I'm following), but with the data/metadata distinction provided by the user as an argument instead of determined from column names. Personally I've found this function useful.

library(data.table)
dt <- data.table(a = c(1, 2, 1), b = c('a', 'b', 'c'))

unique(dt, by = 'a')
#>        a      b
#>    <num> <char>
#> 1:     1      a
#> 2:     2      b
unique(dt, by = 'a', fromLast = TRUE)
#>        a      b
#>    <num> <char>
#> 1:     2      b
#> 2:     1      c

Created on 2022-07-20 by the reprex package (v2.0.1)

DavisVaughan commented 2 years ago

@eutwt in my head that was the reason we exposed vec_unique_loc(), so you can do this:

library(vctrs)

df <- data_frame(a = c(1, 2, 1), b = c('a', 'b', 'c'))

vec_slice(df, vec_unique_loc(df["a"]))
#>   a b
#> 1 1 a
#> 2 2 b

From a developer point of view I feel like that API is pretty nice, and separates powers well without adding more arguments

lionel- commented 2 years ago

I don't think it is practical to create a one-off class and register methods just to perform a data operation. Maybe you misunderstood what I meant by metadata? This would be the sort of data that you store in a bare data frame.
