Open eutwt opened 3 years ago
We should either strip them or document them as order-unstable. Probably best to strip them to avoid programming mistakes? Also should be a little faster.
Reopening because dplyr relies on this behavior right now in relocate(), see https://github.com/r-lib/vctrs/pull/1545
The other idea is to document vec_unique_loc() and vec_unique() as order stable, and just say that the unique value is the first occurrence of that value (similar to the way vec_match() works).
I now feel like it might be useful to be able to program with a guarantee that you get the first occurrence back (and any names that are attached to it) (i.e. the way it currently works). It has historically been useful with vec_match() to have this guarantee. I can see a case sort of like relocate() where you accept a tidyselection through ... that might allow for renaming, and you want to keep the first occurrence (or last, by reversing first), like verb(df, foo = bar, foo2 = bar) - keeping only the foo = bar.
It also seems inconsistent that we could just do vec_slice(x, vec_unique_loc(x)) as a "work around" if we make it drop names. Like, if that works and has a guarantee that it returns the first occurrence, then we might as well document vec_unique() as doing the same thing.
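To make the equivalence concrete, here is a minimal sketch on a small named vector. The vector `x` is made up for illustration, and the sketch assumes the behaviour discussed in this thread, where `vec_unique()` keeps the names attached to the first occurrences.

```r
library(vctrs)

# Hypothetical input: the value 1 appears twice, under different names
x <- c(a = 1, b = 2, c = 1)

# Locations of the first occurrence of each distinct value
loc <- vec_unique_loc(x)

# Slicing by those locations carries the names of the first occurrences along
via_slice <- vec_slice(x, loc)

# With the name-preserving behaviour discussed here, both routes should agree
identical(vec_unique(x), via_slice)
```

If `vec_unique()` were changed to strip names, the two results would differ only in their names, which is the inconsistency raised above.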
And we could go a step further by adding a which = c("first", "last") argument to vec_unique_loc(), which would complement https://github.com/r-lib/vctrs/issues/1239 nicely
# I think:
vec_unique_loc(x, which = "last") == which(!vec_duplicate_detect(x, ignore = "last"))
That would be useful for the verb(df, foo = bar, foo2 = bar) idea if you want to retain only the last duplicate the user provided, i.e. foo2 = bar, which might be more common
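Until such an argument exists, "last occurrence" locations can be approximated with the current API by reversing first, as mentioned earlier. The following is only a sketch: `vec_unique_loc_last()` is a hypothetical helper, not a vctrs function, and returning the locations in increasing order is an assumption chosen to match the `which()` expression above.

```r
library(vctrs)

# Hypothetical helper, not part of vctrs: locations of the last occurrence
# of each distinct value, returned in increasing order
vec_unique_loc_last <- function(x) {
  n <- vec_size(x)
  # Reverse `x`, find first occurrences there, then map back to positions in `x`
  rev_loc <- vec_unique_loc(vec_slice(x, rev(seq_len(n))))
  sort(n + 1L - rev_loc)
}

# Made-up stand-in for a renaming selection such as `foo = bar, foo2 = bar`
x <- c(foo = "bar", foo2 = "bar", baz = "qux")

last_loc <- vec_unique_loc_last(x)

# Keeps `foo2` rather than `foo` for the duplicated value
vec_slice(x, last_loc)
```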
vec_unique(x, which = "last") would be useful here: https://github.com/tidyverse/dplyr/pull/6340/files#diff-50c26f2dac897f93822568b679c701271eb374752efb575ce2ded61a30527ea7R62
Good idea. It'd be nice to have a way to attach other types of metadata than character vectors so that the stability and the which argument may be used more generally. For this we'd have to:
Add df variants, e.g. df_unique().
Allow metadata columns in data frames. We'd ignore a column named e.g. .extra just like vec_unique() ignores names but still attaches them.
If vec_unique_loc() preserves names, then the result of df_unique_loc() should also preserve metadata and return a two-column df containing .loc and .extra.
Just an idea though.
Edit: Just realised we could probably do without df_unique() and just handle metadata columns specially in vec_unique().
I'm not sure I understand this metadata thing, can you give me an example?
Names are a parallel data structure. They are processed separately and don't enter into account to determine uniqueness.
extra <- c("a", "b", "c")
x <- c(1, 2, 1)
vec <- set_names(x, extra)
vctrs::vec_unique(vec)
#> a b
#> 1 2
However names can only be a character vector. To attach other types of auxiliary data we need a data frame. But then the extra data is taken into account by vec_unique():
# Current behaviour
df <- data.frame(x = x, .extra = extra)
vctrs::vec_unique(df)
#> x .extra
#> 1 1 a
#> 2 2 b
#> 3 1 c
So we could instead treat a specially named column as data to ignore, just as if it were names:
# Suggested behaviour
vctrs::vec_unique(df)
#> x .extra
#> 1 1 a
#> 2 2 b
I don't think we should implement this now though. I'm just noting the pattern in case it appears elsewhere so that we could derive something more systematically useful.
I would store the metadata in an attribute and make the vec_proxy() method promote it to a data frame column (so it gets sliced). And the vec_proxy_equal() method would just use the "core" data object.
library(vctrs)
x <- c(1, 2, 1)
x <- new_vctr(x, metadata = data_frame(foo = c("a", "b", "c")), class = "meta")
vec_proxy.meta <- function(x, ...) {
  # Promote metadata to a df-column for slicing purposes
  metadata <- attr(x, "metadata")
  x <- rlang:::unstructure(x)
  data_frame(x = x, metadata = metadata)
}
vec_restore.meta <- function(x, to, ...) {
  # Undo the proxy
  metadata <- x$metadata
  x <- x$x
  new_vctr(x, metadata = metadata, class = "meta")
}
vec_proxy_equal.meta <- function(x, ...) {
  # Defaults to `vec_proxy()` method so we override that to only use the core data
  rlang:::unstructure(x)
}
# Print method is weird because `format.vctrs_vctr` calls `vec_data()` which
# uses `vec_proxy()` and I'm not sure that is right for printing purposes.
x
#> <meta[3]>
#>
#> 1 1 a
#> 2 2 b
#> 3 1 c
vec_proxy(x)
#> x foo
#> 1 1 a
#> 2 2 b
#> 3 1 c
vec_restore(vec_proxy(x), x)
#> <meta[3]>
#>
#> 1 1 a
#> 2 2 b
#> 3 1 c
vec_proxy_equal(x)
#> [1] 1 2 1
# Using `vec_proxy_equal()`
vec_unique(x)
#> <meta[2]>
#>
#> 1 1 a
#> 2 2 b
# Using `vec_proxy()`
vec_slice(x, c(1, 3, 1, 2))
#> <meta[4]>
#>
#> 1 1 a
#> 2 1 c
#> 3 1 a
#> 4 2 b
Created on 2022-07-20 by the reprex package (v2.0.1)
For reference data.table has a by argument to the unique.data.table function that does something like what's described by @lionel- (if I'm following), but with the data/metadata distinction provided by the user as an argument instead of determined from column names. Personally I've found this function useful.
library(data.table)
dt <- data.table(a = c(1, 2, 1), b = c('a', 'b', 'c'))
unique(dt, by = 'a')
#> a b
#> <num> <char>
#> 1: 1 a
#> 2: 2 b
unique(dt, by = 'a', fromLast = TRUE)
#> a b
#> <num> <char>
#> 1: 2 b
#> 2: 1 c
Created on 2022-07-20 by the reprex package (v2.0.1)
@eutwt in my head that was the reason we exposed vec_unique_loc(), so you can do this:
library(vctrs)
df <- data_frame(a = c(1, 2, 1), b = c('a', 'b', 'c'))
vec_slice(df, vec_unique_loc(df["a"]))
#> a b
#> 1 1 a
#> 2 2 b
From a developer point of view I feel like that API is pretty nice, and separates powers well without adding more arguments
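The same building blocks can also cover the `fromLast = TRUE` case from the data.table example. This is a rough sketch only: `unique_by()` and its `by` and `from_last` arguments are hypothetical names that do not exist in vctrs, and grouping with `vec_group_loc()` is just one possible way to pick the last row per key.

```r
library(vctrs)

# Hypothetical helper, not part of vctrs: keep one row per distinct
# combination of the `by` columns, either the first or the last occurrence
unique_by <- function(df, by, from_last = FALSE) {
  # One row per distinct key, with a list-column of row locations
  groups <- vec_group_loc(df[by])
  pick <- if (from_last) {
    function(i) i[[length(i)]]
  } else {
    function(i) i[[1L]]
  }
  loc <- vapply(groups$loc, pick, integer(1))
  # Sort so that the kept rows stay in their original relative order
  vec_slice(df, sort(loc))
}

df <- data_frame(a = c(1, 2, 1), b = c("a", "b", "c"))

first_rows <- unique_by(df, "a")
last_rows <- unique_by(df, "a", from_last = TRUE)
```

Here `first_rows` keeps rows 1 and 2 and `last_rows` keeps rows 2 and 3, mirroring the two data.table calls above while leaving `vec_unique_loc()` itself free of extra arguments.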
I don’t think it is practical to create a one-off class and register methods just to perform a data operation. Maybe you misunderstood what I meant by metadata? This would be the sort of data that you store in a bare data frame.
On Wed, 20 Jul 2022 at 14:50, Davis Vaughan wrote:
I would store the metadata in an attribute and make the vec_proxy() method promote it to a data frame column. And the vec_proxy_equal() method would just use the "core" data object.
I think it would be helpful to document that vec_unique produces named output if the arguments are named. The help(vec_unique) page says that vec_unique is "Equivalent to unique()" and does not mention names. However, in unique() "No attributes are copied (so the result has no names)" (quoted from help(unique)).
Examples:
Created on 2021-08-29 by the reprex package (v2.0.1)
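The difference described above can be sketched as follows. The input vector is made up for illustration and is not the original reprex; the vctrs result assumes the name-preserving behaviour this issue is about.

```r
library(vctrs)

# Hypothetical named input with one repeated value
x <- c(first = 10, second = 20, third = 10)

# Base R copies no attributes, so the result has no names
base_result <- unique(x)
names(base_result)

# vctrs keeps the names attached to the first occurrences
vctrs_result <- vec_unique(x)
names(vctrs_result)
```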