r-lib / vctrs

Generic programming with typed R vectors
https://vctrs.r-lib.org
Other
285 stars 66 forks source link

vec_get() #141

Open DavisVaughan opened 5 years ago

DavisVaughan commented 5 years ago

As a complement to vec_slice() @romainfrancois and I have been discussing the utility of a function that would extract 1 observation, but would be more of an analog to [[ than [.

Possible implementation:

```r vec_rip <- function (x, i) { if (is.logical(i)) { i <- which(i) } stopifnot(is.integer(i) || is.character(i)) # this is new stopifnot(length(i) == 1L) if (is.null(x)) { NULL } else if (is.data.frame(x)) { out <- lapply(x, `[`, i) vec_restore(out, x) } else if (is_vector(x)) { d <- vec_dims(x) if (d == 1) { # this is all that changed? if (is.object(x)) { # x[i] x[[i]] } else { .subset2(x, i) # x[[i, drop = FALSE]] } } else if (d == 2) { x[i, , drop = FALSE] } else { miss_args <- rep(list(missing_arg()), d - 1) eval_bare(expr(x[i, !!!miss_args, drop = FALSE])) } } else { stop("`x` must be a vector", call. = FALSE) } } ```


It would function somewhat like:

vec_rip(list(x = 1, y = 2), 1L)
#> [1] 1
vec_slice(list(x = 1, y = 2), 1L)
#> $x
#> [1] 1

# Not sure about what these should do:
vec_rip(mtcars, 1L)
#>   mpg cyl disp  hp drat   wt  qsec vs am gear carb
#> 1  21   6  160 110  3.9 2.62 16.46  0  1    4    4
vec_rip(matrix(1:12, nrow = 3), 1L)
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    4    7   10

vec_slice(starwars$films, 1L)
#> [[1]]
#> [1] "Revenge of the Sith"     "Return of the Jedi"     
#> [3] "The Empire Strikes Back" "A New Hope"             
#> [5] "The Force Awakens"

vec_rip(starwars$films, 1L)
#> [1] "Revenge of the Sith"     "Return of the Jedi"     
#> [3] "The Empire Strikes Back" "A New Hope"             
#> [5] "The Force Awakens"

# vec_rip() can only extract 1 observation at a time
vec_slice(starwars$films, 1:2)
#> [[1]]
#> [1] "Revenge of the Sith"     "Return of the Jedi"     
#> [3] "The Empire Strikes Back" "A New Hope"             
#> [5] "The Force Awakens"      
#> 
#> [[2]]
#> [1] "Attack of the Clones"    "The Phantom Menace"     
#> [3] "Revenge of the Sith"     "Return of the Jedi"     
#> [5] "The Empire Strikes Back" "A New Hope"

vec_rip(starwars$films, 1:2)
#> Error in vec_rip(starwars$films, 1:2): length(i) == 1L is not TRUE

~We are currently undecided on what it "should" do for data frames and matrices. A couple ideas:~

~1) Return the 1 row observation as is (so the same as vec_slice(), this is shown above). This doesn't feel right. 2) Extract the 1 row observation, then coerce it to some lower level type. For data.frames, a list and for matrices, a vector.~

~If 2) is chosen, one question that came up is "what should it do for list columns"? Two possibilities:~

data("starwars", package = "dplyr")
starwars <- starwars[,c("species", "films")]
one_ob <- vctrs::vec_slice(starwars, 1L)

# don't drop a dimension
as.list(one_ob)
#> $species
#> [1] "Human"
#> 
#> $films
#> $films[[1]]
#> [1] "Revenge of the Sith"     "Return of the Jedi"     
#> [3] "The Empire Strikes Back" "A New Hope"             
#> [5] "The Force Awakens"

# drop 1 dimension
lapply(one_ob, .subset2, 1)
#> $species
#> [1] "Human"
#> 
#> $films
#> [1] "Revenge of the Sith"     "Return of the Jedi"     
#> [3] "The Empire Strikes Back" "A New Hope"             
#> [5] "The Force Awakens"

# but with better support for this
# so it doesn't unravel (i.e. better support 
# for vctr rcrd types)
.subset2(as.POSIXlt(Sys.time()), 1)
#> [1] 26.99161
romainfrancois commented 5 years ago

This would be useful in purrr::map() situations when you iterate on some custom vctr or rcrd type

romainfrancois commented 5 years ago

Perhaps related to https://github.com/tidyverse/dplyr/issues/3789

hadley commented 5 years ago

To get this included in vctrs, you need to answer the question: What are the invariants that this function would satisfy?

jennybc commented 5 years ago

A word I would like to insert into this discussion is "atom". As in, atomic vectors have atoms that are of the same type. I think maybe vec_rip() gets one atom? I don't have great words around it, but part of what makes list-columns hard for people is that it often feels like you're dealing with one extra layer of indexing. I see the appeal of working this out.

romainfrancois commented 5 years ago

I've been staring at this for some time now, and yes this comes from iterating on list columns (from purrr, or my toy zap).

Not sure about this, here's a draft:

- atom_type(vctrs<T>) = vctrs<T>
- atom_type(list_of<T>) = <T>

- vec_size(atom_type(vctrs<T>)) == 1
- vec_size(atom_type(list_of<T>)) == <??>

- type( !!!atom_type<T>) = <T>
DavisVaughan commented 5 years ago

There was a bit of a misunderstanding between @romainfrancois and I regarding list column behavior. I've marked out the relevant info above.

At this point, I think the implementation is just:

vec_rip <- function(x, i) {
  stopifnot(length(i) == 1L)
  out <- vec_slice(x, i)

  if (rlang::is_bare_list(x)) {
    out <- out[[1]]
  }

  out
}

I think this all makes sense in the context of atoms if you think of them as "a basic building block." Notice that for "a list of elements", the building block is a single element and not a list of length 1 (which aligns with what vec_rip() is trying to achieve).

The basic building block of a vector of doubles is a single double.
The basic building block of a list of elements is a single element.
The basic building block of a data frame is a single row.
The basic building block of a matrix is ? (a row or a single entry?)
The basic building block of an array is ?

I'm a little unsure of what the "atom" of a matrix should be. 1 row would make sense, but every element in the matrix is of the same type so 1 single element might also make sense. Same for arrays.

The only other caveat is that a single rcrd should be treated as 1 object, and not as a list of elements.

romainfrancois commented 5 years ago

Could be vec_atom(). The way i think about this is this function should return something sensible for when you iterate on « observations » of the vctr type.

lionel- commented 5 years ago

At this point, I think the implementation is just:

Not quite because extracting an observation from an atomic vector should zap the names and all vector-related attributes.

I think instead of atom, we should think about observations. Then this function would better be called vec_obs() and would be what a purrr::map_obs() mapper would use to subset its input.

lionel- commented 5 years ago

Not quite because extracting an observation from an atomic vector should zap the names and all vector-related attributes.

Hmm... Is that true? What would an observation from a factor be in that case? A level?

factor(c("a", "b"))[[1]]
#> [1] a
#> Levels: a b

# Does that make sense?
vec_obs(factor(c("a", "b")), 1)
#> [1] "a"
lionel- commented 5 years ago

Taking an observation from a data frame is a type-preserving operation. Taking an observation from a list or list_of is not? This problem is harder than I thought. Maybe, somehow, because data frames are atomic (scalars), when taken as a whole?

Slight connection to why flatten(list(mtcars, mtcars)) is (or should be) a no-op, the list is already flat because mtcars is a scalar.

@jennybc's intuition that the notion atomicity is important was spot on.

Still it's not clear what parts of the type should be dropped/preserved when taking an observation of an atomic object. Should we trust the designers of S that vector names are dropped under [[? Has an observation of a matrix lost its row name?

lionel- commented 5 years ago

If a data frame can be seen as an atomic vector, when subsetting observations, this would imply that there should be a notion of missing row. Is a missing row a row full of missings? Do we need a predicate are_na() (using plural to denote vectorisation) that would return TRUE for rows that have only missing observations? Can it be that for atomic types implemented with recursive data structures, is_na(x) is every(x, is_na)?

Interestingly:

mtcars[1, ][NA, ]
#>    mpg cyl disp hp drat wt qsec vs am gear carb
#> NA  NA  NA   NA NA   NA NA   NA NA NA   NA   NA
hadley commented 5 years ago

(@lionel- that's just mtcars[NA_integer_, ] or vec_na(mtcars))

hadley commented 5 years ago

Also worth thinking about this case:

x <- package_version("1.1.1")
identical(x, x[[1]])
#> [1] TRUE

Created on 2019-01-28 by the reprex package (v0.2.1.9000)

krlmlr commented 5 years ago

The inverse operation seems interesting too. Would it be too bad if vec_c() wrapped all non-vector objects in a list? The current behavior seems suboptimal either way:

qr <- qr(lm(y ~ x, data.frame(x = 1, y = 1)))
class(qr)
#> [1] "qr"
typeof(qr)
#> [1] "list"
vctrs::vec_c(qr)
#> $qr
#>   (Intercept) x
#> 1           1 1
#> attr(,"assign")
#> [1] 0 1
#> 
#> $qraux
#> [1] 1 1
#> 
#> $pivot
#> [1] 1 2
#> 
#> $tol
#> [1] 1e-07
#> 
#> $rank
#> [1] 1

Created on 2019-01-28 by the reprex package (v0.2.1.9000)

hadley commented 5 years ago

Another thought: maybe we need to more explicitly talk about "container" types, and then vec_rip() would only apply to a container.

hadley commented 5 years ago

I think this operation should error if you're attempting to index into a vector that is not a recursive, e.g.:

vec_recursive <- function(x) UseMethod("vec_recursive")
vec_recursive.list_of <- function(x) TRUE
vec_recursive.data.frame <- function(x) TRUE
vec_recursive.default <- function(x) {
  if (is_bare_list(x)) {
    TRUE
  } else if (is_vector(x)) {
    FALSE
  } else {
    stop("Non-vector")
  }
}

vec_extract <- function(x, i) {
  stopifnot(length(i) == 1L)
  stopifnot(vec_recursive(x))

  vec_slice(x, i)[[1]]
}

This clearly distinguishes it from vec_slice() and I don't think the using [[ to strip attributes is important (or commonly used) enough to be worth mimicking.

Note that vec_extract() would always preserve the invariants of the container class, i.e. x %>% vec_extract(i) %>% vec_size() equals vec_size(x) if x is a data frame, and x %>% vec_extract(i) %>% vec_type() equals x %>% attr("ptype") is x is a list_of.

lionel- commented 5 years ago

I think vec_recursive() is not sufficient to determine atomicity. An S3 list can be:

If vec_recursive() returns TRUE, we know we have a vector, but if it returns FALSE we don't know for sure we have an atom.

We might need a more general vec_kind() generic that would return one of:

hadley commented 5 years ago

Why does that distinction matter for indexing?

lionel- commented 5 years ago

I don't know if it matters for extraction (though vec_is_recursive() shouldn't throw an error with non-vectors to make recursive extraction possible). But that distinction is necessary for vec_is() / vec_assert() without arguments. Otherwise, vec_assert(model_fit) won't throw.

DavisVaughan commented 5 years ago

I'm running into the case where it would be nice to have the assignment form of this, [[<-. I do think whatever vctrs solution we come up with for that should work with recursive and non-recursive types. For instance, I might be filling list elements, or I might be filling double elements.

vec_assign() is obviously not what I need to be using here. I want some generic function that can [[<- a single element in both lists and double vectors.

library(vctrs)

out <- vec_init(list(), 3)
elt <- 2:3

# cant use vec_assign, of course
vec_assign(out, 1, elt)
#> No common type for `value` <integer> and `x` <list>.

# this is what i want
out[[1]] <- elt
out
#> [[1]]
#> [1] 2 3
#> 
#> [[2]]
#> NULL
#> 
#> [[3]]
#> NULL

# but i also might "init" a dbl
out <- vec_init(double(), 3)
elt <- 2

# can use vec_assign
vec_assign(out, 1, 2)
#> [1]  2 NA NA

# and i can use this
out[[1]] <- 2
out
#> [1]  2 NA NA

Update) For data frames I think this needs to be able to replace a column, not a row.

romainfrancois commented 5 years ago

Looking into dplyr::summarise() for rowwise data frames, I'm going to need something like vec_rip()

lionel- commented 5 years ago

It'll probably be called vec_get(), but it's still not clear what the semantics should be.

@DavisVaughan suggested it would return rows as lists in case of a data frame, or as a vector of dimensionality n - 1 in case of an array (for instance a dimless vector if a matrix).

In that case, rowwise extraction in data frames becomes a recursive operation in the sense that you can dig deep into them by calling recursively [[ / vec_get(), in the same way you can dig deep into lists. Arrays become kinda recursive as well, though they can't return a different type.

Despite being recursive, this operation doesn't really connect to algorithms ordinarily used with recursive data structures such as flatten(). However, as Davis noted, vec_get(vec_get(x, i), j) on a 2d structure consistently retrieves the i, j element, which is interesting.

DavisVaughan commented 5 years ago

@romainfrancois, so vec_slice() wouldn't be enough for rowwise data frames? Could you share why not? Would returning the first row as a list like this be helpful? A use case might guide the discussion a bit

romainfrancois commented 5 years ago

Probably it's the whole rowwise() approach that makes it a problem. When we have a list column x, we want x to stand for an element of the list, not a length 1 list with the element

jennybc commented 5 years ago

When we have a list column x, we want x to stand for an element of the list, not a length 1 list with the element

This is what I was getting at way, way above:

A word I would like to insert into this discussion is "atom". As in, atomic vectors have atoms that are of the same type. I think maybe vec_rip() gets one atom? I don't have great words around it, but part of what makes list-columns hard for people is that it often feels like you're dealing with one extra layer of indexing.

DavisVaughan commented 5 years ago

What @romainfrancois does here has essentially the same implementation of what I used when attempting to reimplement flatten() with vctrs semantics. https://github.com/tidyverse/dplyr/commit/ebbd15cc877ab45505bae152423338eb19cccdc3#diff-ccca386aac53cf0029fb15ebff8901d5R206

The vec_get() I used: https://github.com/DavisVaughan/vctrs/blob/84e0800269805a89ee26c995d1d63ab3bcff2e8c/R/get.R#L5

Essentially you vec_slice() unless something is a list (but not a data frame), in which case you [[i]].

(This is different than what Lionel mentioned above about the other semantics I proposed)

DavisVaughan commented 4 years ago

I think that it would also be nice to have the equivalent of vec_chop() for whatever vec_get() ends up being.

So:

vec_slice(x, i) -> vec_chop(x, indices = NULL)

vec_get(x, pos) -> vec_*(x, positions = NULL)

positions would be NULL or a list of single positions like list(1, "x", 3)

It would be useful for vec_cast.list.data.frame() and vec_cast.list.default()