tidyverse / purrr

A functional programming toolkit for R
https://purrr.tidyverse.org/

Feature request: function for splitting / unflattening lists #1127

Open prototaxites opened 4 months ago

prototaxites commented 4 months ago

{purrr} provides list_flatten(), which takes a list and removes a single layer of hierarchy. It would be quite useful to be able to do the reverse and "unflatten" a list or vector into a list of lists. An exact inverse of a flatten operation is impossible in general, because flattening discards the original grouping, but the operation could be usefully implemented by letting the user specify either a grouping vector (a character or numeric vector the same length as the list to "unflatten") or a chunk size at which to aggregate.
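To illustrate why a flatten cannot be undone without extra information, here is a minimal sketch (assuming purrr >= 1.0.0, where list_flatten() is available). The lists `a` and `b` are made-up examples: they are nested differently, yet both flatten to the same three-element list.

```r
library(purrr)

# Two differently nested lists (illustrative values only)
a <- list(list(1, 2), list(3))
b <- list(list(1), list(2, 3))

# Both flatten to list(1, 2, 3), so the original grouping is lost
flat_a <- list_flatten(a)
flat_b <- list_flatten(b)
identical(flat_a, flat_b)
```

This is why any "unflatten" needs the grouping to be supplied separately, as a grouping vector or a chunk size.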

This would be useful in cases when a user has a list or vector, and a function that is able to operate on subsets of that vector, rather than solely individual elements, and especially when there might be a useful speed gain operating over chunks but not over the whole vector at once.

For example, in my case I have a vector x of column indices of a matrix and want to do some matrix multiplication. I can do this in a single operation without purrr, but for a large matrix this is likely to be slow. I can also do it column-wise by mapping over x, but this can be slow too, depending on the number of columns. It would be useful to be able to split x into a list of equal-sized chunks to find an optimum chunk size for the computation, before combining the final output with reduce(). (Note that in my case the full computation is very slow because I am using rvar types from the {posterior} package rather than scalars.)

library(purrr)

beta <- matrix(rnorm(100000, 0, 1), ncol = 10000)
mat <- matrix(runif(10000, 0, 1), ncol = 10)
x <- 1:10000

## single computation
dim(mat %*% beta)
# [1] 1000 10000

## completely split computation
dim(
  map(x, \(y) mat %*% beta[, y]) |>
    reduce(cbind)
)
# [1] 1000 10000

## chunked computation - chunks of size 20
chunk_size <- 20
z <- seq_along(x)
chunks <- split(x, ceiling(z/chunk_size))

dim(
  map(chunks, \(y) mat %*% beta[, y]) |>
    reduce(cbind)
)
# [1] 1000 10000
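
The search for an optimum chunk size could be sketched as a small timing loop. This is only a rough illustration under some assumptions: it uses base R (lapply() and do.call(cbind, ...)) instead of map() and reduce() so that it has no dependencies, it uses smaller matrices than the example above so it runs quickly, and the candidate chunk sizes are arbitrary. `chunked_product()` is a hypothetical helper name. Timings depend on the machine, so none are shown.

```r
set.seed(1)

# Smaller matrices than in the example above so the sketch runs quickly
beta <- matrix(rnorm(10 * 2000), ncol = 2000)  # 10 x 2000
mat  <- matrix(runif(1000 * 10), ncol = 10)    # 1000 x 10
x    <- seq_len(ncol(beta))

# Hypothetical helper: multiply chunk by chunk, then bind the pieces
# back together column-wise
chunked_product <- function(chunk_size) {
  chunks <- split(x, ceiling(seq_along(x) / chunk_size))
  pieces <- lapply(chunks, function(cols) mat %*% beta[, cols, drop = FALSE])
  do.call(cbind, unname(pieces))
}

# Candidate chunk sizes (arbitrary values chosen for illustration)
chunk_sizes <- c(1, 20, 200, 2000)

# Elapsed seconds for each chunk size
timings <- vapply(
  chunk_sizes,
  function(n) system.time(chunked_product(n))[["elapsed"]],
  numeric(1)
)
names(timings) <- chunk_sizes
timings

# Sanity check: a chunked result should match the single computation
all.equal(chunked_product(20), mat %*% beta)
```

For a real benchmark, a dedicated timing package and repeated runs would give more reliable numbers than a single system.time() call.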

The proposal would add something like the following:

## split into even-sized chunks
chunks <- list_split(x, n = 20)
# [[1]]
# [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
# 
# [[2]]
# [1] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
# 
# [[3]]
# [1] 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

# some random grouping vector, one id per element of x - this is similar to how base split() works now
group_vec <- sample(letters[1:3], length(x), replace = TRUE)
chunks <- list_split(x, groups = group_vec)
# $a
# [1]  10  12  16  28  34  41  51  52  60  65  68  70  71  72  73  78  83  90...
# 
# $b
# [1]  1  2  3  4  5  8 11 15 17 19 23 29 33 35 38 40 42 43 44 47 49 54 55...
# 
# $c
# [1]  6  7  9 13 14 18 20 21 22 24 25 26 27 30 31 32 36 37 39 45 46 48...

map(chunks, some_function)

I see a similar function was proposed and closed here: https://github.com/tidyverse/purrr/issues/274, but I think this proposal differs in that it is specifically about splitting/unflattening lists rather than data frame rows.

hadley commented 1 month ago

I think that the implementation of this would be relatively straightforward since you could use vec_chop(); you'd just have to figure out how to generate the right list of indices. And it'll require some thinking about the interface, since you might want to provide the number of groups, the group size, or an actual vector of group ids.
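
As a rough sketch of that idea, the three interface options could each be reduced to a list of index vectors and handed to vctrs::vec_chop(). Everything about the interface here is an assumption for illustration: list_split() does not exist in purrr, and the argument names `n_groups`, `size`, and `groups` are hypothetical (the proposal above used `n` for the chunk size). One behavioural difference to note: vctrs::vec_group_loc() returns groups in order of first appearance, whereas base split() sorts them.

```r
library(vctrs)

# Hypothetical helper, NOT part of purrr; argument names are assumptions
list_split <- function(x, n_groups = NULL, size = NULL, groups = NULL) {
  supplied <- !c(is.null(n_groups), is.null(size), is.null(groups))
  if (sum(supplied) != 1) {
    stop("Supply exactly one of `n_groups`, `size`, or `groups`.")
  }
  len <- vec_size(x)

  if (!is.null(groups)) {
    if (vec_size(groups) != len) {
      stop("`groups` must be the same length as `x`.")
    }
    # One integer vector of positions per distinct group id,
    # in order of first appearance (unlike split(), which sorts)
    locs <- vec_group_loc(groups)
    out <- vec_chop(x, indices = locs$loc)
    names(out) <- as.character(locs$key)
    return(out)
  }

  positions <- seq_len(len)
  chunk_id <- if (!is.null(size)) {
    # Fixed chunk size: ids run 1, 1, ..., 2, 2, ...
    ceiling(positions / size)
  } else {
    # Fixed number of groups: spread positions as evenly as possible
    ((positions - 1) * n_groups) %/% len + 1
  }

  # split() on the positions yields one index vector per chunk
  indices <- unname(split(positions, chunk_id))
  vec_chop(x, indices = indices)
}

x <- 1:10
by_size   <- list_split(x, size = 4)      # chunks of 4, 4, and 2 elements
by_count  <- list_split(x, n_groups = 2)  # two chunks of 5 elements
by_groups <- list_split(x, groups = rep(c("b", "a"), 5))
names(by_groups)                          # "b" comes first: order of appearance
```

Requiring exactly one of the three arguments sidesteps the ambiguity of a single `n` argument, though other designs (for example, separate functions) would work too.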