r-lib / rray

Simple Arrays
https://rray.r-lib.org
GNU General Public License v3.0
129 stars 12 forks source link

Add `anyDuplicated()` and `duplicated()` methods #149

Closed DavisVaughan closed 5 years ago

DavisVaughan commented 5 years ago

With axes rather than MARGIN

DavisVaughan commented 5 years ago

Or maybe allow MARGIN for backwards compat and translate it to axes

DavisVaughan commented 5 years ago

And need a unique() method

DavisVaughan commented 5 years ago

Somewhat harder than I thought because we only allow axis in the rray_unique() functions but unique() allows MARGIN to have multiple values.

DavisVaughan commented 5 years ago

I'm not sure that rray_duplicate_*() and rray_unique_*() functions are right. They should probably work more like reducers. Allowing axes and not dropping dimensions. I also don't think the rotation idea is right.

juangomezduaso commented 5 years ago

The distinction between MARGIN and its complementary (axes) is confusing to me. In base, MARGIN is oposite in unique() and duplicated() in my opinion. In rray (rray_sum for instance) the axes use to be the processed dimensions ( while the complementary dimensions stay unafected). But the "axis" in rrayduplicate*() seems more like a (length one) MARGIN to me. I might be missing some intuition on this issue. If you have a criteria to use in these distinctions I'd be glad to to hear it.

DavisVaughan commented 5 years ago

I think you're right. In rray_sum() it's definitely working as intended. I think it's wrong in rrayduplicate*() and the unique functions.

Generally, if you compare base apply()'s MARGIN and rray's axes parameters, I think you have already seen that they are complements. Eventually I'll document that explicitly in a vignette

DavisVaughan commented 5 years ago

Maybe base R's inconsistency in unique() and duplicated() is confusing me too. I'll take a closer look to see if they are actually inconsistent

juangomezduaso commented 5 years ago

Regarding rotation, I find it dificult to do things with rray_rotate(), and would always choose rray_transpose() ,which seems more simple to me, instead. Leaving appart its geometric interest, to me, rray_rotate(x,from,to) is just a transpose fliping the "to" axis, i.e., it is equivalent to: rray_transpose(rray_flip(x,to),SWP(1:vec_dims(x),from,to))) [ where by SWP() I mean swapping 2 elements in the vector, i.e.: SWP=function(x,a,b){res=x;res[[a]]<-x[[b]];res[[b]]<-x[[a]];res} ]

juangomezduaso commented 5 years ago

And this flip worries me a bit as a possible source of errors

DavisVaughan commented 5 years ago

I'm fairly certain this is how the duplicate functions should work. I had to make a small tweak to rray_split() for it to work.

library(rray)
library(purrr)
library(vctrs)

x <- array(
  c(
    c(1.2, 1.3, 1.2, 1.3),
    c(1.4, 1.4, 1.2, 1.6)
  ),
  c(2, 2, 2)
)

x
#> , , 1
#> 
#>      [,1] [,2]
#> [1,]  1.2  1.2
#> [2,]  1.3  1.3
#> 
#> , , 2
#> 
#>      [,1] [,2]
#> [1,]  1.4  1.2
#> [2,]  1.4  1.6

duplicate_splitter <- function(x, axes) {

  if (is.null(axes)) {
    split_axes <- integer()
  }
  else {
    split_axes <- rev(seq_len(vec_dims(x))[-axes])
  }

  x_split <- rray_split(x, split_axes)
  x_split_flat <- map(x_split, as.vector)
  x_split_flat
}

rray_duplicate_any2 <- function(x, axes = NULL) {
  x_split_flat <- duplicate_splitter(x, axes)
  flat_res <- map_lgl(x_split_flat, vec_duplicate_any)
  res <- rray:::keep_dims(flat_res, x, axes)
  res
}

rray_duplicate_detect2 <- function(x, axes = NULL) {
  x_split_flat <- duplicate_splitter(x, axes)
  flat_res <- unlist(map(x_split_flat, vec_duplicate_detect), recursive = F, use.names = F)
  res <- rray_reshape(flat_res, vec_dim(x))
  res
}

rray_duplicate_id2 <- function(x, axes = NULL) {
  x_split_flat <- duplicate_splitter(x, axes)
  flat_res <- unlist(map(x_split_flat, vec_duplicate_id), recursive = F, use.names = F)
  res <- rray_reshape(flat_res, vec_dim(x))
  res
}

rray_duplicate_any2(x)
#> , , 1
#> 
#>      [,1]
#> [1,] TRUE
rray_duplicate_any2(x, 1L)
#> , , 1
#> 
#>       [,1]  [,2]
#> [1,] FALSE FALSE
#> 
#> , , 2
#> 
#>      [,1]  [,2]
#> [1,] TRUE FALSE
rray_duplicate_any2(x, c(1, 2))
#> , , 1
#> 
#>      [,1]
#> [1,] TRUE
#> 
#> , , 2
#> 
#>      [,1]
#> [1,] TRUE
rray_duplicate_any2(x, 2)
#> , , 1
#> 
#>      [,1]
#> [1,] TRUE
#> [2,] TRUE
#> 
#> , , 2
#> 
#>       [,1]
#> [1,] FALSE
#> [2,] FALSE

rray_duplicate_detect2(x)
#> , , 1
#> 
#>      [,1] [,2]
#> [1,] TRUE TRUE
#> [2,] TRUE TRUE
#> 
#> , , 2
#> 
#>      [,1]  [,2]
#> [1,] TRUE  TRUE
#> [2,] TRUE FALSE
rray_duplicate_detect2(x, 1L)
#> , , 1
#> 
#>       [,1]  [,2]
#> [1,] FALSE FALSE
#> [2,] FALSE FALSE
#> 
#> , , 2
#> 
#>      [,1]  [,2]
#> [1,] TRUE FALSE
#> [2,] TRUE FALSE
rray_duplicate_detect2(x, c(1, 2))
#> , , 1
#> 
#>      [,1] [,2]
#> [1,] TRUE TRUE
#> [2,] TRUE TRUE
#> 
#> , , 2
#> 
#>      [,1]  [,2]
#> [1,] TRUE FALSE
#> [2,] TRUE FALSE
rray_duplicate_detect2(x, 2)
#> , , 1
#> 
#>      [,1] [,2]
#> [1,] TRUE TRUE
#> [2,] TRUE TRUE
#> 
#> , , 2
#> 
#>       [,1]  [,2]
#> [1,] FALSE FALSE
#> [2,] FALSE FALSE

rray_duplicate_id2(x)
#> , , 1
#> 
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    2
#> 
#> , , 2
#> 
#>      [,1] [,2]
#> [1,]    5    1
#> [2,]    5    8
rray_duplicate_id2(x, 1L)
#> , , 1
#> 
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    2
#> 
#> , , 2
#> 
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    1    2
rray_duplicate_id2(x, c(1, 2))
#> , , 1
#> 
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    2
#> 
#> , , 2
#> 
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    1    4
rray_duplicate_id2(x, 2)
#> , , 1
#> 
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    1    1
#> 
#> , , 2
#> 
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    2

Created on 2019-05-09 by the reprex package (v0.2.1.9000)

DavisVaughan commented 5 years ago

And this is how rray_unique() and friends should work. It's a bit different because it can only take 1 axis, otherwise you can end up with ragged arrays. It essentially subsets along that axis, flattens all the results, and compares those. This is what numpy does too. I think it's the only sensible behavior. I think its actually what unique() does too.

library(rray)
library(purrr)
library(vctrs)

y <- array(
  c(
    c(1.2, 1.3, 1.2, 1.6),
    c(1.2, 1.4, 1.2, 1.6)
  ),
  c(4, 1, 4)
)

y
#> , , 1
#> 
#>      [,1]
#> [1,]  1.2
#> [2,]  1.3
#> [3,]  1.2
#> [4,]  1.6
#> 
#> , , 2
#> 
#>      [,1]
#> [1,]  1.2
#> [2,]  1.4
#> [3,]  1.2
#> [4,]  1.6
#> 
#> , , 3
#> 
#>      [,1]
#> [1,]  1.2
#> [2,]  1.3
#> [3,]  1.2
#> [4,]  1.6
#> 
#> , , 4
#> 
#>      [,1]
#> [1,]  1.2
#> [2,]  1.4
#> [3,]  1.2
#> [4,]  1.6

# `axis` here has to work a bit differently than it normally does
# you can think about it as slicing each individual axis, and then
# flattening the results, and comparing those. otherwise you can get
# ragged arrays

rray_unique_loc2 <- function(x, axis) {
  x_split <- rray_split(x, axis)

  # due to r-lib/vctrs#327
  x_split_flat <- map(x_split, as.vector)

  vec_unique_loc(x_split_flat)
}

rray_unique2 <- function(x, axis) {
  loc <- rray_unique_loc2(x, axis)
  rray_slice(x, loc, axis)
}

rray_unique_count <- function(x, axis) {
  vec_size(rray_unique_loc2(x, axis))
}

rray_unique_loc2(y, 1)
#> [1] 1 2 4
rray_unique_loc2(y, 2)
#> [1] 1
rray_unique_loc2(y, 3)
#> [1] 1 2

rray_unique2(y, 1)
#> , , 1
#> 
#>      [,1]
#> [1,]  1.2
#> [2,]  1.3
#> [3,]  1.6
#> 
#> , , 2
#> 
#>      [,1]
#> [1,]  1.2
#> [2,]  1.4
#> [3,]  1.6
#> 
#> , , 3
#> 
#>      [,1]
#> [1,]  1.2
#> [2,]  1.3
#> [3,]  1.6
#> 
#> , , 4
#> 
#>      [,1]
#> [1,]  1.2
#> [2,]  1.4
#> [3,]  1.6
rray_unique2(y, 2)
#> , , 1
#> 
#>      [,1]
#> [1,]  1.2
#> [2,]  1.3
#> [3,]  1.2
#> [4,]  1.6
#> 
#> , , 2
#> 
#>      [,1]
#> [1,]  1.2
#> [2,]  1.4
#> [3,]  1.2
#> [4,]  1.6
#> 
#> , , 3
#> 
#>      [,1]
#> [1,]  1.2
#> [2,]  1.3
#> [3,]  1.2
#> [4,]  1.6
#> 
#> , , 4
#> 
#>      [,1]
#> [1,]  1.2
#> [2,]  1.4
#> [3,]  1.2
#> [4,]  1.6
rray_unique2(y, 3)
#> , , 1
#> 
#>      [,1]
#> [1,]  1.2
#> [2,]  1.3
#> [3,]  1.2
#> [4,]  1.6
#> 
#> , , 2
#> 
#>      [,1]
#> [1,]  1.2
#> [2,]  1.4
#> [3,]  1.2
#> [4,]  1.6

rray_unique_count(y, 1)
#> [1] 3
rray_unique_count(y, 2)
#> [1] 1
rray_unique_count(y, 3)
#> [1] 2

Created on 2019-05-09 by the reprex package (v0.2.1.9000)

juangomezduaso commented 5 years ago

humm I think these (duplicate*) functions lack some transpositions. But before I can go any further: Is this result correct?: vec_duplicate_detect(rray(c(1,1,2,3),c(2,2)))
TRUE TRUE Shouldn't it be F,F because c(1,2) is not duplicated with c(1,3)?

DavisVaughan commented 5 years ago

It's not! I actually just reported that over in vctrs, but with a vec_unique() function. The same engine underlies them both. I currently get around that by flattening the input.

library(vctrs)
library(rray)

x <- rray(c(1,2,2,2), c(2, 2))

vec_duplicate_detect(x)
#> [1] FALSE FALSE

# vctrs only looks at the first element in the row
# to make the comparison, so this looks like a duplicate
x <- rray(c(1,1,2,3), c(2, 2))

vec_duplicate_detect(x)
#> [1] TRUE TRUE

# if you flatten them, it can tell the difference
vec_duplicate_detect(
  list(
    as.vector(x[1,]),
    as.vector(x[2,])
  )
)
#> [1] FALSE FALSE

Created on 2019-05-09 by the reprex package (v0.2.1.9000)

juangomezduaso commented 5 years ago

All right. Then, when it works as expected, I think these functions could also be done (I dont know if more or less eficiently than with purrr) with the following: a transpose to take the MARGINS to the head (keeping their order), a reshape to melt them in first dimension, the application of the vctrs function and then set the dim of the result as the original dim but with "axes" degenerated to 1, i.e.: {d<-dim(x); d[axes] <- 1; d}

DavisVaughan commented 5 years ago

Things to note about the duplicate functions:

juangomezduaso commented 5 years ago

Yeah. I agree with the design , wich seems sound and intuitive. But I think you must transpose. Do you agree with the result of your last example?: rray_duplicate_id2(x, 2)

> , , 1

>

> [,1] [,2]

> [1,] 1 1

> [2,] 1 1

>

> , , 2

>

> [,1] [,2]

> [1,] 1 1

> [2,] 2 2

Shouldn't it be?: rray_duplicate_id2(x, 2)

> , , 1

>

> [,1] [,2]

> [1,] 1 1

> [2,] 1 1

>

> , , 2

>

> [,1] [,2]

> [1,] 1 2

> [2,] 1 2

juangomezduaso commented 5 years ago

If you prefer to stick with the purr implementation I think it would be something like this: rray_duplicate_id3 <- function(x, axes = NULL) { x_split_flat <- duplicate_splitter(x, axes) flat_res <- unlist(map(x_split_flat, vec_duplicate_id), recursive = F, use.names = F) res <- rray_reshape(flat_res, c(vec_dim(x)[axes],vec_dim(x)[-axes]) ) res <- rray_transpose(res, order(c(axes,seq_len( vec_dims(x))[-axes]))) res } (Well, I hope you do not take my word , this is tricky)

DavisVaughan commented 5 years ago

That's essentially where I ended up too. Why order()?

Also, should it require that axes are provided in a sorted increasing order? Otherwise the answers seem strange

DavisVaughan commented 5 years ago
library(rray)
library(purrr)
library(vctrs)

x <- array(
  c(
    c(1.2, 1.3, 1.2, 1.3),
    c(1.4, 1.4, 1.2, 1.6)
  ),
  c(2, 2, 2)
)

x
#> , , 1
#> 
#>      [,1] [,2]
#> [1,]  1.2  1.2
#> [2,]  1.3  1.3
#> 
#> , , 2
#> 
#>      [,1] [,2]
#> [1,]  1.4  1.2
#> [2,]  1.4  1.6

duplicate_splitter <- function(x, axes) {

  if (is.null(axes)) {
    axes <- integer()
  }

  split_axes <- rev(seq_len(vec_dims(x))[-axes])
  x_split <- rray_split(x, split_axes)
  x_split_flat <- map(x_split, as.vector)
  x_split_flat
}

duplicate_transpose <- function(res, x, axes) {
  if (is.null(axes)) {
    axes <- seq_len(vec_dims(x))
  }

  x_axes_seq <- seq_len(vec_dims(x))
  axes_complement <- x_axes_seq[-axes]
  rray_transpose(res, c(axes, axes_complement))
}

rray_duplicate_id2 <- function(x, axes = NULL) {
  x_split_flat <- duplicate_splitter(x, axes)
  flat_res <- unlist(map(x_split_flat, vec_duplicate_id), recursive = F, use.names = F)
  res <- rray_reshape(flat_res, vec_dim(x))
  res
}

rray_duplicate_id3 <- function(x, axes = NULL) {
  x_split_flat <- duplicate_splitter(x, axes)
  flat_res <- unlist(map(x_split_flat, vec_duplicate_id), recursive = F, use.names = F)
  res <- rray_reshape(flat_res, vec_dim(x))
  res <- duplicate_transpose(res, x, axes)
  res
}

rray_duplicate_id2(x, 2)
#> , , 1
#> 
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    1    1
#> 
#> , , 2
#> 
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    2

rray_duplicate_id3(x, 2)
#> , , 1
#> 
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    1    1
#> 
#> , , 2
#> 
#>      [,1] [,2]
#> [1,]    1    2
#> [2,]    1    2

rray_duplicate_id2(x, c(1, 2))
#> , , 1
#> 
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    2
#> 
#> , , 2
#> 
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    1    4
rray_duplicate_id2(x, c(2, 1))
#> , , 1
#> 
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    2
#> 
#> , , 2
#> 
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    1    4

rray_duplicate_id3(x, c(1, 2))
#> , , 1
#> 
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    2
#> 
#> , , 2
#> 
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    1    4
rray_duplicate_id3(x, c(2, 1))
#> , , 1
#> 
#>      [,1] [,2]
#> [1,]    1    2
#> [2,]    1    2
#> 
#> , , 2
#> 
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    3    4

Created on 2019-05-09 by the reprex package (v0.2.1.9000)

juangomezduaso commented 5 years ago

order() gives the inverse of the permutation (that we have to undo). And yes, there is a missmatch between what I do and the splitter. I was wise when I put my disclaimer !

rray_duplicate_id4 <- function(x, axes = NULL) { x_split_flat <- duplicate_splitter(x, axes) flat_res <- unlist(map(x_split_flat, vec_duplicate_id), recursive = F, use.names = F)

res <- rray_reshape(flat_res, c(vec_dim(x)[sort(axes)],vec_dim(x)[-axes]) ) res <- rray_transpose(res, order(c(sort(axes),seq_len( vec_dims(x))[-axes]))) res } (I still dont bet my soul on this one)

DavisVaughan commented 5 years ago

Alright I think we've got a working version over in #157. It's going to be slow due mainly to rray_split(), but hopefully I can speed all of that up later. For now, I'm just happy to have a working version, it was tricky to get right. Thanks for the help!

juangomezduaso commented 5 years ago

If you want it as reference for speed comparisons, I made this alternative to the most complex one: rray_duplicate_id() :

fast_id<-function(x,axes){
  arr<-as.array(x)
  naxes=setdiff(seq_len(vec_dims(x)),axes)
  melt <- rray_reshape(rray_transpose(x,c(axes,naxes))  ,
     c(prod(dim(x)[axes]),dim(x)[naxes]) )
  if(vec_dims(melt) == 1 ) res <- match(melt, melt) else
  res <- apply(melt,2:vec_dims(melt),function(i)match(i,i))
  res <- rray_reshape(res,dim(x)[c(axes,naxes)])
   rray_transpose(res,order(c(axes,naxes)))
}

Only arrays, no checking and a pretentious name, but the idea is to show what can can be achieved with base apply.

juangomezduaso commented 5 years ago

Playing with these two functions I stumbled with what is a surprising result for me.. It seems to be a degradation of the rray function as we increase the splitting part. I got similar figures several times, including the specially slow last meassure of rray_duplicate_id. The matching of the melted vector onto itself part is faster in rray as one would expect. I have almost no knowledge about performance measure, so I just post it here in case it seems interesting to you


library(rray)
library(vctrs)
fast_id<-function(x,axes){
  arr<-as.array(x)
  naxes=setdiff(seq_len(vec_dims(x)),axes)
  melt <- rray_reshape(rray_transpose(x,c(axes,naxes))  ,
                       c(prod(dim(x)[axes]),dim(x)[naxes]) )
  if(vec_dims(melt) == 1 ) res <- match(melt, melt) else
    res <- apply(melt,2:vec_dims(melt),function(i)match(i,i))
  res <- rray_reshape(res,dim(x)[c(axes,naxes)])
  rray_transpose(res,order(c(axes,naxes)))
}

perf_comp <-function(){
  bigarr <- rray(round(100*rnorm(1e7)),c(5,20,10,100,100))
  print("using: rray(round(100*rnorm(1e7)),c(5,20,10,100,100))")
  for(i in 5:1) {
    print(paste("rray_duplicate_id vs fast_id melting first ", i, " dimensions"))
    print(unname(as.vector(system.time(rray_duplicate_id(bigarr,1:i),TRUE))[1:3]))
    print(unname(as.vector(system.time(fast_id(bigarr,1:i),TRUE))[1:3]))
  }
}
perf_comp()
#> [1] "using: rray(round(100*rnorm(1e7)),c(5,20,10,100,100))"
#> [1] "rray_duplicate_id vs fast_id melting first  5  dimensions"
#> [1] 0.89 0.72 1.61
#> [1] 1.12 1.06 2.19
#> [1] "rray_duplicate_id vs fast_id melting first  4  dimensions"
#> [1] 0.83 0.75 1.57
#> [1] 1.18 0.94 2.13
#> [1] "rray_duplicate_id vs fast_id melting first  3  dimensions"
#> [1] 2.19 0.50 2.69
#> [1] 1.15 1.01 2.17
#> [1] "rray_duplicate_id vs fast_id melting first  2  dimensions"
#> [1] 14.80  0.39 15.22
#> [1] 1.76 0.51 2.28
#> [1] "rray_duplicate_id vs fast_id melting first  1  dimensions"
#> [1] 272.68   6.23 280.53
#> [1] 8.38 0.75 9.61

Created on 2019-05-12 by the reprex package (v0.2.1)

DavisVaughan commented 5 years ago

Much of the time seems to be spent in rray_split() performing the map() of the vec_restore(). The actual splitting is pretty fast, all things considered:

> system.time(rray__split(bigarr, as_cpp_idx(2:5)), gcFirst = T)
   user  system elapsed 
  4.776   0.074   4.850 

When vctrs exposes vec_restore() for me to use at the C++ level, I expect this to improve because we won't have to make the extra copies that are currently happening. It's good to point it out though, so thanks!

DavisVaughan commented 5 years ago

I've decided to just wrap duplicated() and anyDuplicated() directly for the rray methods. This should match user's expectations. If they want "better" behavior they can use the rray duplicate functions directly.

juangomezduaso commented 5 years ago

, so thanks!

You are welcome. I am happy if I can be of any help.