Closed DavisVaughan closed 5 years ago
Or maybe allow MARGIN
for backwards compat and translate it to axes
And need a unique()
method
Somewhat harder than I thought because we only allow axis
in the rray_unique()
functions but unique()
allows MARGIN
to have multiple values.
I'm not sure that rray_duplicate_*()
and rray_unique_*()
functions are right. They should probably work more like reducers. Allowing axes
and not dropping dimensions. I also don't think the rotation idea is right.
The distinction between MARGIN and its complementary (axes) is confusing to me. In base, MARGIN is oposite in unique() and duplicated() in my opinion. In rray (rray_sum for instance) the axes use to be the processed dimensions ( while the complementary dimensions stay unafected). But the "axis" in rrayduplicate*() seems more like a (length one) MARGIN to me. I might be missing some intuition on this issue. If you have a criteria to use in these distinctions I'd be glad to to hear it.
I think you're right. In rray_sum() it's definitely working as intended. I think it's wrong in rrayduplicate*() and the unique functions.
Generally, if you compare base apply()'s MARGIN and rray's axes parameters, I think you have already seen that they are complements. Eventually I'll document that explicitly in a vignette
Maybe base R's inconsistency in unique() and duplicated() is confusing me too. I'll take a closer look to see if they are actually inconsistent
Regarding rotation, I find it dificult to do things with rray_rotate(), and would always choose rray_transpose() ,which seems more simple to me, instead. Leaving appart its geometric interest, to me, rray_rotate(x,from,to) is just a transpose fliping the "to" axis, i.e., it is equivalent to: rray_transpose(rray_flip(x,to),SWP(1:vec_dims(x),from,to))) [ where by SWP() I mean swapping 2 elements in the vector, i.e.: SWP=function(x,a,b){res=x;res[[a]]<-x[[b]];res[[b]]<-x[[a]];res} ]
And this flip worries me a bit as a possible source of errors
I'm fairly certain this is how the duplicate functions should work. I had to make a small tweak to rray_split()
for it to work.
library(rray)
library(purrr)
library(vctrs)
x <- array(
c(
c(1.2, 1.3, 1.2, 1.3),
c(1.4, 1.4, 1.2, 1.6)
),
c(2, 2, 2)
)
x
#> , , 1
#>
#> [,1] [,2]
#> [1,] 1.2 1.2
#> [2,] 1.3 1.3
#>
#> , , 2
#>
#> [,1] [,2]
#> [1,] 1.4 1.2
#> [2,] 1.4 1.6
duplicate_splitter <- function(x, axes) {
if (is.null(axes)) {
split_axes <- integer()
}
else {
split_axes <- rev(seq_len(vec_dims(x))[-axes])
}
x_split <- rray_split(x, split_axes)
x_split_flat <- map(x_split, as.vector)
x_split_flat
}
rray_duplicate_any2 <- function(x, axes = NULL) {
x_split_flat <- duplicate_splitter(x, axes)
flat_res <- map_lgl(x_split_flat, vec_duplicate_any)
res <- rray:::keep_dims(flat_res, x, axes)
res
}
rray_duplicate_detect2 <- function(x, axes = NULL) {
x_split_flat <- duplicate_splitter(x, axes)
flat_res <- unlist(map(x_split_flat, vec_duplicate_detect), recursive = F, use.names = F)
res <- rray_reshape(flat_res, vec_dim(x))
res
}
rray_duplicate_id2 <- function(x, axes = NULL) {
x_split_flat <- duplicate_splitter(x, axes)
flat_res <- unlist(map(x_split_flat, vec_duplicate_id), recursive = F, use.names = F)
res <- rray_reshape(flat_res, vec_dim(x))
res
}
rray_duplicate_any2(x)
#> , , 1
#>
#> [,1]
#> [1,] TRUE
rray_duplicate_any2(x, 1L)
#> , , 1
#>
#> [,1] [,2]
#> [1,] FALSE FALSE
#>
#> , , 2
#>
#> [,1] [,2]
#> [1,] TRUE FALSE
rray_duplicate_any2(x, c(1, 2))
#> , , 1
#>
#> [,1]
#> [1,] TRUE
#>
#> , , 2
#>
#> [,1]
#> [1,] TRUE
rray_duplicate_any2(x, 2)
#> , , 1
#>
#> [,1]
#> [1,] TRUE
#> [2,] TRUE
#>
#> , , 2
#>
#> [,1]
#> [1,] FALSE
#> [2,] FALSE
rray_duplicate_detect2(x)
#> , , 1
#>
#> [,1] [,2]
#> [1,] TRUE TRUE
#> [2,] TRUE TRUE
#>
#> , , 2
#>
#> [,1] [,2]
#> [1,] TRUE TRUE
#> [2,] TRUE FALSE
rray_duplicate_detect2(x, 1L)
#> , , 1
#>
#> [,1] [,2]
#> [1,] FALSE FALSE
#> [2,] FALSE FALSE
#>
#> , , 2
#>
#> [,1] [,2]
#> [1,] TRUE FALSE
#> [2,] TRUE FALSE
rray_duplicate_detect2(x, c(1, 2))
#> , , 1
#>
#> [,1] [,2]
#> [1,] TRUE TRUE
#> [2,] TRUE TRUE
#>
#> , , 2
#>
#> [,1] [,2]
#> [1,] TRUE FALSE
#> [2,] TRUE FALSE
rray_duplicate_detect2(x, 2)
#> , , 1
#>
#> [,1] [,2]
#> [1,] TRUE TRUE
#> [2,] TRUE TRUE
#>
#> , , 2
#>
#> [,1] [,2]
#> [1,] FALSE FALSE
#> [2,] FALSE FALSE
rray_duplicate_id2(x)
#> , , 1
#>
#> [,1] [,2]
#> [1,] 1 1
#> [2,] 2 2
#>
#> , , 2
#>
#> [,1] [,2]
#> [1,] 5 1
#> [2,] 5 8
rray_duplicate_id2(x, 1L)
#> , , 1
#>
#> [,1] [,2]
#> [1,] 1 1
#> [2,] 2 2
#>
#> , , 2
#>
#> [,1] [,2]
#> [1,] 1 1
#> [2,] 1 2
rray_duplicate_id2(x, c(1, 2))
#> , , 1
#>
#> [,1] [,2]
#> [1,] 1 1
#> [2,] 2 2
#>
#> , , 2
#>
#> [,1] [,2]
#> [1,] 1 3
#> [2,] 1 4
rray_duplicate_id2(x, 2)
#> , , 1
#>
#> [,1] [,2]
#> [1,] 1 1
#> [2,] 1 1
#>
#> , , 2
#>
#> [,1] [,2]
#> [1,] 1 1
#> [2,] 2 2
Created on 2019-05-09 by the reprex package (v0.2.1.9000)
And this is how rray_unique()
and friends should work. It's a bit different because it can only take 1 axis, otherwise you can end up with ragged arrays. It essentially subsets along that axis, flattens all the results, and compares those. This is what numpy does too. I think it's the only sensible behavior. I think its actually what unique()
does too.
library(rray)
library(purrr)
library(vctrs)
y <- array(
c(
c(1.2, 1.3, 1.2, 1.6),
c(1.2, 1.4, 1.2, 1.6)
),
c(4, 1, 4)
)
y
#> , , 1
#>
#> [,1]
#> [1,] 1.2
#> [2,] 1.3
#> [3,] 1.2
#> [4,] 1.6
#>
#> , , 2
#>
#> [,1]
#> [1,] 1.2
#> [2,] 1.4
#> [3,] 1.2
#> [4,] 1.6
#>
#> , , 3
#>
#> [,1]
#> [1,] 1.2
#> [2,] 1.3
#> [3,] 1.2
#> [4,] 1.6
#>
#> , , 4
#>
#> [,1]
#> [1,] 1.2
#> [2,] 1.4
#> [3,] 1.2
#> [4,] 1.6
# `axis` here has to work a bit differently than it normally does
# you can think about it as slicing each individual axis, and then
# flattening the results, and comparing those. otherwise you can get
# ragged arrays
rray_unique_loc2 <- function(x, axis) {
x_split <- rray_split(x, axis)
# due to r-lib/vctrs#327
x_split_flat <- map(x_split, as.vector)
vec_unique_loc(x_split_flat)
}
rray_unique2 <- function(x, axis) {
loc <- rray_unique_loc2(x, axis)
rray_slice(x, loc, axis)
}
rray_unique_count <- function(x, axis) {
vec_size(rray_unique_loc2(x, axis))
}
rray_unique_loc2(y, 1)
#> [1] 1 2 4
rray_unique_loc2(y, 2)
#> [1] 1
rray_unique_loc2(y, 3)
#> [1] 1 2
rray_unique2(y, 1)
#> , , 1
#>
#> [,1]
#> [1,] 1.2
#> [2,] 1.3
#> [3,] 1.6
#>
#> , , 2
#>
#> [,1]
#> [1,] 1.2
#> [2,] 1.4
#> [3,] 1.6
#>
#> , , 3
#>
#> [,1]
#> [1,] 1.2
#> [2,] 1.3
#> [3,] 1.6
#>
#> , , 4
#>
#> [,1]
#> [1,] 1.2
#> [2,] 1.4
#> [3,] 1.6
rray_unique2(y, 2)
#> , , 1
#>
#> [,1]
#> [1,] 1.2
#> [2,] 1.3
#> [3,] 1.2
#> [4,] 1.6
#>
#> , , 2
#>
#> [,1]
#> [1,] 1.2
#> [2,] 1.4
#> [3,] 1.2
#> [4,] 1.6
#>
#> , , 3
#>
#> [,1]
#> [1,] 1.2
#> [2,] 1.3
#> [3,] 1.2
#> [4,] 1.6
#>
#> , , 4
#>
#> [,1]
#> [1,] 1.2
#> [2,] 1.4
#> [3,] 1.2
#> [4,] 1.6
rray_unique2(y, 3)
#> , , 1
#>
#> [,1]
#> [1,] 1.2
#> [2,] 1.3
#> [3,] 1.2
#> [4,] 1.6
#>
#> , , 2
#>
#> [,1]
#> [1,] 1.2
#> [2,] 1.4
#> [3,] 1.2
#> [4,] 1.6
rray_unique_count(y, 1)
#> [1] 3
rray_unique_count(y, 2)
#> [1] 1
rray_unique_count(y, 3)
#> [1] 2
Created on 2019-05-09 by the reprex package (v0.2.1.9000)
humm I think these (duplicate*) functions lack some transpositions.
But before I can go any further: Is this result correct?:
vec_duplicate_detect(rray(c(1,1,2,3),c(2,2)))
TRUE TRUE
Shouldn't it be F,F because c(1,2) is not duplicated with c(1,3)?
It's not! I actually just reported that over in vctrs, but with a vec_unique()
function. The same engine underlies them both. I currently get around that by flattening the input.
library(vctrs)
library(rray)
x <- rray(c(1,2,2,2), c(2, 2))
vec_duplicate_detect(x)
#> [1] FALSE FALSE
# vctrs only looks at the first element in the row
# to make the comparison, so this looks like a duplicate
x <- rray(c(1,1,2,3), c(2, 2))
vec_duplicate_detect(x)
#> [1] TRUE TRUE
# if you flatten them, it can tell the difference
vec_duplicate_detect(
list(
as.vector(x[1,]),
as.vector(x[2,])
)
)
#> [1] FALSE FALSE
Created on 2019-05-09 by the reprex package (v0.2.1.9000)
All right. Then, when it works as expected, I think these functions could also be done (I dont know if more or less eficiently than with purrr) with the following: a transpose to take the MARGINS to the head (keeping their order), a reshape to melt them in first dimension, the application of the vctrs function and then set the dim of the result as the original dim but with "axes" degenerated to 1, i.e.: {d<-dim(x); d[axes] <- 1; d}
Things to note about the duplicate functions:
rray_duplicate_any()
acts like a reducer (as rray_any()
does) so if you use it on a (2, 2, 2)
and set axes = 1
then the result should be (1, 2, 2)
.
The other two should maintain the same shape as the input (this is consistent with maintaining the same size, which is done in vctrs with vec_duplicate_id()
)
Yeah. I agree with the design , wich seems sound and intuitive. But I think you must transpose. Do you agree with the result of your last example?: rray_duplicate_id2(x, 2)
Shouldn't it be?: rray_duplicate_id2(x, 2)
If you prefer to stick with the purr implementation I think it would be something like this: rray_duplicate_id3 <- function(x, axes = NULL) { x_split_flat <- duplicate_splitter(x, axes) flat_res <- unlist(map(x_split_flat, vec_duplicate_id), recursive = F, use.names = F) res <- rray_reshape(flat_res, c(vec_dim(x)[axes],vec_dim(x)[-axes]) ) res <- rray_transpose(res, order(c(axes,seq_len( vec_dims(x))[-axes]))) res } (Well, I hope you do not take my word , this is tricky)
That's essentially where I ended up too. Why order()
?
Also, should it require that axes
are provided in a sorted increasing order? Otherwise the answers seem strange
library(rray)
library(purrr)
library(vctrs)
x <- array(
c(
c(1.2, 1.3, 1.2, 1.3),
c(1.4, 1.4, 1.2, 1.6)
),
c(2, 2, 2)
)
x
#> , , 1
#>
#> [,1] [,2]
#> [1,] 1.2 1.2
#> [2,] 1.3 1.3
#>
#> , , 2
#>
#> [,1] [,2]
#> [1,] 1.4 1.2
#> [2,] 1.4 1.6
duplicate_splitter <- function(x, axes) {
if (is.null(axes)) {
axes <- integer()
}
split_axes <- rev(seq_len(vec_dims(x))[-axes])
x_split <- rray_split(x, split_axes)
x_split_flat <- map(x_split, as.vector)
x_split_flat
}
duplicate_transpose <- function(res, x, axes) {
if (is.null(axes)) {
axes <- seq_len(vec_dims(x))
}
x_axes_seq <- seq_len(vec_dims(x))
axes_complement <- x_axes_seq[-axes]
rray_transpose(res, c(axes, axes_complement))
}
rray_duplicate_id2 <- function(x, axes = NULL) {
x_split_flat <- duplicate_splitter(x, axes)
flat_res <- unlist(map(x_split_flat, vec_duplicate_id), recursive = F, use.names = F)
res <- rray_reshape(flat_res, vec_dim(x))
res
}
rray_duplicate_id3 <- function(x, axes = NULL) {
x_split_flat <- duplicate_splitter(x, axes)
flat_res <- unlist(map(x_split_flat, vec_duplicate_id), recursive = F, use.names = F)
res <- rray_reshape(flat_res, vec_dim(x))
res <- duplicate_transpose(res, x, axes)
res
}
rray_duplicate_id2(x, 2)
#> , , 1
#>
#> [,1] [,2]
#> [1,] 1 1
#> [2,] 1 1
#>
#> , , 2
#>
#> [,1] [,2]
#> [1,] 1 1
#> [2,] 2 2
rray_duplicate_id3(x, 2)
#> , , 1
#>
#> [,1] [,2]
#> [1,] 1 1
#> [2,] 1 1
#>
#> , , 2
#>
#> [,1] [,2]
#> [1,] 1 2
#> [2,] 1 2
rray_duplicate_id2(x, c(1, 2))
#> , , 1
#>
#> [,1] [,2]
#> [1,] 1 1
#> [2,] 2 2
#>
#> , , 2
#>
#> [,1] [,2]
#> [1,] 1 3
#> [2,] 1 4
rray_duplicate_id2(x, c(2, 1))
#> , , 1
#>
#> [,1] [,2]
#> [1,] 1 1
#> [2,] 2 2
#>
#> , , 2
#>
#> [,1] [,2]
#> [1,] 1 3
#> [2,] 1 4
rray_duplicate_id3(x, c(1, 2))
#> , , 1
#>
#> [,1] [,2]
#> [1,] 1 1
#> [2,] 2 2
#>
#> , , 2
#>
#> [,1] [,2]
#> [1,] 1 3
#> [2,] 1 4
rray_duplicate_id3(x, c(2, 1))
#> , , 1
#>
#> [,1] [,2]
#> [1,] 1 2
#> [2,] 1 2
#>
#> , , 2
#>
#> [,1] [,2]
#> [1,] 1 1
#> [2,] 3 4
Created on 2019-05-09 by the reprex package (v0.2.1.9000)
order() gives the inverse of the permutation (that we have to undo). And yes, there is a missmatch between what I do and the splitter. I was wise when I put my disclaimer !
rray_duplicate_id4 <- function(x, axes = NULL) { x_split_flat <- duplicate_splitter(x, axes) flat_res <- unlist(map(x_split_flat, vec_duplicate_id), recursive = F, use.names = F)
res <- rray_reshape(flat_res, c(vec_dim(x)[sort(axes)],vec_dim(x)[-axes]) ) res <- rray_transpose(res, order(c(sort(axes),seq_len( vec_dims(x))[-axes]))) res } (I still dont bet my soul on this one)
Alright I think we've got a working version over in #157. It's going to be slow due mainly to rray_split()
, but hopefully I can speed all of that up later. For now, I'm just happy to have a working version, it was tricky to get right. Thanks for the help!
If you want it as reference for speed comparisons, I made this alternative to the most complex one: rray_duplicate_id() :
fast_id<-function(x,axes){
arr<-as.array(x)
naxes=setdiff(seq_len(vec_dims(x)),axes)
melt <- rray_reshape(rray_transpose(x,c(axes,naxes)) ,
c(prod(dim(x)[axes]),dim(x)[naxes]) )
if(vec_dims(melt) == 1 ) res <- match(melt, melt) else
res <- apply(melt,2:vec_dims(melt),function(i)match(i,i))
res <- rray_reshape(res,dim(x)[c(axes,naxes)])
rray_transpose(res,order(c(axes,naxes)))
}
Only arrays, no checking and a pretentious name, but the idea is to show what can can be achieved with base apply.
Playing with these two functions I stumbled with what is a surprising result for me.. It seems to be a degradation of the rray function as we increase the splitting part. I got similar figures several times, including the specially slow last meassure of rray_duplicate_id. The matching of the melted vector onto itself part is faster in rray as one would expect. I have almost no knowledge about performance measure, so I just post it here in case it seems interesting to you
library(rray)
library(vctrs)
fast_id<-function(x,axes){
arr<-as.array(x)
naxes=setdiff(seq_len(vec_dims(x)),axes)
melt <- rray_reshape(rray_transpose(x,c(axes,naxes)) ,
c(prod(dim(x)[axes]),dim(x)[naxes]) )
if(vec_dims(melt) == 1 ) res <- match(melt, melt) else
res <- apply(melt,2:vec_dims(melt),function(i)match(i,i))
res <- rray_reshape(res,dim(x)[c(axes,naxes)])
rray_transpose(res,order(c(axes,naxes)))
}
perf_comp <-function(){
bigarr <- rray(round(100*rnorm(1e7)),c(5,20,10,100,100))
print("using: rray(round(100*rnorm(1e7)),c(5,20,10,100,100))")
for(i in 5:1) {
print(paste("rray_duplicate_id vs fast_id melting first ", i, " dimensions"))
print(unname(as.vector(system.time(rray_duplicate_id(bigarr,1:i),TRUE))[1:3]))
print(unname(as.vector(system.time(fast_id(bigarr,1:i),TRUE))[1:3]))
}
}
perf_comp()
#> [1] "using: rray(round(100*rnorm(1e7)),c(5,20,10,100,100))"
#> [1] "rray_duplicate_id vs fast_id melting first 5 dimensions"
#> [1] 0.89 0.72 1.61
#> [1] 1.12 1.06 2.19
#> [1] "rray_duplicate_id vs fast_id melting first 4 dimensions"
#> [1] 0.83 0.75 1.57
#> [1] 1.18 0.94 2.13
#> [1] "rray_duplicate_id vs fast_id melting first 3 dimensions"
#> [1] 2.19 0.50 2.69
#> [1] 1.15 1.01 2.17
#> [1] "rray_duplicate_id vs fast_id melting first 2 dimensions"
#> [1] 14.80 0.39 15.22
#> [1] 1.76 0.51 2.28
#> [1] "rray_duplicate_id vs fast_id melting first 1 dimensions"
#> [1] 272.68 6.23 280.53
#> [1] 8.38 0.75 9.61
Created on 2019-05-12 by the reprex package (v0.2.1)
Much of the time seems to be spent in rray_split()
performing the map()
of the vec_restore()
. The actual splitting is pretty fast, all things considered:
> system.time(rray__split(bigarr, as_cpp_idx(2:5)), gcFirst = T)
user system elapsed
4.776 0.074 4.850
When vctrs exposes vec_restore()
for me to use at the C++ level, I expect this to improve because we won't have to make the extra copies that are currently happening. It's good to point it out though, so thanks!
I've decided to just wrap duplicated()
and anyDuplicated()
directly for the rray methods. This should match user's expectations. If they want "better" behavior they can use the rray duplicate functions directly.
, so thanks!
You are welcome. I am happy if I can be of any help.
With
axes
rather thanMARGIN