Closed maelle closed 1 day ago
L559 can be deleted, the rest is more difficult.
value is a named list.
the code above is for the case where we don't have as many values as vertices. we create a list as long as the number of vertices. then the indices are used to fill the list elements. the other elements remain NULL.
surely there's an elegant way to do this?!
elegant and readable
first fill with NA, then make NA NULL? :thinking: no, they are NA in the current version.
a good thing is that the lines are covered by tests
first make the loop easier then use purrr::reduce?
tmp <- rep(NA, length(vs))
tmp[index] <- value[[i]]
value[[i]] <- tmp
oooh for value a vector, we get NA as fillers but for value a named list, we get NULL as fillers.
My attempt to isolate the essence of the coding:
vs <- c(1,2,3,4)
index <- c(1,4)
value <- list(list(a= 10, d = 14))
names(value) <- "color"
i <- 1
tmp <- value[[i]] # Copy structure of value.
length(tmp) <- 0 # Set list to empty, while removing names.
length(tmp) <- length(vs) # Extend to all vertices with NULL.
tmp[index] <- value[[i]] # Copy indexed elements,
value[[1]] <- tmp # into value.
names(value[[i]])
The effect of length(tmp) <- 0
is to remove the names initially present in value
.
When leaving out the statement the names in value
in this example are: [1] "a" "d" "" ""
.
In this example using the old logic the expression
>value[[1]]["a"]
will give
$<NA>
NULL
thereby blocking indexing by name. In the new logic the result will be:
$a
[1] 10
Removing the names could be a matter of defensive coding. All the code need is some explanation.
Create an empty named list.
L558, L559 are equivalent to:
tmp <- setNames(list(), character(0)) # Easy to read.
or alternatively
tmp <- list(dummy = 1)[NULL] # Faster.
or
tmp <- value[[i]][NULL] # As current coding.
To replace lines L558, L559, L560 by a single line:
tmp <- 'length<-'(list(dummy = 1)[NULL], length(vs))
thanks! @krlmlr helped me shorten the code, I'll soon finish up the related PR.
My thoughts. The proposed solution is indeed concise. It uses the match() function, which can be costly if the graph has many vertices. A similar solution can be achieved in base R by lapply() instead of map(). A base R solution using indexing by a vector is more verbose, but faster.
# -------------------------------
library(purrr)
library(microbenchmark)
n <- 1E5; m <- 1E2;
vs <- seq(n)
index <- seq(m)
names(index) <- index
value <- list(rev = as.list(rev(index)), nchar = as.list(nchar(index)) )
microbenchmark(
"This" = { value_nw1 <- value
for (i in seq_along(value)) {
tmp <- value[[i]]
length(tmp) <- 0
length(tmp) <- length(vs)
tmp[index] <- value[[i]]
value_nw1[[i]] <- tmp
}
},
"Purrr" = { value_nw2 <- purrr::map(value, ~.x[match(seq_along(vs), index)]) },
"Rmatch" = { value_nw3 <- lapply(value, function(x) { x[match(seq_along(vs), index)]} ) },
"Rbase" = { value_nw4 <-
lapply(
value,
function(x)
{ tmp <- x[NULL]; length(tmp) <- length(vs); tmp[index] <- x; tmp}
)
},
times= 100,
unit="relative"
)
# Unit: relative
# expr min lq mean median uq max neval
# This 9.187689 6.649874 4.097758 6.472774 5.603021 1.0953775 100
# Purrr 4.552082 3.618245 2.618367 3.620971 3.551356 0.9894297 100
# Rmatch 3.655005 3.473635 2.954367 3.458746 3.226840 1.0147469 100
# Rbase 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 100
identical(value_nw2, value_nw3)
# [1] TRUE
identical(value_nw1, value_nw4)
# [1] TRUE
note to self: use rlang's "purrr" standalone file
Thanks, good catch. Using match()
indeed causes an overhead. Let's go with lapply()
(or map()
from https://github.com/r-lib/rlang/blob/main/R/standalone-purrr.R) plus the Rbase variant plus comments why we chose this solution: don't want to slow down code that is used for many operations.
For continuous benchmarking, https://github.com/lorenzwalthert/touchstone does a decent job, but we'd need a good set of fast test cases -- can be small initially and grow over time.
For reference, my own analysis, with two other variants.
n <- 1E5
m <- 1E2
vs <- seq(n)
index <- seq(m)
names(index) <- index
value <- list(rev = as.list(rev(index)), nchar = as.list(nchar(index)))
set_value_at <- function(value, idx, length_out) {
out <- value[NULL]
length(out) <- length_out
out[idx] <- value
unname(out)
}
bench::mark(
"This" = {
value_nw1 <- value
for (i in seq_along(value)) {
tmp <- value[[i]]
length(tmp) <- 0
length(tmp) <- length(vs)
tmp[index] <- value[[i]]
value_nw1[[i]] <- tmp
}
lapply(value_nw1, unname)
},
"Rmatch" = {
value_nw2 <- lapply(value, function(x) {
unname(x[match(seq_along(vs), index)])
})
},
"vec_match" = {
value_nw2 <- lapply(value, function(x) {
unname(x[vctrs::vec_match(seq_along(vs), index)])
})
},
"value_at" = {
value_nw3 <- lapply(value, function(x) {
set_value_at(x, index, length(vs))
})
},
"Rbase" = {
value_nw4 <- lapply(
value,
function(x) {
tmp <- x[NULL]
length(tmp) <- length(vs)
tmp[index] <- x
tmp
}
)
lapply(value_nw4, unname)
}
)
#> # A tibble: 5 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 This 2.79ms 3.01ms 316. 8.79MB 68.1
#> 2 Rmatch 1.48ms 1.76ms 543. 4.42MB 125.
#> 3 vec_match 2.36ms 2.62ms 321. 4.99MB 72.9
#> 4 value_at 733.41µs 1.12ms 681. 4.61MB 176.
#> 5 Rbase 774µs 1.15ms 759. 4.59MB 206.
Created on 2024-04-09 with reprex v2.1.0
Looking at the reprex results, the self-contained function doesn't seem to be that bad. With a larger input:
n <- 1E6
m <- 1E3
vs <- seq(n)
index <- seq(m)
names(index) <- index
value <- list(rev = as.list(rev(index)), nchar = as.list(nchar(index)))
set_value_at <- function(value, idx, length_out) {
out <- value[NULL]
length(out) <- length_out
out[idx] <- value
unname(out)
}
bench::mark(
"This" = {
value_nw1 <- value
for (i in seq_along(value)) {
tmp <- value[[i]]
length(tmp) <- 0
length(tmp) <- length(vs)
tmp[index] <- value[[i]]
value_nw1[[i]] <- tmp
}
lapply(value_nw1, unname)
},
"Rmatch" = {
value_nw2 <- lapply(value, function(x) {
unname(x[match(seq_along(vs), index)])
})
},
"vec_match" = {
value_nw2 <- lapply(value, function(x) {
unname(x[vctrs::vec_match(seq_along(vs), index)])
})
},
"value_at" = {
value_nw3 <- lapply(value, function(x) {
set_value_at(x, index, length(vs))
})
},
"Rbase" = {
value_nw4 <- lapply(
value,
function(x) {
tmp <- x[NULL]
length(tmp) <- length(vs)
tmp[index] <- x
tmp
}
)
lapply(value_nw4, unname)
}
)
#> # A tibble: 5 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 This 17.47ms 18.5ms 51.2 50MB 79.6
#> 2 Rmatch 21.03ms 22.5ms 44.3 42.2MB 44.3
#> 3 vec_match 29.86ms 33.8ms 29.9 49.6MB 39.8
#> 4 value_at 7.85ms 9.5ms 104. 45.8MB 173.
#> 5 Rbase 9.34ms 11.5ms 89.6 45.8MB 122.
Created on 2024-04-09 with reprex v2.1.0
If no named list is needed, set_value_at()
can be simplified to
set_value_at2 <- function(value, idx, length_out) {
out <- vector(mode='list', length = length_out)
out[idx] <- value
out
}
Another idea: Using magrittr s pipe %>%, assuming dependency is not a drawback. Update: not faster.
#-------------------------------------------------------------------------------
library(purrr)
library(bench)
n <- 1E6
m <- 1E3
vs <- seq(n)
index <- seq(m)
names(index) <- index
value <- list(rev = as.list(rev(index)), nchar = as.list(nchar(index)))
set_value_at <- function(value, idx, length_out) {
out <- value[NULL]
length(out) <- length_out
out[idx] <- value
unname(out)
}
bench::mark(
"This" = {
value_nw1 <- value
for (i in seq_along(value)) {
tmp <- value[[i]]
length(tmp) <- 0
length(tmp) <- length(vs)
tmp[index] <- value[[i]]
value_nw1[[i]] <- tmp
}
lapply(value_nw1, unname)
},
"Rmatch" = {
value_nw2a <- lapply(value, function(x) {
unname(x[match(seq_along(vs), index)])
})
},
"vec_match" = {
value_nw2b <- lapply(value, function(x) {
unname(x[vctrs::vec_match(seq_along(vs), index)])
})
},
"value_at" = {
value_nw3a <- lapply(value, function(x) {
set_value_at(x, index, length(vs))
})
},
"pipe_at" = {
value_nw3b <- lapply(value, function(x) {
vector(mode='list', length = length(vs)) %>% {.[index] <- x; .}
# set_value_at(x, index, length(vs))
})
},
"Rbase" = {
value_nw4 <- lapply(
value,
function(x) {
tmp <- x[NULL]
length(tmp) <- length(vs)
tmp[index] <- x
tmp
}
)
lapply(value_nw4, unname)
}
)
# n = 1E6, m = 1E3
# A tibble: 6 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 This 23.9ms 24.37ms 40.1 45.8MB 45.8 7 8 175ms <named list [2]> <Rprofmem [91 × 3]> <bench_tm [15]> <tibble [15 × 3]>
# 2 Rmatch 22.6ms 22.75ms 43.8 42MB 73.0 6 10 137ms <named list [2]> <Rprofmem [49 × 3]> <bench_tm [16]> <tibble [16 × 3]>
# 3 vec_match 44.5ms 45.06ms 22.1 49.6MB 14.7 6 4 271ms <named list [2]> <Rprofmem [48 × 3]> <bench_tm [10]> <tibble [10 × 3]>
# 4 value_at 13.3ms 13.6ms 70.7 45.8MB 70.7 12 12 170ms <named list [2]> <Rprofmem [76 × 3]> <bench_tm [24]> <tibble [24 × 3]>
# 5 pipe_at 1.8ms 1.93ms 502. 15.3MB 82.9 115 19 229ms <named list [2]> <Rprofmem [42 × 3]> <bench_tm [134]> <tibble [134 × 3]>
# 6 Rbase 13ms 13.5ms 73.8 45.8MB 56.8 13 10 176ms <named list [2]> <Rprofmem [51 × 3]> <bench_tm [23]> <tibble [23 × 3]>
Magrittr s pipe %>% is not faster as I thought before. Using no names seems to be the difference.
library(purrr)
library(bench)
n <- 1E6; m <- 1E3
vs <- seq(n)
index <- seq(m)
names(index) <- index
value <- list(rev = as.list(rev(index)), nchar = as.list(nchar(index)))
set_value_at <- function(value, idx, length_out) {
out <- value[NULL]
length(out) <- length_out
out[idx] <- value
unname(out)
}
set_value_at2 <- function(value, idx, length_out) {
out <- vector(mode='list', length = length_out)
out[idx] <- value
out
}
bench::mark(
"at_unname()" = {
value_nw3a <- lapply(value, function(x) {
set_value_at(x, index, length(vs))
})
},
"at_noname()" = {
value_nw3a <- lapply(value, function(x) {
set_value_at2(x, index, length(vs))
})
},
"at_noname" = {
value_nw3a <- lapply(value, function(x) {
tmp <- vector(mode='list', length = length(vs)); tmp[index] <- x; tmp
})
},
"at_pipe" = {
value_nw3b <- lapply(value, function(x) {
vector(mode='list', length = length(vs)) %>% {.[index] <- x; .}
})
}
)
# A tibble: 4 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
# <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
# 1 at_unname… 13.31ms 13.7ms 73.3 45.8MB 178. 7 17 95.5ms
# 2 at_noname… 1.79ms 2.68ms 393. 15.3MB 174. 68 30 172.9ms
# 3 at_noname 1.79ms 2.98ms 365. 15.3MB 176. 60 29 164.6ms
# 4 at_pipe 1.81ms 2.87ms 396. 15.3MB 219. 56 31 141.5ms
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
see current state of #1330
https://github.com/igraph/rigraph/blob/36ad3c905d6b9dd34fe50b6b2f369ed090ba3c8b/R/attributes.R#L557-L564