igraph / rigraph

igraph R package
https://r.igraph.org
532 stars 200 forks source link

refactor puzzling loop in attributes.R #1327

Closed maelle closed 1 day ago

maelle commented 3 months ago

https://github.com/igraph/rigraph/blob/36ad3c905d6b9dd34fe50b6b2f369ed090ba3c8b/R/attributes.R#L557-L564

maelle commented 3 months ago

L559 can be deleted, the rest is more difficult.

maelle commented 3 months ago

value is a named list.

the code above is for the case where we don't have as many values as vertices. we create a list as long as the number of vertices. then the indices are used to fill the list elements. the other elements remain NULL.

surely there's an elegant way to do this?!

maelle commented 3 months ago

elegant and readable

maelle commented 3 months ago

first fill with NA, then make NA NULL? :thinking: no, they are NA in the current version.

maelle commented 3 months ago

a good thing is that the lines are covered by tests

maelle commented 3 months ago

first make the loop easier then use purrr::reduce?

maelle commented 3 months ago
      tmp <- rep(NA, length(vs))
      tmp[index] <- value[[i]]
      value[[i]] <- tmp
maelle commented 3 months ago

oooh for value a vector, we get NA as fillers but for value a named list, we get NULL as fillers.

clpippel commented 3 months ago

My attempt to isolate the essence of the coding:

vs           <- c(1,2,3,4)
index        <- c(1,4)
value        <- list(list(a= 10, d = 14))
names(value) <- "color"

i            <- 1
tmp          <- value[[i]]         # Copy structure of value.
length(tmp)  <- 0                  # Set list to empty, while removing names.
length(tmp)  <- length(vs)         # Extend to all vertices with NULL.
tmp[index]   <- value[[i]]         # Copy indexed elements,
value[[1]]   <- tmp                # into value.
names(value[[i]])

The effect of length(tmp) <- 0 is to remove the names initially present in value. When leaving out the statement the names in value in this example are: [1] "a" "d" "" "" .

In this example using the old logic the expression >value[[1]]["a"] will give

$<NA>
NULL

thereby blocking indexing by name. In the new logic the result will be:

$a
[1] 10

Removing the names could be a matter of defensive coding. All the code need is some explanation.

clpippel commented 3 months ago

Create an empty named list.

L558, L559 are equivalent to: tmp <- setNames(list(), character(0)) # Easy to read.   or alternatively tmp <- list(dummy = 1)[NULL] # Faster.   or tmp <- value[[i]][NULL] # As current coding.

To replace lines L558, L559, L560 by a single line: tmp <- 'length<-'(list(dummy = 1)[NULL], length(vs))

maelle commented 3 months ago

thanks! @krlmlr helped me shorten the code, I'll soon finish up the related PR.

clpippel commented 3 months ago

My thoughts. The proposed solution is indeed concise. It uses the match() function, which can be costly if the graph has many vertices. A similar solution can be achieved in base R by lapply() instead of map(). A base R solution using indexing by a vector is more verbose, but faster.

# -------------------------------
library(purrr)
library(microbenchmark)
n <- 1E5; m <- 1E2;
vs                <- seq(n)
index             <- seq(m)
names(index)      <- index
value             <- list(rev = as.list(rev(index)), nchar = as.list(nchar(index)) )

microbenchmark(
"This"  = { value_nw1 <- value
           for (i in seq_along(value)) { 
             tmp            <- value[[i]] 
             length(tmp)    <- 0 
             length(tmp)    <- length(vs) 
             tmp[index]     <- value[[i]] 
             value_nw1[[i]] <- tmp 
           } 
          },
"Purrr"  = { value_nw2 <- purrr::map(value, ~.x[match(seq_along(vs), index)]) },
"Rmatch" = { value_nw3 <- lapply(value, function(x) { x[match(seq_along(vs), index)]} ) },
"Rbase"  = { value_nw4 <-
  lapply(
    value,
    function(x)
      { tmp <- x[NULL]; length(tmp) <- length(vs); tmp[index] <- x; tmp}
  )
},
times= 100,
unit="relative"
)
# Unit: relative
#    expr      min       lq     mean   median       uq       max neval
#    This 9.187689 6.649874 4.097758 6.472774 5.603021 1.0953775   100
#   Purrr 4.552082 3.618245 2.618367 3.620971 3.551356 0.9894297   100
#  Rmatch 3.655005 3.473635 2.954367 3.458746 3.226840 1.0147469   100
#   Rbase 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000   100
identical(value_nw2, value_nw3)
# [1] TRUE
identical(value_nw1, value_nw4)
# [1] TRUE
maelle commented 2 months ago

note to self: use rlang's "purrr" standalone file

krlmlr commented 2 months ago

Thanks, good catch. Using match() indeed causes an overhead. Let's go with lapply() (or map() from https://github.com/r-lib/rlang/blob/main/R/standalone-purrr.R) plus the Rbase variant plus comments why we chose this solution: don't want to slow down code that is used for many operations.

For continuous benchmarking, https://github.com/lorenzwalthert/touchstone does a decent job, but we'd need a good set of fast test cases -- can be small initially and grow over time.

krlmlr commented 2 months ago

For reference, my own analysis, with two other variants.

n <- 1E5
m <- 1E2
vs <- seq(n)
index <- seq(m)
names(index) <- index
value <- list(rev = as.list(rev(index)), nchar = as.list(nchar(index)))

set_value_at <- function(value, idx, length_out) {
  out <- value[NULL]
  length(out) <- length_out
  out[idx] <- value
  unname(out)
}

bench::mark(
  "This" = {
    value_nw1 <- value
    for (i in seq_along(value)) {
      tmp <- value[[i]]
      length(tmp) <- 0
      length(tmp) <- length(vs)
      tmp[index] <- value[[i]]
      value_nw1[[i]] <- tmp
    }
    lapply(value_nw1, unname)
  },
  "Rmatch" = {
    value_nw2 <- lapply(value, function(x) {
      unname(x[match(seq_along(vs), index)])
    })
  },
  "vec_match" = {
    value_nw2 <- lapply(value, function(x) {
      unname(x[vctrs::vec_match(seq_along(vs), index)])
    })
  },
  "value_at" = {
    value_nw3 <- lapply(value, function(x) {
      set_value_at(x, index, length(vs))
    })
  },
  "Rbase" = {
    value_nw4 <- lapply(
      value,
      function(x) {
        tmp <- x[NULL]
        length(tmp) <- length(vs)
        tmp[index] <- x
        tmp
      }
    )
    lapply(value_nw4, unname)
  }
)
#> # A tibble: 5 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 This         2.79ms   3.01ms      316.    8.79MB     68.1
#> 2 Rmatch       1.48ms   1.76ms      543.    4.42MB    125. 
#> 3 vec_match    2.36ms   2.62ms      321.    4.99MB     72.9
#> 4 value_at   733.41µs   1.12ms      681.    4.61MB    176. 
#> 5 Rbase         774µs   1.15ms      759.    4.59MB    206.

Created on 2024-04-09 with reprex v2.1.0

krlmlr commented 2 months ago

Looking at the reprex results, the self-contained function doesn't seem to be that bad. With a larger input:

n <- 1E6
m <- 1E3
vs <- seq(n)
index <- seq(m)
names(index) <- index
value <- list(rev = as.list(rev(index)), nchar = as.list(nchar(index)))

set_value_at <- function(value, idx, length_out) {
  out <- value[NULL]
  length(out) <- length_out
  out[idx] <- value
  unname(out)
}

bench::mark(
  "This" = {
    value_nw1 <- value
    for (i in seq_along(value)) {
      tmp <- value[[i]]
      length(tmp) <- 0
      length(tmp) <- length(vs)
      tmp[index] <- value[[i]]
      value_nw1[[i]] <- tmp
    }
    lapply(value_nw1, unname)
  },
  "Rmatch" = {
    value_nw2 <- lapply(value, function(x) {
      unname(x[match(seq_along(vs), index)])
    })
  },
  "vec_match" = {
    value_nw2 <- lapply(value, function(x) {
      unname(x[vctrs::vec_match(seq_along(vs), index)])
    })
  },
  "value_at" = {
    value_nw3 <- lapply(value, function(x) {
      set_value_at(x, index, length(vs))
    })
  },
  "Rbase" = {
    value_nw4 <- lapply(
      value,
      function(x) {
        tmp <- x[NULL]
        length(tmp) <- length(vs)
        tmp[index] <- x
        tmp
      }
    )
    lapply(value_nw4, unname)
  }
)
#> # A tibble: 5 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 This        17.47ms   18.5ms      51.2      50MB     79.6
#> 2 Rmatch      21.03ms   22.5ms      44.3    42.2MB     44.3
#> 3 vec_match   29.86ms   33.8ms      29.9    49.6MB     39.8
#> 4 value_at     7.85ms    9.5ms     104.     45.8MB    173. 
#> 5 Rbase        9.34ms   11.5ms      89.6    45.8MB    122.

Created on 2024-04-09 with reprex v2.1.0

clpippel commented 2 months ago

If no named list is needed, set_value_at() can be simplified to

set_value_at2 <- function(value, idx, length_out) {
  out      <- vector(mode='list', length = length_out)
  out[idx] <- value
  out
}
clpippel commented 2 months ago

Another idea: Using magrittr s pipe %>%, assuming dependency is not a drawback. Update: not faster.

#-------------------------------------------------------------------------------
library(purrr)
library(bench)

n <- 1E6
m <- 1E3
vs <- seq(n)
index <- seq(m)
names(index) <- index
value <- list(rev = as.list(rev(index)), nchar = as.list(nchar(index)))

set_value_at <- function(value, idx, length_out) {
  out <- value[NULL]
  length(out) <- length_out
  out[idx] <- value
  unname(out)
}

bench::mark(
  "This" = {
    value_nw1 <- value
    for (i in seq_along(value)) {
      tmp <- value[[i]]
      length(tmp) <- 0
      length(tmp) <- length(vs)
      tmp[index] <- value[[i]]
      value_nw1[[i]] <- tmp
    }
    lapply(value_nw1, unname)
  },
  "Rmatch" = {
    value_nw2a <- lapply(value, function(x) {
      unname(x[match(seq_along(vs), index)])
    })
  },
  "vec_match" = {
    value_nw2b <- lapply(value, function(x) {
      unname(x[vctrs::vec_match(seq_along(vs), index)])
    })
  },
  "value_at" = {
    value_nw3a <- lapply(value, function(x) {
      set_value_at(x, index, length(vs))
    })
  },
  "pipe_at" = {
    value_nw3b <- lapply(value, function(x) { 
      vector(mode='list', length = length(vs)) %>% {.[index] <- x; .}
      # set_value_at(x, index, length(vs))
    })
  },
  "Rbase" = {
    value_nw4 <- lapply(
      value,
      function(x) {
        tmp <- x[NULL]
        length(tmp) <- length(vs)
        tmp[index] <- x
        tmp
      }
    )
    lapply(value_nw4, unname)
  }
)
# n = 1E6, m = 1E3
# A tibble: 6 × 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result           memory              time             gc                
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>           <list>              <list>           <list>            
# 1 This         23.9ms  24.37ms      40.1    45.8MB     45.8     7     8      175ms <named list [2]> <Rprofmem [91 × 3]> <bench_tm [15]>  <tibble [15 × 3]> 
# 2 Rmatch       22.6ms  22.75ms      43.8      42MB     73.0     6    10      137ms <named list [2]> <Rprofmem [49 × 3]> <bench_tm [16]>  <tibble [16 × 3]> 
# 3 vec_match    44.5ms  45.06ms      22.1    49.6MB     14.7     6     4      271ms <named list [2]> <Rprofmem [48 × 3]> <bench_tm [10]>  <tibble [10 × 3]> 
# 4 value_at     13.3ms   13.6ms      70.7    45.8MB     70.7    12    12      170ms <named list [2]> <Rprofmem [76 × 3]> <bench_tm [24]>  <tibble [24 × 3]> 
# 5 pipe_at       1.8ms   1.93ms     502.     15.3MB     82.9   115    19      229ms <named list [2]> <Rprofmem [42 × 3]> <bench_tm [134]> <tibble [134 × 3]>
# 6 Rbase          13ms   13.5ms      73.8    45.8MB     56.8    13    10      176ms <named list [2]> <Rprofmem [51 × 3]> <bench_tm [23]>  <tibble [23 × 3]> 
clpippel commented 2 months ago

Magrittr s pipe %>% is not faster as I thought before. Using no names seems to be the difference.

library(purrr)
library(bench)

n <- 1E6; m <- 1E3
vs <- seq(n)
index <- seq(m)
names(index) <- index
value <- list(rev = as.list(rev(index)), nchar = as.list(nchar(index)))

set_value_at <- function(value, idx, length_out) {
  out <- value[NULL]
  length(out) <- length_out
  out[idx] <- value
  unname(out)
}

set_value_at2 <- function(value, idx, length_out) {
  out      <- vector(mode='list', length = length_out)
  out[idx] <- value
  out
}

bench::mark(
  "at_unname()" = {
    value_nw3a <- lapply(value, function(x) {
      set_value_at(x, index, length(vs))
    })
  },
  "at_noname()" = {
    value_nw3a <- lapply(value, function(x) {
      set_value_at2(x, index, length(vs))
    })
  },
  "at_noname" = {
    value_nw3a <- lapply(value, function(x) {
      tmp <- vector(mode='list', length = length(vs)); tmp[index] <- x; tmp
    })
  },
  "at_pipe" = {
    value_nw3b <- lapply(value, function(x) { 
      vector(mode='list', length = length(vs)) %>% {.[index] <- x; .}
    })
  }
)

# A tibble: 4 × 13
#   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
# 1 at_unname… 13.31ms  13.7ms      73.3    45.8MB     178.     7    17     95.5ms
# 2 at_noname…  1.79ms  2.68ms     393.     15.3MB     174.    68    30    172.9ms
# 3 at_noname   1.79ms  2.98ms     365.     15.3MB     176.    60    29    164.6ms
# 4 at_pipe     1.81ms  2.87ms     396.     15.3MB     219.    56    31    141.5ms
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
maelle commented 2 months ago

see current state of #1330