pierreroudier / clhs

An R implementation of the conditioned Latin Hypercube Sampling method

Switch Gower's distance computation to the `gower` package #8

Open pierreroudier opened 6 years ago

pierreroudier commented 6 years ago

By the sounds of it the gower package should be faster.

pierreroudier commented 6 years ago

I'd better do some proper testing, because the gain is not obvious:

# cluster::daisy
system.time( as.matrix(cluster::daisy(mtcars, metric = "gower") ))
   user  system elapsed 
  0.002   0.000   0.003 

# gower::gower
system.time( lapply(1:32, function(i) gower::gower_dist(mtcars[i, , drop = F], mtcars[-i, ])) )
   user  system elapsed 
  0.049   0.000   0.019 

This needs to be confirmed for larger datasets (e.g. n > 1000).

pierreroudier commented 6 years ago

It looks like there is a benefit for larger datasets:

# N = 5000
system.time( lapply(1:5000, function(i) gower::gower_dist(ggplot2::diamonds[i, , drop = F], mtcars[-i, ])) )
   user  system elapsed 
  1.113   0.000   1.113 
system.time( as.matrix(cluster::daisy(ggplot2::diamonds[1:5000,], metric = "gower") ))
   user  system elapsed 
  3.125   0.661   3.786 

# N = 10000
system.time( lapply(1:10000, function(i) gower::gower_dist(ggplot2::diamonds[i, , drop = F], mtcars[-i, ])) )
   user  system elapsed 
  2.242   0.000   2.243 
system.time( as.matrix(cluster::daisy(ggplot2::diamonds[1:10000,], metric = "gower") ))
   user  system elapsed 
 10.844   2.800  13.649 

pierreroudier commented 6 years ago

# N = 20000
system.time( as.matrix(cluster::daisy(ggplot2::diamonds[1:20000,], metric = "gower") ))
   user  system elapsed 
 74.550  12.313  86.906 
system.time( lapply(1:20000, function(i) gower::gower_dist(ggplot2::diamonds[i, , drop = F], mtcars[-i, ])) )
   user  system elapsed 
  4.540   0.000   4.541
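
Note that the gower_dist calls on diamonds above pass mtcars[-i, ] as the second argument, while the daisy calls work on diamonds only, so the two timings may not measure the same computation. A like-for-like comparison could be sketched as below. The helper names gower_matrix and compare_timings are hypothetical (they are not part of either package), and no timings are claimed for this version.

```r
# Hypothetical helper: full n x n Gower distance matrix, built one row at a
# time with gower::gower_dist (which recycles the single-row x against y).
gower_matrix <- function(df) {
  n <- nrow(df)
  vapply(
    seq_len(n),
    function(i) gower::gower_dist(df[i, , drop = FALSE], df),
    numeric(n)
  )
}

# Hypothetical helper: time both approaches on the same data frame.
compare_timings <- function(df) {
  list(
    gower = system.time(gower_matrix(df)),
    daisy = system.time(as.matrix(cluster::daisy(df, metric = "gower")))
  )
}

# Usage (not run here, requires the gower, cluster and ggplot2 packages):
# compare_timings(as.data.frame(ggplot2::diamonds[1:5000, ]))
```

The two packages may not treat every column type identically (for example the ordered factors in diamonds), so it would be worth comparing the resulting distances as well as the timings.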

dylanbeaudette commented 6 years ago

This is great! How much trouble would it be to replace daisy with gower_dist? And what about the variable standardization that is built into cluster::daisy?

pierreroudier commented 6 years ago

@dylanbeaudette the only drawback I can think of is the loss of the other distances (euclidean and manhattan options in daisy).
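
One possible way to keep those options, sketched here under the assumption that the euclidean and manhattan cases only need numeric input, would be to dispatch to stats::dist for them and use gower_dist for the Gower case. The function name pairwise_dist is hypothetical and not part of clhs.

```r
# Hypothetical dispatcher: Gower via the gower package, other metrics via
# base R's stats::dist. Returns an n x n matrix in both cases.
pairwise_dist <- function(df, metric = c("gower", "euclidean", "manhattan")) {
  metric <- match.arg(metric)
  if (metric == "gower") {
    n <- nrow(df)
    vapply(
      seq_len(n),
      function(i) gower::gower_dist(df[i, , drop = FALSE], df),
      numeric(n)
    )
  } else {
    # data.matrix() converts factors to integer codes, which is only
    # meaningful for these metrics when all columns are numeric.
    as.matrix(stats::dist(data.matrix(df), method = metric))
  }
}
```

For example, pairwise_dist(mtcars, "manhattan") would go through stats::dist and never touch the gower package.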

I would think the standardisation would be fast and easy to implement (famous last words!)?
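
A minimal sketch of what that standardisation could look like, assuming the range-based scaling that Gower's distance uses for numeric variables (each numeric column mapped to [0, 1]). The function name range_standardise is hypothetical, and it assumes every numeric column has at least one non-missing value. It is worth checking the gower documentation first, since gower_dist may already apply this scaling internally.

```r
# Hypothetical helper: rescale every numeric column to [0, 1] by its range.
# Non-numeric columns (factors, characters, logicals) are left untouched.
range_standardise <- function(df) {
  is_num <- vapply(df, is.numeric, logical(1))
  df[is_num] <- lapply(df[is_num], function(x) {
    r <- range(x, na.rm = TRUE)
    # A constant column has zero range: return zeros (NAs are preserved).
    if (r[1] == r[2]) return(x - r[1])
    (x - r[1]) / (r[2] - r[1])
  })
  df
}

# Usage: numeric columns are rescaled, Species stays a factor.
scaled <- range_standardise(iris)
```

The stand = TRUE option in daisy for the euclidean and manhattan metrics uses a different scaling (mean and mean absolute deviation), so it would need its own helper.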