Open pierreroudier opened 6 years ago
I'd better do some proper testing, because the gain is not obvious:
# cluster::daisy
system.time( as.matrix(cluster::daisy(mtcars, metric = "gower") ))
user system elapsed
0.002 0.000 0.003
# gower::gower
system.time( lapply(1:32, function(i) gower::gower_dist(mtcars[i, , drop = FALSE], mtcars[-i, ])) )
user system elapsed
0.049 0.000 0.019
This needs to be confirmed for larger datasets (e.g. n > 1000).
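One thing to keep in mind when comparing the two calls: the `lapply` version returns a list of per-row distance vectors, while `as.matrix(cluster::daisy(...))` returns a full n x n matrix. Below is a minimal sketch of how the row-wise results could be assembled into a comparable matrix. The helper names `rowwise_to_matrix` and `gower_numeric_row` are hypothetical, and `gower_numeric_row` is only a base R stand-in for numeric columns (range-scaled mean absolute difference), not the code of either package.

```r
# Hypothetical helper: build a full n x n distance matrix from row-wise calls.
# `row_dist(x_row, others)` is assumed to return one distance per row of `others`.
rowwise_to_matrix <- function(df, row_dist) {
  n <- nrow(df)
  out <- matrix(0, n, n, dimnames = list(rownames(df), rownames(df)))
  for (i in seq_len(n)) {
    out[i, -i] <- row_dist(df[i, , drop = FALSE], df[-i, , drop = FALSE])
  }
  out
}

# Base R stand-in for a row-wise Gower distance (numeric columns only).
# Each column's range is taken over the query row and the other rows together.
gower_numeric_row <- function(x_row, others) {
  all_rows <- rbind(x_row, others)
  rng <- vapply(all_rows, function(v) diff(range(v)), numeric(1))
  rng[rng == 0] <- 1  # constant columns then contribute a distance of zero
  diffs <- abs(sweep(as.matrix(others), 2, unlist(x_row[1, ])))
  rowMeans(sweep(diffs, 2, rng, "/"))
}

m <- rowwise_to_matrix(mtcars, gower_numeric_row)
dim(m)
```

If the gower package is installed, passing `gower::gower_dist` as `row_dist` and checking the result against `as.matrix(cluster::daisy(mtcars, metric = "gower"))` with `all.equal()` would be one way to confirm that both timings measure the same computation.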
It looks like there is a benefit for larger datasets:
# N = 5000
# (timing below was recorded with mtcars[-i, ] as the second argument)
system.time( lapply(1:5000, function(i) gower::gower_dist(ggplot2::diamonds[i, , drop = FALSE], ggplot2::diamonds[1:5000, ][-i, ])) )
user system elapsed
1.113 0.000 1.113
system.time( as.matrix(cluster::daisy(ggplot2::diamonds[1:5000,], metric = "gower") ))
user system elapsed
3.125 0.661 3.786
# N = 10000
# (timing below was recorded with mtcars[-i, ] as the second argument)
system.time( lapply(1:10000, function(i) gower::gower_dist(ggplot2::diamonds[i, , drop = FALSE], ggplot2::diamonds[1:10000, ][-i, ])) )
user system elapsed
2.242 0.000 2.243
system.time( as.matrix(cluster::daisy(ggplot2::diamonds[1:10000,], metric = "gower") ))
user system elapsed
10.844 2.800 13.649
# N = 20000
system.time( as.matrix(cluster::daisy(ggplot2::diamonds[1:20000,], metric = "gower") ))
user system elapsed
74.550 12.313 86.906
# (timing below was recorded with mtcars[-i, ] as the second argument)
system.time( lapply(1:20000, function(i) gower::gower_dist(ggplot2::diamonds[i, , drop = FALSE], ggplot2::diamonds[1:20000, ][-i, ])) )
user system elapsed
4.540 0.000 4.541
This is great! How much trouble would it be to replace daisy with gower_dist? And what about the variable standardization that is built into cluster::daisy?
@dylanbeaudette the only drawback I can think of is the loss of the other distances (the euclidean and manhattan options in daisy).
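For all-numeric data, those two metrics are also available from base R through `stats::dist`, so they would not necessarily be lost. A minimal sketch, assuming numeric columns only and no standardisation; the wrapper name `pairwise_numeric` is hypothetical:

```r
# Hypothetical wrapper: euclidean / manhattan distances via stats::dist.
# Assumes every column is numeric; mixed-type data would still need Gower.
pairwise_numeric <- function(df, metric = c("euclidean", "manhattan")) {
  metric <- match.arg(metric)
  stopifnot(all(vapply(df, is.numeric, logical(1))))
  as.matrix(stats::dist(df, method = metric))
}

d_euc <- pairwise_numeric(mtcars, "euclidean")
d_man <- pairwise_numeric(mtcars, "manhattan")
```

This keeps the Gower path separate from the purely numeric metrics, at the cost of maintaining a small dispatch layer.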
I would think the standardisation would be fast and easy to implement (famous last words!)?
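As a rough idea of what that could look like: the `stand = TRUE` option in `cluster::daisy` is documented as subtracting each variable's mean and dividing by its mean absolute deviation. Below is a minimal base R sketch of that scaling, applied to numeric columns only. The function name `standardise_mad` is hypothetical, and the handling of constant columns is an assumption.

```r
# Hypothetical helper: centre numeric columns on their mean and scale them
# by the mean absolute deviation; non-numeric columns are left untouched.
standardise_mad <- function(df) {
  num <- vapply(df, is.numeric, logical(1))
  if (any(num)) {
    df[num] <- lapply(df[num], function(v) {
      centre <- mean(v, na.rm = TRUE)
      mad_mean <- mean(abs(v - centre), na.rm = TRUE)
      # Assumption: a constant column is only centred, to avoid dividing by zero
      if (mad_mean == 0) v - centre else (v - centre) / mad_mean
    })
  }
  df
}

scaled <- standardise_mad(mtcars)
```

Note that for the Gower metric both packages already scale numeric variables by their range, so a step like this would mainly matter for the euclidean and manhattan cases.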
By the sounds of it the gower package should be faster.