Closed trinker closed 10 years ago
It might depend on the size of the lookup table. You might want to vary the lookup table size when doing your benchmarking.
I'll do some more investigating on varying sized vectors. I may use an ifelse
and handle based on these findings.
Good call Dason. Here's a larger vector:
syn <- list2df(as.list(env.syl))[, 1]
names(syn) <- list2df(as.list(env.syl))[, 2]
dat <- bag_o_words(raj$dialogue)
vect=syn[dat]
hash=hash_look(dat, env.syl)
Unit: milliseconds
expr min lq median uq max neval
vect 230.3721 230.6975 231.3035 232.3678 251.2343 20
hash 137.1972 143.8174 145.0490 150.2753 197.6331 20
Here are the lengths for the 2 vectors:
sapply(list(
bag_o_words(pres_debate_raw2012$dialogue),
bag_o_words(raj$dialogue)), length)
[1] 8309 23783
So 8,309
vs 23,783
or 4 times the size of the first one.
Ok so using some replication I get that there's a difference in performance after about 15000 words:
Code
library (pacman)
p_load(embodied, matlab, dplyr)
dat <- bag_o_words(raj$dialogue)
n <- c(seq(7000, 20000, by=1000), length(dat))
n <- rep(n, each=2)
pars <- c(replicate(length(n)/2, sample(c(TRUE, FALSE), 2)))
FUN <- function() {
comps <- lapply(1:length(n), function(i) {
cat(paste("\n=============\n", i, " in ", length(n), "\n=============\n\n"))
tic(gcFirst=FALSE)
if (pars[i]) {
syn[dat[1:c(n[i])]]
} else {
hash_look(dat[1:c(n[i])], env.syl)
}
as.numeric(gsub("[a-z]|\\s+", "", capture.output(toc(echo=TRUE) )))
})
data.frame(Approach = c("hash", "vector")[as.numeric(pars) + 1],
n = n, time = unlist(comps))
}
out <- lapply(1:100, function(i) FUN())
dat2 <- do.call(rbind, out)
dat3 <- dat2 %.%
group_by(Approach, n) %.%
summarise(ave=mean(time)) %.%
arrange(n, Approach)
dat3
ggplot(dat2, aes(x=n, y=time, group=Approach, colour=Approach)) +
geom_point(size=2, alpha=.2, shape=19) +
geom_jitter(position = position_jitter(width=200, height = 0)) +
geom_line(data=dat3, aes(x=n, y=ave, group=Approach,
colour=Approach), size=1.5) +
ggtitle("Vector vs. Hash Approaches to lookup") +
xlab("Number of Words") + ylab("Time (sec)")
ggsave("output.png", height=5, width=7)
Initial testing indicates that a vector lookup may be a better choice than hashing for items such as syllable lookup: