trinker / qdap

Quantitative Discourse Analysis Package: Bridging the gap between qualitative data and quantitative analysis
http://cran.us.r-project.org/web/packages/qdap/index.html

Vector indexing vs. hashing (envir) #149

Closed: trinker closed this issue 10 years ago

trinker commented 10 years ago

Initial testing indicates that a vector lookup may be a better choice than hashing for items such as syllable lookup:

syn <- list2df(as.list(env.syl))[, 1]
names(syn)  <- list2df(as.list(env.syl))[, 2]

dat <- bag_o_words(pres_debate_raw2012$dialogue)

vect=syn[dat]
hash=hash_look(dat, env.syl)

Unit: milliseconds
 expr      min       lq   median       uq      max neval
 vect 35.46802 35.54943 35.66862 35.74349 36.67534   100
 hash 45.01694 45.22314 45.39621 45.70317 49.95538   10
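
The "Unit: milliseconds" table above reads like microbenchmark output; assuming that package was used, the comparison was presumably along these lines (a sketch, not the exact call):

library(microbenchmark)

## time vector indexing against the environment-backed hash lookup on the same words
microbenchmark(
    vect = syn[dat],
    hash = hash_look(dat, env.syl)
)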
Dasonk commented 10 years ago

It might depend on the size of the lookup table. You might want to vary the lookup table size when doing your benchmarking.
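
If it helps, shrinking the table itself could look roughly like this (a sketch only; keys, small_env, small_syn, and dat_small are made-up names, and the 5,000-entry cut is arbitrary):

## build a smaller lookup environment and a matching named vector
keys      <- ls(env.syl)[1:5000]
small_env <- list2env(mget(keys, envir = env.syl), envir = new.env(hash = TRUE))
small_syn <- syn[names(syn) %in% keys]
dat_small <- dat[dat %in% keys]    # keep only words the smaller table covers

microbenchmark::microbenchmark(
    vect = small_syn[dat_small],
    hash = hash_look(dat_small, small_env)
)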

trinker commented 10 years ago

I'll do some more investigating on varying sized vectors. I may use an ifelse and handle based on these findings.
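
A rough sketch of that ifelse dispatch (the function name and cutoff are placeholders; the benchmarking is what should pin down the actual threshold):

## sketch of a size-based dispatcher between the two lookup strategies
lookup_syllables <- function(words, cutoff = 10000) {
    if (length(words) < cutoff) {
        syn[words]                     # plain vector indexing
    } else {
        hash_look(words, env.syl)      # environment (hash) lookup
    }
}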

trinker commented 10 years ago

Good call, Dason. Here's a larger vector:

syn <- list2df(as.list(env.syl))[, 1]
names(syn)  <- list2df(as.list(env.syl))[, 2]

dat <- bag_o_words(raj$dialogue)

vect=syn[dat]
hash=hash_look(dat, env.syl) 
Unit: milliseconds
 expr      min       lq   median       uq      max neval
 vect 230.3721 230.6975 231.3035 232.3678 251.2343    20
 hash 137.1972 143.8174 145.0490 150.2753 197.6331    20

Here are the lengths for the 2 vectors:

sapply(list(
    bag_o_words(pres_debate_raw2012$dialogue), 
    bag_o_words(raj$dialogue)), length)
[1]  8309 23783

So 8,309 vs. 23,783: the second vector is nearly three times the size of the first.

trinker commented 10 years ago

OK, so using some replication I find that the two approaches diverge at about 15,000 words, with hashing pulling ahead beyond that point:

Code

library(pacman)
p_load(embodied, matlab, dplyr, ggplot2)  # ggplot2 is needed for the plot at the end

## named lookup vector, built as in the earlier comments
syn <- list2df(as.list(env.syl))[, 1]
names(syn)  <- list2df(as.list(env.syl))[, 2]

dat <- bag_o_words(raj$dialogue)

## word counts to test: 7,000-20,000 in steps of 1,000, plus the full vector;
## each size is repeated so both approaches get timed at every size
n <- c(seq(7000, 20000, by=1000), length(dat))
n <- rep(n, each=2)

## randomly order the two approaches within each size pair
## (TRUE = vector indexing, FALSE = hash lookup)
pars <- c(replicate(length(n)/2, sample(c(TRUE, FALSE), 2)))

FUN <- function() {
    comps <- lapply(seq_along(n), function(i) {
        cat(paste("\n=============\n", i, " in ", length(n), "\n=============\n\n"))
        tic(gcFirst=FALSE)

        ## time a single lookup on the first n[i] words
        if (pars[i]) {
            syn[dat[1:n[i]]]
        } else {
            hash_look(dat[1:n[i]], env.syl)
        }

        ## pull the elapsed seconds out of matlab's toc() output
        as.numeric(gsub("[a-z]|\\s+", "", capture.output(toc(echo=TRUE))))
    })

    ## pars is TRUE for the vector approach, FALSE for the hash approach
    data.frame(Approach = c("hash", "vector")[as.numeric(pars) + 1],
        n = n, time = unlist(comps))
}

## repeat the whole timing sweep 100 times
out <- lapply(1:100, function(i) FUN())

dat2 <- do.call(rbind, out)

## average time per approach and size (%.% was dplyr's chaining operator at the time)
dat3 <- dat2 %.%
    group_by(Approach, n) %.%
    summarise(ave=mean(time)) %.%
    arrange(n, Approach)
dat3

ggplot(dat2, aes(x=n, y=time, group=Approach, colour=Approach)) + 
     geom_point(size=2, alpha=.2, shape=19) + 
     geom_jitter(position = position_jitter(width=200, height = 0)) +
     geom_line(data=dat3, aes(x=n, y=ave, group=Approach, 
        colour=Approach), size=1.5) +
    ggtitle("Vector vs. Hash Approaches to lookup") +
    xlab("Number of Words") + ylab("Time (sec)")

ggsave("output.png", height=5, width=7)