slycoder / R-lda

Latent Dirichlet allocation package for R

Investigate improving performance #8

Closed: slycoder closed this 7 years ago

slycoder commented 8 years ago

This is apparently significantly faster:

https://github.com/dselivanov/text2vec/blob/0.4/src/LDA_gibbs.cpp

Should figure out which of the removed things had such an impact.

dselivanov commented 8 years ago

Also, I think it would not be too hard to implement parallel LDA fitting using the AD-LDA scheme. I haven't investigated deeply, but it seems like it should be straightforward. I'd be happy to contribute.
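To make the AD-LDA idea concrete, here is a rough sketch of just the merge step as I understand the scheme: each worker runs a Gibbs sweep over its own documents against a private copy of the topic-word counts, and afterwards the global counts are updated with the sum of every worker's changes. The function name `ad_lda_merge` and the toy matrices are made up for illustration; this is not code from lda or text2vec.

```r
# Hypothetical helper (not part of lda or text2vec): AD-LDA style merge.
# `global` is the topic-word count matrix every worker started the sweep with;
# `locals` is a list of the matrices the workers ended the sweep with.
ad_lda_merge <- function(global, locals) {
  stopifnot(is.matrix(global), length(locals) > 0)
  # Each worker's contribution is the change it made to its private copy.
  deltas <- lapply(locals, function(local) local - global)
  # New global counts = old global counts + sum of all workers' changes.
  global + Reduce(`+`, deltas)
}

# Toy example: 2 topics (rows) x 3 words (columns), two workers.
global <- matrix(c(2, 1, 0, 3, 1, 1), nrow = 2)

# Worker 1 moved one token of word 1 from topic 1 to topic 2.
worker1 <- global
worker1[1, 1] <- 1
worker1[2, 1] <- 2

# Worker 2 moved one token of word 3 from topic 1 to topic 2.
worker2 <- global
worker2[1, 3] <- 0
worker2[2, 3] <- 2

merged <- ad_lda_merge(global, list(worker1, worker2))
```

The total token count is unchanged by the merge, which is a cheap sanity check for a real implementation. The per-worker sweeps themselves would still be the existing sampler, just run on a subset of the documents.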

dselivanov commented 8 years ago

@slycoder also, I discovered your rtm package, which is supposed to implement sparseLDA (described by Mimno et al.). What is the status of this work?

slycoder commented 7 years ago

Ok, cool, I've found the issue and committed to master. At least in the little benchmark I put together, the time went from around 4.2s to 1.7s. Not too shabby!

slycoder commented 7 years ago

I'll be pushing this out to CRAN as soon as I get a chance.

As for rtm, I haven't put much effort into it unfortunately and probably won't get the time to do so any time soon =(.

dselivanov commented 7 years ago

Thanks for the investigation!

slycoder commented 7 years ago

I assumed it would be inlined and optimized but I had too much faith in clang perhaps.

dselivanov commented 7 years ago

Still slower...

# library(devtools)
# install_github("slycoder/R-lda-deprecated@1f872be29e09e513621aa13f9608ff4f864d598e")
# install_github("dselivanov/text2vec@22c08f4094e5a760aa8f6a975d9e315c597b40de")
library(text2vec)
library(lda)
data("movie_review")
it = itoken(movie_review$review, tolower, word_tokenizer)
v = create_vocabulary(it) %>% 
  prune_vocabulary(term_count_min = 20)
dtm = create_dtm(it, vocab_vectorizer(v), type = "lda_c")

K = 100
alpha = 1/K
eta = 1/K
n_iter = 10

# Benchmark 1: text2vec's LDA
lda = LDA$new(K, v)
lda$verbose = TRUE
set.seed(1)
system.time({
  lda$fit(dtm, n_iter = n_iter, check_convergence_every_n = 0 )
})
#user  system elapsed 
#4.577   0.013   4.594 

# Benchmark 2: lda's collapsed Gibbs sampler on the same document-term data
set.seed(1)
system.time({
  m = lda.collapsed.gibbs.sampler(documents = dtm, K = K, vocab = v$vocab$terms,
                                  num.iterations = n_iter, alpha = alpha, eta = eta,
                                  compute.log.likelihood = FALSE, trace = 2L)
})
#user  system elapsed 
#8.494   0.037   8.550

slycoder commented 7 years ago

Weird. I just tried your code and got:

   user  system elapsed
  3.527   0.012   3.518

   user  system elapsed
  3.760   0.017   3.739
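Since single system.time() calls can be noisy, a small sketch like the one below repeats a timing and looks at the median, which makes runs on different machines easier to compare. The helper name `time_runs` is hypothetical and the workload is only a stand-in; in practice the function passed in would wrap the lda$fit(...) or lda.collapsed.gibbs.sampler(...) call from the benchmark.

```r
# Hypothetical helper: call `f` several times and return the elapsed
# wall-clock time of each call, in seconds.
time_runs <- function(f, times = 5L) {
  vapply(seq_len(times), function(i) {
    system.time(f())[["elapsed"]]
  }, numeric(1))
}

# Stand-in workload; swap in the real fitting call to benchmark it.
workload <- function() sum(sqrt(seq_len(1e6)))

elapsed <- time_runs(workload, times = 5L)
median(elapsed)
```

Calling set.seed() inside the wrapped function keeps the sampler's random draws identical across repetitions, so differences come from timing noise rather than from different sampling paths.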

dselivanov commented 7 years ago

Hm, let me double check...

dselivanov commented 7 years ago

Really strange - got

# user  system elapsed 
#  5.358   0.011   5.369 

and

#   user  system elapsed 
# 10.103   0.046  10.185

Maybe the answer is in the compile flags (so maybe the compiler can optimize text2vec better)? I have pretty aggressive optimizations in ~/.R/Makevars:

CXX1XFLAGS += -march=native -ffast-math -Ofast -mtune=native
CXXFLAGS += -march=native -ffast-math -Ofast -mtune=native
CFLAGS += -march=native -ffast-math -Ofast -mtune=native

slycoder commented 7 years ago

Interesting. Yeah, when I add those flags things get much slower. Lemme try to figure out which flag is the culprit.

slycoder commented 7 years ago

Ha, maybe most of the flags are the culprit? Here's what I got trying flags on their own.

Nothing: 1.748
-Ofast: 2.906
-march=native: 3.52
-mtune=native: 1.798
-ffast-math: 2.920

slycoder commented 7 years ago

(This is on a slightly simpler test that I just concocted)

slycoder commented 7 years ago

I'm curious: when you get a chance, could you test with different flags to see if I'm going crazy? =)

dselivanov commented 7 years ago

I can confirm: anything except the default -O2 (I even tried -O1) slows things down from 4.4 sec to 7.5+ sec. More options = more runtime :-D. Weird!

dselivanov commented 7 years ago

@slycoder I switched from Apple clang++/clang to gcc-6/g++-6. This solved all the strange problems!

slycoder commented 7 years ago

FYI, just made this change which seems a tiny bit faster for the default and a lot faster with -march=native:

https://github.com/slycoder/R-lda-deprecated/commit/75172ee06ed66ae2a1b2614a28aa067197cc1520

dselivanov commented 7 years ago

With gcc-6 I didn't notice any difference from the previous commit. On my system, the best results on the example above are with gcc-6 and CFLAGS += -march=native -mtune=native -mavx -ffast-math -O3 (actually, -ffast-math -O3 is enough).

Small benchmark

Aggressive options, CFLAGS += -march=native -mtune=native -mavx -ffast-math -O3:

gcc-6:
https://github.com/slycoder/R-lda-deprecated/commit/63df45ac9b1cd5b9f4b3bce9eb07f45bc8e96a65 - 7.2 sec
https://github.com/slycoder/R-lda-deprecated/commit/75172ee06ed66ae2a1b2614a28aa067197cc1520 - 3.5 sec

clang:
https://github.com/slycoder/R-lda-deprecated/commit/63df45ac9b1cd5b9f4b3bce9eb07f45bc8e96a65 - 11.479 sec
https://github.com/slycoder/R-lda-deprecated/commit/75172ee06ed66ae2a1b2614a28aa067197cc1520 - 4.5 sec

clang-3.8:
https://github.com/slycoder/R-lda-deprecated/commit/63df45ac9b1cd5b9f4b3bce9eb07f45bc8e96a65 - 11.327 sec
https://github.com/slycoder/R-lda-deprecated/commit/75172ee06ed66ae2a1b2614a28aa067197cc1520 - 4.4 sec

Default options, CFLAGS = -mtune=core2 -O2:

gcc-6:
https://github.com/slycoder/R-lda-deprecated/commit/63df45ac9b1cd5b9f4b3bce9eb07f45bc8e96a65 - 8.4 sec
https://github.com/slycoder/R-lda-deprecated/commit/75172ee06ed66ae2a1b2614a28aa067197cc1520 - 4.6 sec

clang:
https://github.com/slycoder/R-lda-deprecated/commit/63df45ac9b1cd5b9f4b3bce9eb07f45bc8e96a65 - 11.1 sec
https://github.com/slycoder/R-lda-deprecated/commit/75172ee06ed66ae2a1b2614a28aa067197cc1520 - 4.6 sec

clang-3.8:
https://github.com/slycoder/R-lda-deprecated/commit/63df45ac9b1cd5b9f4b3bce9eb07f45bc8e96a65 - 11.1 sec
https://github.com/slycoder/R-lda-deprecated/commit/75172ee06ed66ae2a1b2614a28aa067197cc1520 - 4.4 sec