quanteda / quanteda

An R package for the Quantitative Analysis of Textual Data
https://quanteda.io
GNU General Public License v3.0

Consider parallelizing tokenization #1965

Open koheiw opened 4 years ago

koheiw commented 4 years ago

There is a package called future.apply which provides parallelized apply-type functions. It seems that we can parallelize tokenization with future_lapply().

require(quanteda)
require(future.apply)
plan(multiprocess)

> corp <- readRDS("/home/kohei/Dropbox/Brexit/Data/data_corpus_guardian_2016.RDS")
> #corp <- quanteda.corpora::download("data_corpus_guardian")
> txt <- texts(corp)
> length(txt)
[1] 60045
> 
> chunks <- split(txt, rep_len(1:4, length(txt)))
> microbenchmark::microbenchmark(
+   do.call(c, future_lapply(chunks, tokens, remove_separator = FALSE)),
+   quanteda::tokens(txt, remove_separator = FALSE),
+   times = 1
+ )
Unit: seconds
                                                                 expr       min        lq      mean    median        uq       max neval
 do.call(c, future_lapply(chunks, tokens, remove_separator = FALSE))  68.73448  68.73448  68.73448  68.73448  68.73448  68.73448     1
                     quanteda::tokens(txt, remove_separator = FALSE) 105.10601 105.10601 105.10601 105.10601 105.10601 105.10601     1
kbenoit commented 4 years ago

I saw a presentation on this remarkable package at rstudio::conf in January, and I was thinking exactly the same!

koheiw commented 4 years ago

I experimented with parallelization in different ways. My initial idea was to call stri_split_boundaries() in parallel, but it was slower, probably because of the large object size (a list of character vectors). So serialization should be done before returning the tokens, which means we have to reassign token IDs very quickly when combining multiple objects. tokens.c() is not terribly slow, but it could be done more efficiently.

kbenoit commented 4 years ago

I found the same using other parallelisation methods (from R): the vectorized stri_split_boundaries() was still fastest.

Still, we batch the tokens now; maybe that batching could be executed in parallel?

koheiw commented 4 years ago

Yes, this part will be the target of parallelization.

https://github.com/quanteda/quanteda/blob/70ceece7f93901e60e7cd67fe88ff97d17306e68/R/tokens.R#L274-L286

but serialization depends on attr(x[[i - 1]], "types"), so we need to make recompilation very fast, especially

https://github.com/quanteda/quanteda/blob/70ceece7f93901e60e7cd67fe88ff97d17306e68/R/tokens-methods-base.R#L184-L185

I think I can do this in C++.

koheiw commented 4 years ago
require(quanteda)

# step 1 (could run in parallel): tokenize and serialize each chunk in R
toks1 <- tokens("a b c")
toks2 <- tokens("d b c")

# serial step: merge the per-chunk type tables into one global table
type1 <- types(toks1)
type2 <- types(toks2)
type <- unique(c(type1, type2))

# step 2 (could run in parallel, eventually in C++): remap token IDs to the global table
map1 <- match(type1, type)
toks1 <- lapply(unclass(toks1), function(x) map1[x])
map2 <- match(type2, type)
toks2 <- lapply(unclass(toks2), function(x) map2[x])

# combine the remapped chunks and rebuild a tokens object
toks <- c(toks1, toks2)
quanteda:::build_tokens(toks, type, docvars = quanteda:::make_docvars(length(toks)))
koheiw commented 4 years ago

Actually, parallel C++ or R is not faster than serial R for simple remapping.

require(quanteda)
require(future.apply)

corp <- readRDS("~/Dropbox/Public/data_corpus_guardian.rds")
toks <- tokens(corp)

type <- types(toks)
toks_ns <- tokens_remove(toks, stopwords("en"), padding = TRUE)
map <- match(types(toks_ns), type)

out1 <- lapply(unclass(toks_ns), function(x, y) y[x + 1], c(0, map))
out2 <- quanteda:::qatd_cpp_tokens_remap(toks_ns, type, map)
all(unclass(out1)[[1]] == unclass(out2)[[1]])

microbenchmark::microbenchmark(
    r1 = lapply(unclass(toks_ns), function(x, y) y[x + 1], c(0, map)),
    r2 = future_lapply(unclass(toks_ns), function(x, y) y[x + 1], c(0, map)),
    c = quanteda:::qatd_cpp_tokens_remap(toks_ns, type, map),
    times = 10
)
Unit: milliseconds
 expr       min       lq     mean    median        uq       max neval
   r1  57.46527  77.2548 126.5122  82.37405  96.35264  515.8249    10
   r2 737.69280 762.3968 918.2059 817.48856 934.99906 1456.0851    10
    c 101.59673 130.9325 162.7712 133.56834 193.79469  285.9912    10
koheiw commented 4 years ago

@kbenoit please try tokens_parallel() in the dev-tokens_parallel branch. It seems about three times faster on a machine with 4 cores.

require(quanteda)
require(future)
corp <- readRDS("~/Dropbox/Public/data_corpus_guardian2016.RDS")
txt <- head(texts(corp), 50000)
plan(multiprocess)
microbenchmark::microbenchmark(
    tokens_parallel(txt),
    tokens(txt),
    times = 10
)
Unit: seconds
                 expr       min        lq      mean    median        uq       max neval
 tokens_parallel(txt)  54.20247  54.48312  56.63494  54.93698  60.25164  62.04019    10
          tokens(txt) 142.66098 144.56481 147.77326 147.93417 150.82376 153.92902    10
kbenoit commented 4 years ago

I get, on macOS, running latest R 4.0:

> plan(multiprocess)
Warning message:
[ONE-TIME WARNING] Forked processing ('multicore') is disabled in future (>= 1.13.0) when running R from RStudio, because it is considered unstable. Because of this, plan("multicore") will fall back to plan("sequential"), and plan("multiprocess") will fall back to plan("multisession") - not plan("multicore") as in the past. For more details, how to control forked processing or not, and how to silence this warning in future R sessions, see ?future::supportsMulticore 

Then I get this output:

                 expr      min       lq     mean   median       uq      max neval cld
 tokens_parallel(txt) 13.47222 13.58307 15.43100 13.87595 14.26223 28.77037    10  a 
          tokens(txt) 25.30120 25.42088 26.05501 25.89671 26.46597 27.73434    10   b

which is nearly 2x as fast despite RStudio disabling multicore!

details:

> data_corpus_guardian
Corpus consisting of 177,115 documents and 9 docvars.
koheiw commented 4 years ago

We might need to set options(mc.cores = 8) for future_lapply() too, but this is already promising! The advantage of the parallel lapply becomes greater as the number of documents (ndoc) grows.

koheiw commented 4 years ago

Interestingly, parallel::mclapply() outperformed future_lapply() when executed via Rscript.

Unit: seconds
                                   expr       min        lq      mean    median        uq       max neval
                       tokens_test(txt) 100.38170 102.46677 106.11569 103.97826 109.62283 114.44231    10
 tokens_test(txt, FUN = future_lapply)  43.63455  46.37512  50.62325  47.57711  55.19460  61.03482    10
      tokens_test(txt, FUN = mclapply)  38.51771  41.45957  44.54972  43.83123  46.56468  54.70625    10
kbenoit commented 4 years ago

Outside of RStudio (plain R console):

                 expr      min        lq     mean    median       uq      max neval cld
 tokens_parallel(txt)  7.83327  8.651916 11.62100  9.208451 11.29673 30.89798    10  a 
          tokens(txt) 24.98809 25.313775 25.96159 25.797100 26.46475 27.82606    10   b

!!

koheiw commented 4 years ago

On Windows (RStudio):

                 expr      min       lq     mean   median       uq      max neval
 tokens_parallel(txt) 38.19173 39.43982 42.30942 41.41162 41.63359 55.98195    10
          tokens(txt) 87.46992 88.77185 89.83614 89.41530 91.63647 91.99573    10
koheiw commented 4 years ago

Now we can use future_lapply(), but the question is whether it is reliable enough to replace lapply(). It would be safer to allow users to fall back to lapply() via tokens("future" = FALSE) or quanteda_options("tokens_parallel" = FALSE).

kbenoit commented 4 years ago

Great idea to put this into quanteda_options(); then we don't have to change tokens() now, or in the future as parallelism evolves in R. And by turning it off by default, we are not breaking any CRAN test rules either.

koheiw commented 4 years ago

Like this?

if (quanteda_options("tokens_parallel")) {
    lapply_fun <- future.apply::future_lapply
} else {
    lapply_fun <- base::lapply
}
kbenoit commented 4 years ago

Exactly - or we could even have the value be "future" (for future_lapply()), "parallel" (for mclapply()), or "base" (for lapply()). But it doesn't matter as much if we do it via options, since we can note that it's experimental and subject to change.
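A rough sketch of that three-way dispatch, using the "tokens_lapply" option name that appears later in this thread (the exact names are illustrative):

# sketch only: dispatch the lapply implementation from a quanteda option
lapply_fun <- switch(
    quanteda_options("tokens_lapply"),
    future = future.apply::future_lapply,
    parallel = parallel::mclapply,
    base = base::lapply
)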

koheiw commented 4 years ago

RStudio becomes unstable, so it is better to run this from the console.

require(quanteda)
require(future.apply)
require(parallel)
quanteda_options(threads = 8)
options("mc.cores" = 4)

corp <- readRDS("/home/kohei/Dropbox/Brexit/Data/data_corpus_guardian.RDS")

cl <- makeCluster(10L, useXDR = FALSE)
plan(cluster, workers = cl)

system.time({
  quanteda_options("tokens_lapply" = "future")
  out <- tokens(corp, verbose = TRUE)
})

system.time({
  quanteda_options("tokens_lapply" = "base")
  out <- tokens(corp, verbose = TRUE)
})

stopCluster(cl)
contefranz commented 1 year ago

Hello! I was hoping to find this issue closed and implemented but it still seems to be open.

I work with large corpora only (either around 100k long docs, or tens of millions of sentences). Tokenization is still a huge bottleneck.

Is there any progress on parallelizing tokens()? Do you have any guidance on workarounds beyond what has been discussed above?

Thanks!

koheiw commented 1 year ago

We are prepared to make it parallel using future.apply::future_lapply() here:

https://github.com/quanteda/quanteda/blob/d898c8edf7ce392a6c032b95bce654f3b633d683/R/tokens.R#L283-L301

But I was reluctant to make changes because I was not sure how much faster tokens() would become as a result.

It would be great if you could create a branch and test this approach on your data.

My current workaround is to split the corpus by year, tokenize the pieces, and combine them. In this case, I run tokens() within future_lapply(), roughly as sketched below.
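A minimal sketch of that workaround, assuming the corpus has a "year" docvar (the variable name is illustrative):

require(quanteda)
require(future.apply)
plan(multisession)

# sketch only: each worker tokenizes one year's subset, then c() combines them
years <- sort(unique(corp$year))
toks_list <- future_lapply(years, function(y) tokens(corpus_subset(corp, year == y)))
toks <- do.call(c, toks_list)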

kbenoit commented 1 year ago

I think this stalled because we felt it needed further testing, both to ensure stability and to establish the conditions under which the efficiency gains are realised, as @koheiw points out. The packages this relies on were also undergoing development at the time.

Other possible workarounds beyond the one Kohei mentions: in v3 we also made it easier to use alternative tokenisers and then coerce the resulting named lists of characters to "tokens" class objects (which are serialised as integers mapped to a unique type table for efficiency), using as.tokens(). So any other tokeniser that you can parallelise would also work this way (see the sketch below).
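For example, a minimal sketch using stringi as a stand-in external tokeniser (any tokeniser that returns a named list of character vectors would do):

require(quanteda)
require(future.apply)
require(stringi)
plan(multisession)

# sketch only: tokenise chunks in parallel with an external tokeniser,
# then coerce the named list of character vectors with as.tokens()
txt <- c(doc1 = "A first short document.", doc2 = "And a second one.")
lis <- future_lapply(txt, function(x)
    unlist(stri_split_boundaries(x, type = "word", skip_word_none = TRUE)))
toks <- as.tokens(lis)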

We prefer the quanteda default tokeniser because it's smart (smarter than the ones in the tokenizers package, for instance) and allows full user control to override a very conservative set of defaults, and once we review and complete https://github.com/quanteda/quanteda/pull/2165 it will be even smarter. So agreed, it would be great to make this efficiently parallel by default or through an option in tokens().

contefranz commented 2 months ago

Just to build on your comments, I can confirm that using future_lapply() for tokenization and other operations, like corpus summarization, is much faster than the currently implemented functions. For instance, tokenizing a corpus of 8700 long documents (e.g., 100 pages each) takes about 4-5 minutes on an M1 iMac running with 4 cores. I'd say that's impressive.

I agree about the stability issues, though. I had some problems when allocating cores via future::multisession(); setting future.seed = TRUE solved the problem for me.

koheiw commented 2 months ago

Parallel tokenization is sometimes difficult because of the high memory usage. I routinely analyze tens of thousands of very long documents (hundreds of pages). I tokenize the corpus separately by year and save the results to disk. When I need documents from more than one year, I combine them using c(), which is very fast in v4.0. I think this is the best approach for very large datasets; a sketch of the workflow is below.
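In outline, assuming a "year" docvar (the file names are illustrative):

require(quanteda)

# sketch only: tokenise one year at a time and cache to disk to limit memory use
for (y in unique(corp$year)) {
    toks_y <- tokens(corpus_subset(corp, year == y))
    saveRDS(toks_y, sprintf("tokens_%s.rds", y))
}

# later, load and combine only the years needed; c() remaps token IDs quickly in v4.0
toks <- do.call(c, lapply(2015:2016, function(y) readRDS(sprintf("tokens_%s.rds", y))))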