statsmaths / cleanNLP

R package providing annotators and a normalized data model for natural language processing
GNU Lesser General Public License v2.1

speed #45

Closed: jwijffels closed this issue 4 years ago

jwijffels commented 5 years ago

I normally use the udpipe package directly, but I wanted to check out some other NLP packages. Using udpipe directly gives roughly a 50% speedup compared to going through the cleanNLP interface. Maybe something to look into if you have time.

library(cleanNLP)
library(udpipe)
data("brussels_reviews", package = "udpipe")

m <- udpipe_download_model(language = "english-ewt")
cnlp_init_udpipe(model_path = m$file_model)

Sys.time()
[1] "2018-12-27 14:02:30 CET"
x <- cnlp_annotate(input = brussels_reviews$feedback)
Sys.time()
[1] "2018-12-27 14:14:02 CET"

model <- udpipe_load_model(m$file_model)
Sys.time()
[1] "2018-12-27 14:15:18 CET"
x <- udpipe(brussels_reviews$feedback, object = model)
Sys.time()
[1] "2018-12-27 14:23:45 CET"
statsmaths commented 4 years ago

Thanks for the info and apologies for the long delay in getting to this! I am working on a new major release of cleanNLP (v3.0.0; on the v3 branch) that seems to close a good amount of the gap. Playing with different numbers of documents, the overhead is about 25% with 5 docs but decreases to 17% (500 docs) and 14% (1500 docs). Much better, given all of the extra processing that needs to be done.

library(cleanNLP)
library(udpipe)
data("brussels_reviews", package = "udpipe")
z <- brussels_reviews$feedback
m <- udpipe_download_model(language = "english-ewt")
model <- udpipe_load_model(m$file_model)

run_diff <- function(number) {
  val <- c(0, 0)

  start <- Sys.time()
  x <- cnlp_annotate(input = z[seq_len(number)], verbose=FALSE)
  val[1] <- Sys.time() - start
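  # note: subtracting POSIXct times yields a difftime whose unit is chosen automatically,
  # which is why the 1500-document run below reports minutes rather than seconds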

  start <- Sys.time()
  x <- udpipe(z[seq_len(number)], object = model)
  val[2] <- Sys.time() - start

  names(val) <- c("cleanNLP", "udpipe")
  print(val)
}

> run_diff(5)
 cleanNLP    udpipe
0.3611801 0.2811899
> run_diff(500)
cleanNLP   udpipe
38.52144 32.81028
> run_diff(1500)  # the difftime result picks the largest convenient unit, so these values are in minutes
cleanNLP   udpipe
1.627936 1.417492
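
As a side note on that unit switch: a minimal sketch of the same benchmark pinned to seconds, using difftime() with an explicit unit. It assumes the same z and model objects as above; run_diff_secs is just an illustrative name.

run_diff_secs <- function(number) {
  t1 <- Sys.time()
  x <- cnlp_annotate(input = z[seq_len(number)], verbose = FALSE)
  t2 <- Sys.time()
  x <- udpipe(z[seq_len(number)], object = model)
  t3 <- Sys.time()

  # difftime() with an explicit unit avoids the automatic switch to minutes
  c(cleanNLP = as.numeric(difftime(t2, t1, units = "secs")),
    udpipe   = as.numeric(difftime(t3, t2, units = "secs")))
}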
jwijffels commented 4 years ago

Thanks for picking this up. The remaining overhead probably comes from your do.call(rbind, ...). You might also be interested in https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-parallel.html
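
For reference, a minimal sketch of the parallel route from that vignette, assuming the parallel.cores argument it describes and the model / z objects from the snippets above:

library(udpipe)
# annotate the reviews across two cores instead of one
x <- udpipe(z, object = model, parallel.cores = 2L)

On the binding side, data.table::rbindlist() is usually a much faster drop-in for do.call(rbind, ...) when combining many per-document data frames.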

statsmaths commented 4 years ago

Yes, I think that is where most of the remaining time is coming from. The parallel udpipe looks like something nice to incorporate into the next version. Thanks for the link!