Closed: jwijffels closed this issue 4 years ago
Thanks for the info, and apologies for the long delay in getting to this! I am working on a new major release of cleanNLP (v3.0.0, on the v3 branch) that cuts down a good amount of the difference. Playing with different document counts, the overhead is about 25% with 5 docs, but it decreases to 17% with 500 docs and 14% with 1500 docs. Much better, given all of the processing that needs to be done.
library(cleanNLP)
library(udpipe)

data("brussels_reviews", package = "udpipe")
z <- brussels_reviews$feedback

m <- udpipe_download_model(language = "english-ewt")
model <- udpipe_load_model(m$file_model)
cnlp_init_udpipe(model_name = "english")  # cleanNLP needs a backend initialised first

run_diff <- function(number) {
  val <- c(0, 0)

  start <- Sys.time()
  x <- cnlp_annotate(input = z[seq_len(number)], verbose = FALSE)
  val[1] <- Sys.time() - start  # difftime: the printed unit depends on the elapsed time

  start <- Sys.time()
  x <- udpipe(z[seq_len(number)], object = model)
  val[2] <- Sys.time() - start

  names(val) <- c("cleanNLP", "udpipe")
  print(val)
}
> run_diff(5)
cleanNLP udpipe
0.3611801 0.2811899
> run_diff(500)
cleanNLP udpipe
38.52144 32.81028
> run_diff(1500) # this difftime prints in its largest unit, i.e. minutes here
cleanNLP udpipe
1.627936 1.417492
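One caveat with the transcript above: `Sys.time() - start` yields a `difftime` whose unit depends on the elapsed time (seconds for the small runs, minutes for the 1500-doc run), so the raw numbers are not directly comparable across rows. A small helper (a sketch, not part of either package) forces seconds:

```r
# Hypothetical helper (not from cleanNLP or udpipe): time an expression
# and always report the result in seconds, whatever the magnitude.
elapsed_secs <- function(expr) {
  start <- Sys.time()
  force(expr)
  as.numeric(difftime(Sys.time(), start, units = "secs"))
}

elapsed_secs(Sys.sleep(0.2))  # always seconds, roughly 0.2 here
```

With this, `val[1]` and `val[2]` could be filled via `elapsed_secs(...)` and the 5/500/1500-doc runs would all print in the same unit.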
Thanks for picking this up. The remaining overhead probably comes from your do.call(rbind, ...) step. You might also be interested in https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-parallel.html
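For reference, the approach in that vignette is to pass `parallel.cores` to `udpipe()`; the manual equivalent splits the documents into chunks and binds the per-chunk results. The snippet below is a sketch under those assumptions; the `udpipe()` calls are commented out because they need the model loaded above, and `docs` is a synthetic stand-in for `z`:

```r
# Per the udpipe parallel vignette, udpipe() can parallelise directly:
#   anno <- udpipe(z, object = model, parallel.cores = 4)

# Manual equivalent: split the documents into chunks, annotate each chunk.
docs <- as.character(seq_len(100))  # stand-in for z so this runs anywhere
chunks <- split(docs, cut(seq_along(docs), 4, labels = FALSE))

# With base R's parallel package (not run here, needs the loaded model):
#   library(parallel)
#   parts <- mclapply(chunks, function(d) udpipe(d, object = model), mc.cores = 4)
#   anno  <- do.call(rbind, parts)

length(chunks)  # 4 chunks of 25 documents each
```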
Yes, I think that is where most of the remaining time is going. The parallel udpipe support looks like something nice to incorporate into the next version. Thanks for the link!
I normally use the udpipe package directly, but I wanted to check some other NLP packages. Calling udpipe directly is roughly 50% faster than going through the cleanNLP interface. Maybe something to look into if you have time.