quanteda / spacyr

R wrapper to spaCy NLP
http://spacyr.quanteda.io

Using multithreading #206

Status: Closed (dhicks closed this 2 years ago)

dhicks commented 3 years ago

I'm playing around with spacyr as a potential replacement for cleanNLP. I've never really done anything directly with spaCy (i.e., in Python).

I'm not seeing a big performance difference between multithread = TRUE (43 sec on a test dataset of 2k journal abstracts) and multithread = FALSE (53 sec). Is there some additional configuration I need to do to take advantage of multithreading?

SeanFobbe commented 2 years ago

I've also had this problem. As far as I can tell, multithreading in spacyr_1.2.1 doesn't work: top does not show any additional processes being spawned, whether the setting is TRUE or FALSE.

I did succeed in building a parallelized workaround by setting multithread = FALSE and adding a doParallel/foreach framework on top (https://github.com/SeanFobbe/R-fobbe-proto-package/blob/main/f.dopar.spacyparse.R). The same approach with a future frontend/backend fails because of non-exportable objects. I'm not sure why this doesn't affect the doParallel approach.
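For readers who don't want to follow the link, here is a minimal sketch of that kind of workaround. It is not the code from the linked script, and it has not been benchmarked. The helpers split_into_chunks() and parse_in_parallel() are hypothetical names made up for illustration, and the model name "en_core_web_sm" and the worker count are assumptions. The only spacyr calls used are spacy_initialize(), spacy_parse() and spacy_finalize().

```r
# Sketch only: both helpers are hypothetical, not part of spacyr.

# Split a character vector into at most n_chunks contiguous pieces.
# Unnamed texts get names first, so doc_ids stay unique after recombining.
split_into_chunks <- function(texts, n_chunks) {
  stopifnot(is.character(texts), n_chunks >= 1)
  if (is.null(names(texts))) {
    names(texts) <- paste0("text", seq_along(texts))
  }
  n_chunks <- min(n_chunks, length(texts))
  groups <- ceiling(seq_along(texts) * n_chunks / length(texts))
  unname(split(texts, groups))
}

# Parse each chunk in its own R worker process with multithread = FALSE.
# Assumes spacyr, foreach and doParallel are installed, and that the
# spaCy model named in `model` is available to each worker.
parse_in_parallel <- function(texts, n_workers = 2, model = "en_core_web_sm") {
  `%dopar%` <- foreach::`%dopar%`
  chunks <- split_into_chunks(texts, n_workers)

  cl <- parallel::makeCluster(n_workers)
  on.exit(parallel::stopCluster(cl), add = TRUE)
  doParallel::registerDoParallel(cl)

  foreach::foreach(chunk = chunks, .combine = rbind) %dopar% {
    # Each worker is a separate R process, so it needs its own spaCy session.
    spacyr::spacy_initialize(model = model)
    parsed <- spacyr::spacy_parse(chunk, multithread = FALSE)
    spacyr::spacy_finalize()
    parsed
  }
}
```

Usage would be something like parsed <- parse_in_parallel(abstracts, n_workers = 4), where abstracts is a character vector. Every worker loads its own copy of the model, so RAM usage grows with the number of workers.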

My setup is Fedora 34, running on an AMD Ryzen 7 3700X, using spacyr_1.2.1. I'm happy to supply smaller and larger corpora to test this, but I believe this is a spacyr issue, not a data issue. A good testing corpus (not too large) might be this one (it is in German, though): https://doi.org/10.5281/zenodo.3902658

I'm fairly certain that this is related to #202, as multithread = TRUE drastically increases RAM usage without a detectable speed boost.

kbenoit commented 2 years ago

We're aware and working on it... Moving to #185.