Closed dhicks closed 2 years ago
I've also had this problem. As far as I can tell, multithreading in spacyr_1.2.1 doesn't work: top does not show any additional processes being spawned, whether the setting is TRUE or FALSE.
I did succeed in building a parallelized workaround by setting multithread = FALSE and adding a doParallel/foreach framework on top: https://github.com/SeanFobbe/R-fobbe-proto-package/blob/main/f.dopar.spacyparse.R The same approach with a future front/backend fails because of non-exportable objects. Not sure why this doesn't affect the doParallel approach.
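For readers who don't want to open the link, here is a minimal sketch of the general idea (split the texts into chunks, give each doParallel worker its own spaCy session, and parse with multithread = FALSE). It is not the linked script. The helper name parse_parallel, the round-robin chunking, and the model name en_core_web_sm are assumptions for illustration only.

```r
library(spacyr)
library(doParallel)  # also attaches foreach and parallel

# Hypothetical helper: parse a named character vector on several workers.
parse_parallel <- function(texts, n_workers = 2, model = "en_core_web_sm") {
  # Distribute documents round-robin into one chunk per worker.
  # split() keeps the names, which spacy_parse() uses as doc_id.
  chunks <- split(texts, rep_len(seq_len(n_workers), length(texts)))

  cl <- parallel::makeCluster(n_workers)
  on.exit(parallel::stopCluster(cl), add = TRUE)
  doParallel::registerDoParallel(cl)

  # Each worker starts its own spaCy session, parses its chunk without
  # spacyr's built-in multithreading, and returns a data frame.
  foreach(chunk = chunks, .combine = rbind, .packages = "spacyr") %dopar% {
    spacy_initialize(model = model)  # model name is an assumption
    parsed <- spacy_parse(chunk, multithread = FALSE)
    spacy_finalize()
    parsed
  }
}

# Usage (requires a working spaCy installation and the chosen model):
# txt <- c(doc1 = "This is a sentence.", doc2 = "Here is another one.")
# parsed <- parse_parallel(txt, n_workers = 2)
```

Rows come back grouped by chunk rather than in input order, so sort on doc_id afterwards if the order matters. Each worker loads its own copy of the language model, so RAM usage grows with n_workers.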
My setup is Fedora 34, running on an AMD Ryzen 7 3700X, using spacyr_1.2.1. I'm happy to supply smaller and larger corpora to test this, but I believe this is a spacyr issue, not a data issue. A good testing corpus (not too large) might be this one (it is in German, though): https://doi.org/10.5281/zenodo.3902658
Fairly certain that this is related to #202, as multithread = TRUE drastically increases RAM usage without a detectable speed boost.
We're aware and working on it... Moving to #185.
I'm playing around with spacyr as a potential replacement for cleanNLP. I've never really done anything directly with spaCy (i.e., in Python).
I'm not seeing a big performance difference between multithread = TRUE (43 sec on a test dataset of 2k journal abstracts) and multithread = FALSE (53 sec). Is there some additional configuration I need to do to take advantage of multithreading?