[Closed] reynoldsnlp closed this issue 4 years ago
Wow, so with lots of short texts, the `hfst-tokenize` subprocess is taking almost 98% of the time. Not sure if that is just subprocess overhead, or whether `hfst-tokenize` is just that slow.
I wrote a new implementation using `pexpect`. I thought that it might be faster because it only opens the subprocess once, and then that same subprocess can be reused over and over, instead of starting a new subprocess for every call. It appears to be significantly faster, but still quite slow.
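The keep-one-subprocess-alive idea can be sketched with the standard library alone (so it runs without `pexpect` or HFST installed). Here `cat` is a hypothetical stand-in for the tokenizer command; a real wrapper around `hfst-tokenize` would also need to read multiple output lines per input line, since it emits one token per line.

```python
import subprocess

class PersistentTokenizer:
    """Keep one long-lived tokenizer subprocess open and reuse it.

    "cat" is a hypothetical stand-in so the sketch is runnable anywhere;
    in real use cmd would be something like ["hfst-tokenize", "tok.pmhfst"].
    """

    def __init__(self, cmd):
        # bufsize=1 with text=True gives line-buffered text-mode pipes
        self.proc = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                                     stdout=subprocess.PIPE,
                                     text=True, bufsize=1)

    def send_line(self, line):
        # one roundtrip: write a line, read one line of output back
        self.proc.stdin.write(line + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        self.proc.stdin.close()
        self.proc.wait()
```

The payoff is that the fork/exec cost is paid once at startup; every later call is just a pipe roundtrip.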
I will run some tests with `timeit` to be sure.
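A minimal `timeit` harness for that comparison might look like the following, again with `cat` as a hypothetical stand-in for `hfst-tokenize` so it runs anywhere:

```python
import subprocess
import timeit

def spawn_each_time(line="hello"):
    # pays process-startup (fork/exec) cost on every single call
    return subprocess.run(["cat"], input=line, capture_output=True,
                          text=True).stdout

# started once, up front, and then reused
proc = subprocess.Popen(["cat"], stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE, text=True, bufsize=1)

def reuse_one(line="hello"):
    # only a pipe roundtrip through the already-running process
    proc.stdin.write(line + "\n")
    proc.stdin.flush()
    return proc.stdout.readline().rstrip("\n")

t_spawn = timeit.timeit(spawn_each_time, number=20)
t_reuse = timeit.timeit(reuse_one, number=20)
```

On any machine, `t_reuse` should come out far below `t_spawn`, since process startup is milliseconds while a pipe roundtrip is microseconds.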
I discovered a bug in my `pexpect` implementation: it was opening a new instance of `pexpect` for every `Document`/`Sentence`, instead of reusing the same instance over and over. Tokenization is now much faster. ;)
See a8e2a4369e365cd404b8a762846d6a7e34e92a20.
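The fix boils down to caching the subprocess at module level so every `Document`/`Sentence` shares one instance. A minimal sketch of that pattern, assuming `cat` as a hypothetical stand-in command:

```python
import functools
import subprocess

@functools.lru_cache(maxsize=None)
def get_tokenizer(cmd=("cat",)):
    """Lazily start the tokenizer subprocess once and cache it, so every
    Document/Sentence reuses the same instance instead of spawning its own.
    ("cat" is a hypothetical stand-in for the real hfst-tokenize command.)"""
    return subprocess.Popen(list(cmd), stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True, bufsize=1)
```

Because `lru_cache` keys on the arguments, every call with the same command tuple returns the very same `Popen` object.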
Maybe start by comparing a) creating lots of little `Text`s and b) creating one massive `Text`.
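That comparison could be sketched roughly as below, with plain strings standing in for `Text` objects and `cat` as a hypothetical stand-in for the tokenizer, so only the per-call pipe overhead is measured:

```python
import subprocess
import timeit

SENTS = ["sentence %d" % i for i in range(100)]

proc = subprocess.Popen(["cat"], stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE, text=True, bufsize=1)

def roundtrip(lines):
    # write the whole batch, then read one output line per input line
    proc.stdin.write("\n".join(lines) + "\n")
    proc.stdin.flush()
    return [proc.stdout.readline().rstrip("\n") for _ in lines]

def many_little():
    # a) lots of little Texts: one flush/read cycle per sentence
    return [roundtrip([s])[0] for s in SENTS]

def one_massive():
    # b) one massive Text: a single flush covering all 100 sentences
    return roundtrip(SENTS)

t_little = timeit.timeit(many_little, number=5)
t_massive = timeit.timeit(one_massive, number=5)
```

Comparing `t_little` against `t_massive` would show how much of the remaining cost is per-call overhead versus actual tokenization work.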