reynoldsnlp / udar

UDAR Does Accented Russian: A finite-state morphological analyzer of Russian that handles stressed wordforms.
GNU General Public License v3.0

profile analysis to find bottlenecks #28

Closed reynoldsnlp closed 4 years ago

reynoldsnlp commented 4 years ago

Maybe start by comparing a) creating lots of little Texts and b) creating one massive Text.
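A rough way to set up that comparison is with the standard-library `cProfile`. This is only a sketch: `analyze` is a hypothetical stand-in for building a udar Text (it is not the library's API), and `profile`, `many_small`, and `one_large` are names invented for the example. Swap the real constructor into `analyze` to get meaningful numbers.

```python
import cProfile
import io
import pstats


def analyze(text):
    """Hypothetical stand-in for building a udar Text; replace with the real call."""
    return text.split()


def profile(func, *args, top=5):
    """Run func under cProfile; return its result and a report of the
    `top` entries sorted by cumulative time."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def many_small(sentences):
    # (a) one object per short text
    return [analyze(s) for s in sentences]


def one_large(sentences):
    # (b) a single object for everything at once
    return analyze(" ".join(sentences))


sentences = ["Мы говорили об этом."] * 1000
small_result, small_report = profile(many_small, sentences)
large_result, large_report = profile(one_large, sentences)
print(small_report)
print(large_report)
```

Sorting by cumulative time puts the calls that dominate the run at the top of each report, which makes it easy to see whether per-object setup cost is what separates (a) from (b).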

reynoldsnlp commented 4 years ago

Wow, so with lots of short texts, the hfst-tokenize subprocess is taking almost 98% of the time. Not sure if that is just subprocess overhead, or whether hfst-tokenize is just that slow.
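One way to tell the two apart is to time a child process that does nothing, which gives a rough floor for what each spawn costs on this machine. The sketch below uses the Python interpreter as the child only because it is guaranteed to be available; its start-up cost is not the same as that of hfst-tokenize, so treat the number as a ballpark and substitute the real command to compare.

```python
import subprocess
import sys
import timeit


def spawn_noop():
    """Start a child process that does nothing and wait for it to exit."""
    subprocess.run([sys.executable, "-c", "pass"], check=True)


runs = 5
per_call = timeit.timeit(spawn_noop, number=runs) / runs
print(f"bare subprocess start-up: {per_call * 1000:.1f} ms per call")
```

If the per-call time of the real tokenizer on a tiny input is close to this floor, the cost is mostly process start-up; if it is much larger, the time is going into the tool itself (for example, loading its data on every start).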


reynoldsnlp commented 4 years ago

I wrote a new implementation using pexpect. I thought it might be faster because it opens the subprocess only once and then reuses it, instead of starting a new subprocess for every call. It appears to be significantly faster, but still quite slow.

I will run some tests with timeit to be sure.
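A sketch of what such a timeit comparison could look like. To keep it runnable anywhere, it swaps in the standard-library `subprocess.Popen` for pexpect and a tiny Python echo script for hfst-tokenize; `CHILD`, `one_shot`, and `Persistent` are hypothetical names, and none of this is the code from the repository.

```python
import subprocess
import sys
import timeit

# Stand-in child: echoes each input line upper-cased and flushes immediately.
CHILD = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line.upper())\n"
    "    sys.stdout.flush()\n"
)
CMD = [sys.executable, "-c", CHILD]


def one_shot(line):
    """Start a fresh child for every call."""
    done = subprocess.run(
        CMD, input=line + "\n", capture_output=True, text=True, check=True
    )
    return done.stdout.rstrip("\n")


class Persistent:
    """Start the child once and reuse it for every call."""

    def __init__(self):
        self.proc = subprocess.Popen(
            CMD, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            text=True, bufsize=1,
        )

    def __call__(self, line):
        self.proc.stdin.write(line + "\n")
        self.proc.stdin.flush()
        # One line in, one line out: works only because the child flushes.
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        self.proc.stdin.close()
        self.proc.wait()


persistent = Persistent()
n = 10
t_one_shot = timeit.timeit(lambda: one_shot("hello"), number=n)
t_persistent = timeit.timeit(lambda: persistent("hello"), number=n)
persistent.close()
print(f"one-shot:   {t_one_shot / n * 1000:.2f} ms per call")
print(f"persistent: {t_persistent / n * 1000:.2f} ms per call")
```

The persistent variant depends on the child answering one line per input line and flushing its output; a tool that buffers its output would make `readline()` block, which is the kind of problem pexpect's pseudo-terminal is meant to sidestep.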

reynoldsnlp commented 4 years ago

I discovered a bug in my pexpect implementation: it was opening a new pexpect instance for every Document/Sentence instead of reusing the same one. Tokenization is now much faster. ;)

See a8e2a4369e365cd404b8a762846d6a7e34e92a20.
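The commit is not reproduced here, but the general pattern behind this kind of fix can be sketched as follows. All names (`Tokenizer`, `get_tokenizer`, this `Sentence`) are hypothetical and chosen for illustration; the idea is simply to create the expensive object lazily once and hand the same instance to every caller.

```python
import functools


class Tokenizer:
    """Hypothetical stand-in for an expensive-to-start tokenizer process."""

    instances_started = 0

    def __init__(self):
        # In a real wrapper, this is where the subprocess would be spawned.
        Tokenizer.instances_started += 1

    def tokenize(self, text):
        return text.split()


@functools.lru_cache(maxsize=None)
def get_tokenizer():
    """Create the tokenizer on first use, then return the same instance."""
    return Tokenizer()


class Sentence:
    def __init__(self, text):
        # The buggy pattern would be Tokenizer().tokenize(text),
        # which starts a new tokenizer for every Sentence.
        self.tokens = get_tokenizer().tokenize(text)


sentences = [Sentence("a b c") for _ in range(100)]
print(Tokenizer.instances_started)  # prints 1
```

Counting instances like this is also a cheap regression check: if the counter ever exceeds one after building many sentences, the reuse has been broken again.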