reynoldsnlp / udar

UDAR Does Accented Russian: A finite-state morphological analyzer of Russian that handles stressed wordforms.
GNU General Public License v3.0

HFSTTokenizer chokes on input longer than 550(?) characters #49

Open reynoldsnlp opened 4 years ago

reynoldsnlp commented 4 years ago

The interactive shell (accessed using pexpect) appears to choke on input lines longer than about 550 characters (not really sure about this number). If a longer line is given, bell characters (ASCII code point 7, displayed as ^G in less) are written to the logfile, and pexpect hangs because it gets no output.

reynoldsnlp commented 4 years ago

Submitted issue to HFST about this: https://github.com/hfst/hfst/issues/483.

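The maximum buffer size appears to be 1024 bytes. That is roughly consistent with the ~550-character estimate, since Cyrillic letters take 2 bytes each in UTF-8. A workaround could check len(bytes(input_str, encoding='utf8')) < 1000 and, whenever that check fails, use a regular subprocess instead of the interactive shell to process that string. This check shouldn't be too expensive.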
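
A minimal sketch of the check described above, not the code from the actual commit. The names `fits_interactive_buffer` and `tokenize` and the two callables `interactive` and `one_shot` are hypothetical placeholders for the pexpect-based path and the regular-subprocess path; the 1000-byte limit is the safety margin suggested above, under the assumption that the buffer really is 1024 bytes.

```python
SAFE_BYTE_LIMIT = 1000  # margin below the apparent 1024-byte buffer (assumption)


def fits_interactive_buffer(input_str, limit=SAFE_BYTE_LIMIT):
    """Return True if the UTF-8 encoding of input_str is under `limit` bytes."""
    # Count bytes, not characters: Cyrillic letters are 2 bytes each in UTF-8.
    return len(input_str.encode('utf8')) < limit


def tokenize(input_str, interactive, one_shot, limit=SAFE_BYTE_LIMIT):
    """Send short input to the interactive path, long input to the fallback.

    `interactive` and `one_shot` are hypothetical callables standing in for
    the pexpect session and a regular subprocess call, respectively.
    """
    if fits_interactive_buffer(input_str, limit):
        return interactive(input_str)
    return one_shot(input_str)


if __name__ == '__main__':
    short = '\u0430' * 10    # 10 Cyrillic letters, 20 bytes
    long_ = '\u0430' * 550   # 550 Cyrillic letters, 1100 bytes
    print(fits_interactive_buffer(short))  # True
    print(fits_interactive_buffer(long_))  # False
```

Checking the encoded byte length rather than len(input_str) matters here, because a 550-character Russian string already exceeds 1024 bytes.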

reynoldsnlp commented 4 years ago

Workaround implemented in 765a2afb7d95d83b8bb179efe678fbd68e0d90fa.