Closed: eugeneyan closed this issue 5 years ago
Why do you prefer a C binding over native Python code?
I had the impression that we needed the C bindings to use libthai with PyThai--is there an alternative approach?
Can you use wordcutpy instead of PyThai?
Thanks for the suggestion, Vee. We've tried wordcutpy and it works well for tokenizing Thai.
Unfortunately, it's much slower than PyThai. For example, tokenizing 1.5 million individual Thai strings takes about 2.5 min with PyThai but 35 min with wordcutpy, so PyThai is more appropriate for production use. I would greatly appreciate it if you could port PyThai to Python 3--though we understand if you do not have the time to do so.
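For reference, a rough way to reproduce this kind of timing comparison is a simple wall-clock loop over the corpus. This is a minimal sketch, not code from either library; `tokenize` stands in for whichever tokenizer is under test:

```python
import time

def benchmark(tokenize, strings):
    """Time a tokenizer callable over a list of strings.

    Returns elapsed wall-clock seconds for tokenizing every string once.
    """
    start = time.perf_counter()
    for s in strings:
        tokenize(s)
    return time.perf_counter() - start

# Example: compare two tokenizers on the same corpus
# elapsed_a = benchmark(pythai.split, corpus)
# elapsed_b = benchmark(wordcut.tokenize, corpus)
```

Using `time.perf_counter()` avoids the resolution and monotonicity issues of `time.time()` for short measurements.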
Thank you!
Here's the profile of the code for 10,000 rows--most of the time is spent in seek:

 ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
1394599    6.616    0.000    7.431    0.000  wordcutpy.py:19(seek)
 384201    0.054    0.000    0.054    0.000  wordcutpy.py:42(is_better)
  10000    2.145    0.000   10.707    0.001  wordcutpy.py:55(build_path)
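The columns above match the output of Python's built-in cProfile module. A minimal sketch of how such a profile can be collected for any callable (the helper name `profile_calls` is mine, not from wordcutpy):

```python
import cProfile
import io
import pstats

def profile_calls(func, *args):
    """Run func(*args) under cProfile and return (result, report).

    The report is the top 10 entries sorted by total time (tottime),
    the same view shown in the numbers quoted above.
    """
    pr = cProfile.Profile()
    pr.enable()
    result = func(*args)
    pr.disable()
    buf = io.StringIO()
    pstats.Stats(pr, stream=buf).sort_stats("tottime").print_stats(10)
    return result, buf.getvalue()
```

Sorting by `tottime` (time spent inside the function itself, excluding callees) is what singles out a hot inner function like seek.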
I see. You need a fast one. If wordcutpy can segment your text in 5 minutes, will it be acceptable?
Yes. That would be great! I'm trying to speed up seek with Cython, but with no knowledge of C, I'm muddling around in the dark.
Currently I use binary search to look up words in the word list. I will try to replace it with a trie, similar to what libthai uses.
With https://github.com/veer66/wordcutpy/commit/3e412e1299412ce1672a46d78d559fb6396809fa, it should be 4-5x faster than before.
Under PyPy3, wordcutpy runs 2x faster than on CPython.
Hi Vee,
Thank you very much for your repos for processing Thai. I understand you contributed the original C bindings for libthai that PyThai uses.
Would you consider porting them to Python 3, please?
Original issue here: https://github.com/hermanschaaf/pythai/issues/3#issuecomment-270076358