veer66 / wordcutpy

A simple word breaker written in Python

Would you port the libthai python bindings to Python 3 please? #2

Closed: eugeneyan closed this issue 5 years ago

eugeneyan commented 7 years ago

Hi Vee,

Thank you very much for your repos for processing Thai. I understand you contributed the original Python bindings for libthai (the C library) that PyThai uses.

Would you consider porting them to Python 3 please?

Original issue here: https://github.com/hermanschaaf/pythai/issues/3#issuecomment-270076358

veer66 commented 7 years ago

Why do you prefer a C binding to native Python code?

eugeneyan commented 7 years ago

I had the impression that we needed the C bindings to use libthai with PyThai--is there an alternative approach?

veer66 commented 7 years ago

Can you use wordcutpy instead of PyThai?

eugeneyan commented 7 years ago

Thanks for the suggestion, Vee. We've tried wordcutpy, and it tokenizes Thai well.

Unfortunately, it's much slower than PyThai. For example, tokenizing 1.5 million individual Thai strings takes about 2.5 min with PyThai but 35 min with wordcutpy. Thus, PyThai would be more appropriate for production use. I would greatly appreciate it if you could port PyThai to Python 3--nonetheless, we understand if you do not have the time to do so.

Thank you!

eugeneyan commented 7 years ago

Here's the profile of the code for 10,000 rows--most of the time is spent in seek:

1394599    6.616    0.000    7.431    0.000 wordcutpy.py:19(seek)
 384201    0.054    0.000    0.054    0.000 wordcutpy.py:42(is_better)
  10000    2.145    0.000   10.707    0.001 wordcutpy.py:55(build_path)
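
For anyone who wants to collect a similar profile: the rows above appear to follow the layout of Python's built-in cProfile report (ncalls, tottime, percall, cumtime, percall, filename:lineno(function)). Below is a minimal sketch of how such a report can be produced. The `tokenize` function and the sample rows are hypothetical stand-ins, not wordcutpy's API; swap in the real tokenizer call and data.

```python
import cProfile
import io
import pstats


def tokenize(text):
    # Hypothetical stand-in for the real tokenizer call; it only splits on
    # whitespace so that the example runs without any third-party package.
    return text.split()


def profile_rows(rows, top=10):
    """Run tokenize() over every row under cProfile and return the report text."""
    profiler = cProfile.Profile()
    profiler.enable()
    for row in rows:
        tokenize(row)
    profiler.disable()

    # Collect the report in a string instead of printing straight to stdout.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by time spent inside each function itself and keep the top entries.
    stats.sort_stats("tottime").print_stats(top)
    return buffer.getvalue()


report = profile_rows(["a b c"] * 10000)
print(report)
```

Sorting by `tottime` puts the functions that burn the most time in their own body at the top, which is how a hot spot like seek stands out.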

veer66 commented 7 years ago

I see. You need a fast one. If wordcutpy could segment your text in 5 minutes, would that be acceptable?

eugeneyan commented 7 years ago

Yes. That would be great! I'm trying to speed up seek with Cython, but with no knowledge of C, I'm muddling around in the dark.

veer66 commented 7 years ago

Currently I use binary search to look up words in the word list. I will try replacing it with a trie, which is similar to what libthai uses.
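
To illustrate the idea, here is a minimal sketch of a dict-based trie that finds every dictionary word starting at a given position in one left-to-right walk. It is not the code from wordcutpy or libthai; the names `build_trie`, `word_ends`, and `END`, and the tiny word list, are assumptions made for the example.

```python
END = ""  # sentinel key marking "a word ends here"; never equal to a character


def build_trie(words):
    """Build a nested-dict trie: each key is one character, each value a child node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def word_ends(trie, text, start):
    """Return every end offset e such that text[start:e] is a dictionary word."""
    node = trie
    ends = []
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            break  # no dictionary word continues with this character
        if END in node:
            ends.append(i + 1)
    return ends


words = ["กา", "การ", "การบ้าน", "บ้าน"]
trie = build_trie(words)
text = "การบ้าน"

# Every dictionary word that starts at offset 0 of the text.
candidates = [text[0:end] for end in word_ends(trie, text, 0)]
print(candidates)
```

Each character costs one dict lookup, so the walk stops as soon as no word can match, instead of repeating a binary search over the sorted word list for each additional character.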

veer66 commented 7 years ago

As of https://github.com/veer66/wordcutpy/commit/3e412e1299412ce1672a46d78d559fb6396809fa, it should be 4-5x faster than before.

veer66 commented 5 years ago

With PyPy3, wordcutpy runs 2x faster than on standard Python (CPython).