Sentence tokenization using Unicode segmentation (Python package)

joshlk commented 4 years ago

First attempt at including the UnicodeSentenceTokenizer in the Python package. I have two issues that I am unsure how to resolve:

Compiling the lib with `python3 setup.py develop` runs without errors, but when I then try to import the package in Python with `import _lib`, I get the following fatal error:

SystemError: Type does not define the tp_name field.
thread '<unnamed>' panicked at 'An error occurred while initializing class UnicodeSentenceTokenizer', ...
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5

In `python/src/tokenize_sentence.rs` I have resorted to creating a second base class, `BaseTokenize2` (see line 14), as I am unsure how to import `BaseTokenize` from `python/src/tokenize.rs`.

rth commented 4 years ago

Could you please merge master in and update the wrapper to use PyO3 0.10, similar to what was done for the tokenizers in https://github.com/rth/vtext/pull/69? Maybe that would help.

joshlk commented 4 years ago

Updating to PyO3 0.10 seems to have done the trick. Importing BaseTokenize now works as well (not sure if that was due to PyO3 0.10 or something else).
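
For reference, on the plain Rust side, importing a type from a sibling file usually comes down to a `use crate::...` path. This assumes both files are declared as modules in the crate root. The sketch below is not the code from this PR: inline modules stand in for the two source files, the PyO3 attributes are left out, and the `name` field and both `new` constructors are hypothetical.

```rust
// Stand-in for python/src/tokenize.rs (contents are hypothetical).
mod tokenize {
    /// Simplified base type; the real one would carry PyO3 attributes.
    pub struct BaseTokenize {
        pub name: String,
    }

    impl BaseTokenize {
        pub fn new(name: &str) -> Self {
            BaseTokenize {
                name: name.to_string(),
            }
        }
    }
}

// Stand-in for python/src/tokenize_sentence.rs.
mod tokenize_sentence {
    // Reach the sibling module through the crate root instead of
    // redefining a second base type locally.
    use crate::tokenize::BaseTokenize;

    pub struct UnicodeSentenceTokenizer {
        pub base: BaseTokenize,
    }

    impl UnicodeSentenceTokenizer {
        pub fn new() -> Self {
            UnicodeSentenceTokenizer {
                base: BaseTokenize::new("unicode-sentence"),
            }
        }
    }
}
```

With separate files, the two inline `mod name { ... }` blocks would become `mod tokenize;` and `mod tokenize_sentence;` declarations in the crate root, and the `use` line would stay the same.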

rth commented 4 years ago

Thanks @joshlk!

Sentence tokenization using Unicode segmentation (Python package) #67