Closed ankane closed 2 years ago
Hi, thanks for this great library!
When running the following script, MITIE tokenizes correctly, but the offsets it returns are off.
import mitie
print(mitie.tokenize_with_offsets(u'“hello”'))
Current Behavior
[(b'\xe2\x80\x9c', 0), (b'hello', 4463118537), (b'\xe2\x80\x9d', 4463118537)]
Expected Behavior
If offsets are measured in characters:
[(b'\xe2\x80\x9c', 0), (b'hello', 1), (b'\xe2\x80\x9d', 6)]
Or, if offsets are measured in bytes:
[(b'\xe2\x80\x9c', 0), (b'hello', 3), (b'\xe2\x80\x9d', 8)]
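For reference, here is a small sketch of how the two expected outputs above can be derived without MITIE. It is plain Python, and `expected_offsets` is a hypothetical helper written for this report, not part of the MITIE API. It assumes the tokens appear in the text in order and that byte offsets are counted in UTF-8.

```python
def expected_offsets(text, tokens, unit="chars"):
    """Locate each token in `text`, left to right, and return
    (utf8_encoded_token, offset) pairs.

    unit="chars" counts Unicode code points; unit="bytes" counts UTF-8 bytes.
    Hypothetical helper for illustration only, not a MITIE function.
    """
    result = []
    pos = 0  # character position to resume searching from
    for tok in tokens:
        start = text.index(tok, pos)  # raises ValueError if the token is missing
        if unit == "bytes":
            # Byte offset = length of the UTF-8 encoding of everything before the token.
            offset = len(text[:start].encode("utf-8"))
        else:
            offset = start
        result.append((tok.encode("utf-8"), offset))
        pos = start + len(tok)
    return result


text = "\u201chello\u201d"  # “hello”
tokens = ["\u201c", "hello", "\u201d"]

print(expected_offsets(text, tokens, unit="chars"))
print(expected_offsets(text, tokens, unit="bytes"))
```

The two results differ only because each curly quote is one character but three bytes in UTF-8.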
I'm seeing the same behavior with the C API as well.
pip install git+https://github.com/mit-nlp/MITIE.git