mit-nlp / MITIE

MITIE: library and tools for information extraction

Bad offsets for tokenize_with_offsets with UTF-8 #211

Closed by ankane 2 years ago

ankane commented 4 years ago

Hi, thanks for this great library!

When I run the following script, MITIE tokenizes the text correctly, but the offsets it returns after the first token are wrong.

import mitie

print(mitie.tokenize_with_offsets(u'“hello”'))

Current Behavior

[(b'\xe2\x80\x9c', 0), (b'hello', 4463118537), (b'\xe2\x80\x9d', 4463118537)]

Expected Behavior

If offsets are measured in characters:

[(b'\xe2\x80\x9c', 0), (b'hello', 1), (b'\xe2\x80\x9d', 6)]

Or, if offsets are measured in bytes:

[(b'\xe2\x80\x9c', 0), (b'hello', 3), (b'\xe2\x80\x9d', 8)]

I'm seeing the same behavior with the C API as well.
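
For reference, both sets of expected offsets above can be reproduced without MITIE. This is only a minimal sketch: `expected_offsets` is a hypothetical helper written for this report, not part of the MITIE API, and it assumes the tokens appear in order in the input text.

```python
def expected_offsets(text, tokens):
    """Return (char_offsets, byte_offsets) for tokens found left to right in text.

    Hypothetical helper for illustration; not part of MITIE.
    """
    char_offsets = []
    byte_offsets = []
    pos = 0  # character position to resume searching from
    for token in tokens:
        start = text.index(token, pos)  # raises ValueError if the token is missing
        char_offsets.append(start)
        # Byte offset = length of the UTF-8 encoding of everything before the token.
        byte_offsets.append(len(text[:start].encode("utf-8")))
        pos = start + len(token)
    return char_offsets, byte_offsets


text = "\u201chello\u201d"  # the same string as in the script above
tokens = ["\u201c", "hello", "\u201d"]
chars, byte_positions = expected_offsets(text, tokens)
print(chars)           # [0, 1, 6]
print(byte_positions)  # [0, 3, 8]
```

The two results differ because each curly quote is one character but three bytes in UTF-8, so every token after the opening quote is shifted by two in the byte-based version.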