rth / vtext

Simple NLP in Rust with Python bindings
Apache License 2.0

Punctuation sentence tokenizer #70

Closed: joshlk closed this 4 years ago

joshlk commented 4 years ago

Added a punctuation sentence tokenizer. It's not possible to use Regex for this, so I implemented an iterator instead. I will look into adding the Python bindings.

rth commented 4 years ago

Thanks @joshlk !

It's not possible to use Regex

Could you elaborate on that? For instance in Python,

>>> re.split('(?<=[!.?])', "Here is one. Here is another? Bang!! This trailing text is one more")
['Here is one.', ' Here is another?', ' Bang!', '!', ' This trailing text is one more']

I agree it's a bit awkward with the look-behind pattern (and the spacing is not in the same place), but it should work. It could even have been possible to handle runs of repeated punctuation, but:

>>> re.split('(?<=[!.?]+)', "Here is one. Here is another? Bang!! This trailing text is one more")
[...]
    raise error("look-behind requires fixed-width pattern")
re.error: look-behind requires fixed-width pattern

I wonder if the Rust regex crate, being more advanced, would accept it.
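
For reference, one workaround with the standard library keeps the look-behind fixed-width and adds a look-ahead, so a punctuation run is only split once it ends (Python 3.7+, which allows splitting on zero-width matches):

>>> re.split(r'(?<=[!.?])(?=[^!.?])', "Here is one. Here is another? Bang!! This trailing text is one more")
['Here is one.', ' Here is another?', ' Bang!!', ' This trailing text is one more']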

joshlk commented 4 years ago

I've tried various techniques with the Rust regex crate and can't seem to get anything that works. Have a go.

rth commented 4 years ago

Ah right, the regex crate doesn't support look-around.

We could consider adding https://github.com/fancy-regex/fancy-regex or https://github.com/rust-onig/rust-onig as a dependency, though I have mixed feelings between that and an implementation from scratch as done in this PR. I'll try to review in more detail within a few days.
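
For reference, the from-scratch approach boils down to something like the following (a Python sketch of the iterator idea only; the PR implements it in Rust, and the real punctuation set is much larger):

def punctuation_split(text, punct=frozenset("!.?")):
    """Split after runs of sentence-final punctuation, keeping it attached."""
    sentences = []
    start = 0
    in_punct = False
    for i, c in enumerate(text):
        if c in punct:
            in_punct = True
        elif in_punct:
            # First non-punctuation character after a punctuation run:
            # close the current sentence here.
            sentences.append(text[start:i])
            start = i
            in_punct = False
    if start < len(text):
        sentences.append(text[start:])
    return sentences

# punctuation_split("Here is one. Here is another? Bang!! One more")
# -> ['Here is one.', ' Here is another?', ' Bang!!', ' One more']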

Have a go.

Nice, I wasn't aware of the Rust playground.

joshlk commented 4 years ago

We could consider adding https://github.com/fancy-regex/fancy-regex or https://github.com/rust-onig/rust-onig as a dependency, though I have mixed feelings between that and an implementation from scratch as done in this PR.

Yes, I am unsure which is better as well. The iterator isn't hugely complicated, but it would be nice if you could review it when you get a chance. I have added a large number of tests covering a variety of Unicode languages (see src/tokenize_sentence/tests.rs, line 267). It would be possible to experiment with other regex solutions and test them against these examples.

Interestingly, when you compare the output of PunctuationTokenizer and UnicodeSentenceTokenizer across the variety of Unicode languages, there are some differences. Just from eyeballing the output (without any understanding of the languages), both implementations have different pitfalls. Here are some random examples I noticed:

| Language | Text | UnicodeSentenceTokenizer | PunctuationTokenizer |
|----------|------|--------------------------|----------------------|
| Bengali | ভাষা।[৫] | "ভাষা।[", "৫]" | "ভাষা।", "[৫]" |
| Hindi | गया था।[4][5] | "गया था।[", "4][5]" | "गया था।", "[4][5]" |
| Tamil | வந்தது.[15] | "வந்தது.[", "15]" | "வந்தது.", "[15]" |
| German | Deutsch (abgekürzt dt.) ist eine. | "Deutsch (abgekürzt dt.) ist eine." | "Deutsch (abgekürzt dt.", ") ist eine." |

Rust's Unicode support just blows me away. It's far better than in other programming languages I have worked with.

One issue I had was how to import a macro across different files, i.e. I have the vecString macro copied into 3 files: src/tokenize_sentence/mod.rs, src/tokenize_sentence/tests.rs and python/src/tokenize_sentence.rs. Do you know how I can import the macro from one file into another? I tried the obvious answers from the Rust examples but couldn't get any of them to work.

rth commented 4 years ago

Interestingly, when you compare the output of PunctuationTokenizer and UnicodeSentenceTokenizer against the variety of unicode languages there are some differences.

Interesting. The sentence splitting on [ in UnicodeSentenceTokenizer looks unfortunate. Testing different languages is very useful, but in the end I imagine it's mostly to ensure that we don't break multi-byte Unicode characters in the middle? The punctuation itself used to tokenize would be the same anyway as far as I understand.

Rusts Unicode support just blows me away. It's far better than other programming languages that I have worked with.

It is! Avoiding dealing with Unicode in Cython or C was one of my main motivations for starting this repo.

Do you know how I can import the macro from one file to another?

I haven't used macros extensively, but there might be something about needing to use them from the top level of the crate: https://github.com/rth/vtext/pull/70#discussion_r438036935

rth commented 4 years ago

Also, could you please add the sentence tokenizers to https://github.com/rth/vtext/blob/master/doc/python-api.rst?

rth commented 4 years ago

I find it quite impressive that the unicode-segmentation crate doesn't necessarily break on punctuation,

>>> from vtext.tokenize_sentence import UnicodeSentenceTokenizer
>>> from vtext.tokenize_sentence import PunctuationTokenizer
>>> s = "Please confirm R.S.V.P. by tomorrow. Thanks!!"  
>>> UnicodeSentenceTokenizer().tokenize(s)
['Please confirm R.S.V.P. by tomorrow. ', 'Thanks!!']
>>> PunctuationTokenizer().tokenize(s)
['Please confirm R.', 'S.', 'V.', 'P. ', 'by tomorrow. ', 'Thanks!']

BTW, there is a bug in PunctuationTokenizer in the last case: one of the two ! characters in "Thanks!!" is dropped. It might be helpful to create random strings with hypothesis in python/vtext/tests/test_tokenize.py containing punctuation, whitespace and a few letters, and check that the length of the string is unchanged after tokenization + concatenation.
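
Something like this sketch of that property test (the test name and alphabet are illustrative, not existing code):

from hypothesis import given, strategies as st

from vtext.tokenize_sentence import PunctuationTokenizer

@given(st.text(alphabet="!.? abXY", max_size=50))
def test_punctuation_tokenizer_preserves_length(s):
    tokens = PunctuationTokenizer().tokenize(s)
    # No characters should be dropped or duplicated: concatenating the
    # sentences back must restore the original length (this catches the
    # missing "!" above).
    assert len("".join(tokens)) == len(s)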

joshlk commented 4 years ago

Cheers for the feedback 😃. I will review today.

joshlk commented 4 years ago

The punctuation itself used to tokenize would be the same anyway as far as I understand.

The punctuation set used for the other languages is (from spaCy):

    let punct = vecString![
        '!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', '।', '॥', '၊', '။', '።', '፧', '፨', '᙮',
        '᜵', '᜶', '᠃', '᠉', '᥄', '᥅', '᪨', '᪩', '᪪', '᪫', '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿',
        '‼', '‽', '⁇', '⁈', '⁉', '⸮', '⸼', '꓿', '꘎', '꘏', '꛳', '꛷', '꡶', '꡷', '꣎', '꣏', '꤯', '꧈',
        '꧉', '꩝', '꩞', '꩟', '꫰', '꫱', '꯫', '﹒', '﹖', '﹗', '!', '.', '?', '𐩖', '𐩗', '𑁇', '𑁈',
        '𑂾', '𑂿', '𑃀', '𑃁', '𑅁', '𑅂', '𑅃', '𑇅', '𑇆', '𑇍', '𑇞', '𑇟', '𑈸', '𑈹', '𑈻', '𑈼', '𑊩', '𑑋',
        '𑑌', '𑗂', '𑗃', '𑗉', '𑗊', '𑗋', '𑗌', '𑗍', '𑗎', '𑗏', '𑗐', '𑗑', '𑗒', '𑗓', '𑗔', '𑗕', '𑗖', '𑗗',
        '𑙁', '𑙂', '𑜼', '𑜽', '𑜾', '𑩂', '𑩃', '𑪛', '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟',
        '𝪈', '。', '。'
    ];

I will add this to the documentation and direct users to the examples in the tests.