Closed by joshlk 4 years ago
Thanks @joshlk !
It's not possible to use Regex
Could you elaborate on that? For instance in Python,
>>> re.split('(?<=[!.?])', "Here is one. Here is another? Bang!! This trailing text is one more")
['Here is one.', ' Here is another?', ' Bang!', '!', ' This trailing text is one more']
I agree it's a bit awkward with the look-behind pattern (and the spacing is not in the same place), but it should work. It could even have been possible to handle multiple repeated punctuation marks, but:
>>> re.split('(?<=[!.?]+)', "Here is one. Here is another? Bang!! This trailing text is one more")
[...]
raise error("look-behind requires fixed-width pattern")
re.error: look-behind requires fixed-width pattern
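One way around the fixed-width look-behind restriction is to match the sentences directly with re.findall instead of splitting. The pattern below is just an illustrative sketch for this three-punctuation example, not what vtext uses:

```python
import re

text = "Here is one. Here is another? Bang!! This trailing text is one more"

# Match either a (possibly empty) run of non-punctuation followed by a
# run of sentence punctuation, or a trailing chunk with no punctuation.
sentences = re.findall(r'[^!.?]*[!.?]+|[^!.?]+', text)
print(sentences)
# ['Here is one.', ' Here is another?', ' Bang!!', ' This trailing text is one more']
```

This keeps repeated punctuation such as `!!` attached to its sentence, and concatenating the pieces reproduces the input exactly.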
I wonder if the Rust regex crate, being more advanced, would accept it.
I've tried various techniques with the Rust regex crate and can't seem to get something that works. Have a go.
Ah right, the regex crate doesn't support look-around.
We could consider adding https://github.com/fancy-regex/fancy-regex or https://github.com/rust-onig/rust-onig as a dependency, though I have mixed feelings between that and doing an implementation from scratch as done in this PR. I'll try to review in more detail within a few days.
Have a go.
Nice, I wasn't aware of the Rust playground.
We could consider adding https://github.com/fancy-regex/fancy-regex or https://github.com/rust-onig/rust-onig as a dependency, though I have mixed feeling between that and doing an implementation from scratch as done in this PR.
Yes - I am unsure which is better as well. The iterator isn't hugely complicated, but it would be nice if you could review it when you get a chance. I have added a large number of tests covering a variety of Unicode languages; see src/tokenize_sentence/tests.rs line 267. It would be possible to experiment with other regex solutions and test them against these examples.
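For reference, the splitting rule the iterator implements can be sketched in a few lines of Python. This is a simplified stand-in (the punctuation set is truncated and the function name is hypothetical), not the actual Rust implementation:

```python
# Sketch of a punctuation sentence splitter: a sentence ends after a run
# of terminal punctuation plus any whitespace that follows it.
# SENT_PUNCT is truncated here; the real set (from spaCy) is much larger.
SENT_PUNCT = {'!', '.', '?', '।', '。'}

def punctuation_tokenize(text):
    start = 0
    seen_punct = False
    for i, ch in enumerate(text):
        if ch in SENT_PUNCT:
            seen_punct = True
        elif seen_punct and not ch.isspace():
            # First non-space character after a punctuation run: new sentence.
            yield text[start:i]
            start = i
            seen_punct = False
    if start < len(text):
        yield text[start:]

print(list(punctuation_tokenize("Here is one. Here is another? Bang!! End")))
# ['Here is one. ', 'Here is another? ', 'Bang!! ', 'End']
```

Note that trailing whitespace stays attached to the preceding sentence, and joining the tokens reconstructs the input.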
Interestingly, when you compare the output of PunctuationTokenizer and UnicodeSentenceTokenizer against the variety of Unicode languages, there are some differences. Just from eye-balling the output (without any understanding of the languages), both implementations have different pitfalls. Here are some random examples I noticed:
Language | Text | UnicodeSentenceTokenizer | PunctuationTokenizerParams
---|---|---|---
Bengali | ভাষা।[৫] | "ভাষা।[", "৫]" | "ভাষা।", "[৫]"
Hindi | गया था।[4][5] | "गया था।[", "4][5]" | "गया था।", "[4][5]"
Tamil | வந்தது.[15] | "வந்தது.[", "15]" | "வந்தது.", "[15]"
German | Deutsch (abgekürzt dt.) ist eine. | "Deutsch (abgekürzt dt.) ist eine." | "Deutsch (abgekürzt dt.", ") ist eine."
Rust's Unicode support just blows me away. It's far better than in other programming languages that I have worked with.
One issue I had was how to import a macro across different files: I have the vecString macro copied in 3 files, src/tokenize_sentence/mod.rs, src/tokenize_sentence/tests.rs and python/src/tokenize_sentence.rs. Do you know how I can import the macro from one file to another? I tried the obvious answers in the Rust examples but couldn't get any of them to work.
Interestingly, when you compare the output of PunctuationTokenizer and UnicodeSentenceTokenizer against the variety of unicode languages there are some differences.
Interesting. The sentence splitting on [ in UnicodeSentenceTokenizer looks unfortunate. Testing different languages is very useful, but in the end I imagine it's mostly to ensure that we don't break multi-byte Unicode characters in the middle? The punctuation itself used to tokenize would be the same anyway, as far as I understand.
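To illustrate the multi-byte concern: splitting UTF-8 at an arbitrary byte offset can land in the middle of a character for non-ASCII text (in Rust, slicing a &str off a char boundary panics at runtime). A quick Python illustration using the Bengali word from the table above:

```python
# 5 characters, but 15 UTF-8 bytes (3 bytes per character here).
word = "ভাষা।"
data = word.encode("utf-8")
print(len(word), len(data))  # 5 15

# Cutting after byte 1 lands inside the first 3-byte character.
try:
    data[:1].decode("utf-8")
except UnicodeDecodeError as e:
    print("mid-character split:", e.reason)
```

Any splitter that works on byte offsets rather than char boundaries would produce invalid UTF-8 on such input, which is what the multilingual tests guard against.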
Rusts Unicode support just blows me away. It's far better than other programming languages that I have worked with.
It is! Avoiding dealing with Unicode in Cython or C was one of my main motivations for starting this repo.
Do you know how I can import the macro from one file to another?
I haven't used macros extensively, but there might be something about needing to use them from the top level of the crate: https://github.com/rth/vtext/pull/70#discussion_r438036935
Also, could you please add the sentence tokenizers to https://github.com/rth/vtext/blob/master/doc/python-api.rst?
I find it quite impressive that the unicode-segmentation crate doesn't necessarily break on punctuation:
>>> from vtext.tokenize_sentence import UnicodeSentenceTokenizer
>>> from vtext.tokenize_sentence import PunctuationTokenizer
>>> s = "Please confirm R.S.V.P. by tomorrow. Thanks!!"
>>> UnicodeSentenceTokenizer().tokenize(s)
['Please confirm R.S.V.P. by tomorrow. ', 'Thanks!!']
>>> PunctuationTokenizer().tokenize(s)
['Please confirm R.', 'S.', 'V.', 'P. ', 'by tomorrow. ', 'Thanks!']
BTW, there is a bug in PunctuationTokenizer in the last case, for the final !. It might be helpful to create random strings with hypothesis in python/vtext/tests/test_tokenize.py containing punctuation, whitespace and a few letters, and check that after tokenization and concatenation the length of the string is unchanged.
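A self-contained sketch of that reconstruction invariant, using the stdlib random module in place of hypothesis and a throwaway regex splitter in place of PunctuationTokenizer, so it runs without vtext installed (the real test would call the actual tokenizer):

```python
import random
import re

def toy_tokenize(text):
    # Stand-in splitter: an optional non-punctuation run followed by a
    # punctuation run, or a trailing punctuation-free chunk.
    return re.findall(r'[^!.?]*[!.?]+|[^!.?]+', text)

random.seed(0)
alphabet = "!.? \tab"  # punctuation, whitespace and a few letters
for _ in range(500):
    s = ''.join(random.choices(alphabet, k=random.randint(0, 30)))
    tokens = toy_tokenize(s)
    # No characters may be gained or lost by tokenization.
    assert ''.join(tokens) == s, (s, tokens)
print("reconstruction invariant holds on 500 random strings")
```

A property like this would have caught the dropped trailing `!`, since `"Thanks!!"` must round-trip unchanged.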
Cheers for the feedback 😃. I will review today.
The punctuation itself used to tokenize would be the same anyway as far as I understand.
The punctuation set used for the other languages is (from spaCy):
let punct = vecString![
'!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', '।', '॥', '၊', '။', '።', '፧', '፨', '᙮',
'᜵', '᜶', '᠃', '᠉', '᥄', '᥅', '᪨', '᪩', '᪪', '᪫', '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿',
'‼', '‽', '⁇', '⁈', '⁉', '⸮', '⸼', '꓿', '꘎', '꘏', '꛳', '꛷', '꡶', '꡷', '꣎', '꣏', '꤯', '꧈',
'꧉', '꩝', '꩞', '꩟', '꫰', '꫱', '꯫', '﹒', '﹖', '﹗', '!', '.', '?', '𐩖', '𐩗', '𑁇', '𑁈',
'𑂾', '𑂿', '𑃀', '𑃁', '𑅁', '𑅂', '𑅃', '𑇅', '𑇆', '𑇍', '𑇞', '𑇟', '𑈸', '𑈹', '𑈻', '𑈼', '𑊩', '𑑋',
'𑑌', '𑗂', '𑗃', '𑗉', '𑗊', '𑗋', '𑗌', '𑗍', '𑗎', '𑗏', '𑗐', '𑗑', '𑗒', '𑗓', '𑗔', '𑗕', '𑗖', '𑗗',
'𑙁', '𑙂', '𑜼', '𑜽', '𑜾', '𑩂', '𑩃', '𑪛', '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟',
'𝪈', '。', '。'
];
I will add this to the documentation and direct users to the examples in the tests.
Added punctuation sentence tokenizer. It's not possible to use Regex, so I implemented an iterator instead. I will look into adding the Python bit.