Tokenize sentences starting with a number

ekstroem commented 6 years ago

When tokenizing text into sentences, I run into problems with sentences that start with a number.

library(tokenizers)
tokenize_sentences("This is some text. On line 2. 1972 was a bad year")
[[1]]
[1] "This is some text."             "On line 2. 1972 was a bad year"

I'd really like this to be three sentences rather than two, but because 1972 is not "capitalized" the boundary before it is not detected.

This might be related to the stringi package, but I'm not entirely sure. Maybe you have a crude fix for it?

kbenoit commented 6 years ago

It's a stringi thingy. An alternative: use a proper parser, such as spaCy. The spacyr package will segment the sentences correctly, and you can then use group_by() (from dplyr) or similar operations to join the tokens back into sentences.

> library(spacyr)
> spacy_initialize()
Finding a python executable with spacy installed...
spaCy (language model: en) is installed in more than one python
spacyr will use /anaconda/bin/python (because ask = FALSE)
successfully initialized (spaCy Version: 2.0.1, language model: en)
> spacy_parse("This is some text. On line 2.  1972 was a bad year.")
   doc_id sentence_id token_id token lemma   pos     entity
1   text1           1        1  This  this   DET           
2   text1           1        2    is    be  VERB           
3   text1           1        3  some  some   DET           
4   text1           1        4  text  text  NOUN           
5   text1           1        5     .     . PUNCT           
6   text1           2        1    On    on   ADP           
7   text1           2        2  line  line  NOUN           
8   text1           2        3     2     2   NUM CARDINAL_B
9   text1           2        4     .     . PUNCT           
10  text1           2        5             SPACE           
11  text1           3        1  1972  1972   NUM     DATE_B
12  text1           3        2   was    be  VERB           
13  text1           3        3     a     a   DET     DATE_B
14  text1           3        4   bad   bad   ADJ     DATE_I
15  text1           3        5  year  year  NOUN     DATE_I
16  text1           3        6     .     . PUNCT
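
The regrouping step suggested above could look roughly like the sketch below. It uses base R's aggregate() in place of group_by() so that it runs without extra packages. The `parsed` data frame is a hand-typed stand-in for the first columns of the spacy_parse() output (an assumption made so the example works without spaCy installed), the whitespace token from row 10 is left out, and the punctuation cleanup is a crude heuristic rather than a proper detokenizer.

```r
# Hypothetical stand-in for part of the spacy_parse() result shown above,
# typed by hand so the example runs without spaCy. The SPACE token is omitted.
parsed <- data.frame(
  doc_id      = "text1",
  sentence_id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3),
  token       = c("This", "is", "some", "text", ".",
                  "On", "line", "2", ".",
                  "1972", "was", "a", "bad", "year", "."),
  stringsAsFactors = FALSE
)

# Paste the tokens of each (doc_id, sentence_id) group back into one string.
sentences <- aggregate(token ~ doc_id + sentence_id, data = parsed,
                       FUN = paste, collapse = " ")

# Crude cleanup: remove the space that paste() inserted before punctuation.
sentences$text <- gsub(" ([.,;:!?])", "\\1", sentences$token)

sentences$text
# [1] "This is some text."   "On line 2."           "1972 was a bad year."
```

With real spacy_parse() output, the same call should work after dropping whitespace tokens first, for example by filtering on pos != "SPACE".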

ekstroem commented 6 years ago

Perfect. Thanks!

ropensci / tokenizers

Tokenize sentences starting with a number #59