Closed ekstroem closed 6 years ago
This comes from stringi. An alternative is to use a proper parser such as spaCy. The spacyr package will segment the sentences correctly, and you can then use group_by() and related dplyr operations to join the tokens back into sentences.
> library(spacyr)
> spacy_initialize()
Finding a python executable with spacy installed...
spaCy (language model: en) is installed in more than one python
spacyr will use /anaconda/bin/python (because ask = FALSE)
successfully initialized (spaCy Version: 2.0.1, language model: en)
> spacy_parse("This is some text. On line 2. 1972 was a bad year.")
   doc_id sentence_id token_id token lemma   pos     entity
1   text1           1        1  This  this   DET
2   text1           1        2    is    be  VERB
3   text1           1        3  some  some   DET
4   text1           1        4  text  text  NOUN
5   text1           1        5     .     . PUNCT
6   text1           2        1    On    on   ADP
7   text1           2        2  line  line  NOUN
8   text1           2        3     2     2   NUM CARDINAL_B
9   text1           2        4     .     . PUNCT
10  text1           2        5             SPACE
11  text1           3        1  1972  1972   NUM     DATE_B
12  text1           3        2   was    be  VERB
13  text1           3        3     a     a   DET     DATE_B
14  text1           3        4   bad   bad   ADJ     DATE_I
15  text1           3        5  year  year  NOUN     DATE_I
16  text1           3        6     .     . PUNCT
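For the "join the tokens back into sentences" step, here is a minimal sketch of how that could look with dplyr. It is not part of spacyr itself. To keep it runnable without a spaCy installation, `parsed` is a hand-built stand-in that copies a few columns of the spacy_parse() output above; in real use you would pass the result of spacy_parse() instead. The names `parsed` and `sentences` and the punctuation cleanup are my own assumptions, and `.groups = "drop"` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Stand-in for the spacy_parse() result shown above (subset of columns only).
parsed <- data.frame(
  doc_id      = "text1",
  sentence_id = rep(1:3, times = c(5, 5, 6)),
  token_id    = c(1:5, 1:5, 1:6),
  token       = c("This", "is", "some", "text", ".",
                  "On", "line", "2", ".", " ",
                  "1972", "was", "a", "bad", "year", "."),
  pos         = c("DET", "VERB", "DET", "NOUN", "PUNCT",
                  "ADP", "NOUN", "NUM", "PUNCT", "SPACE",
                  "NUM", "VERB", "DET", "ADJ", "NOUN", "PUNCT"),
  stringsAsFactors = FALSE
)

sentences <- parsed %>%
  # Drop the whitespace-only token that spaCy tags as SPACE.
  filter(pos != "SPACE") %>%
  # One group per sentence within each document.
  group_by(doc_id, sentence_id) %>%
  # Paste the tokens of each sentence together, separated by spaces.
  summarise(sentence = paste(token, collapse = " "), .groups = "drop") %>%
  # Crude detokenization: remove the space left in front of punctuation.
  mutate(sentence = gsub(" ([[:punct:]])", "\\1", sentence))

print(sentences$sentence)
# "This is some text."   "On line 2."   "1972 was a bad year."
```

The punctuation cleanup is deliberately crude: it also pulls opening quotes and brackets onto the preceding word, so text with those would need a more careful rule.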
Perfect. Thanks!
When tokenizing into sentences, I run into problems with sentences that start with numbers.
I'd really like this to be three sentences and not just two, but because 1972 is not "capitalized" it does not get split off.
This might be related to the stringi package, but I'm not entirely sure, and maybe you have a crude fix for this.