Closed trinker closed 7 years ago
@trinker: Looks like a very good plan to me, and I agree that rJava is annoying. Do you have any info on how well textshape handles dialog (quoted material) in a text? openNLP was not great with dialog.
@trinker: Here is an example sentence that textshape's split_sentence() does not split correctly but openNLP, via get_sentences(), does:
` test <- "‘He has asked the Administration to be sent there,’ said the other, ‘with the idea of showing what he could do; and I was instructed accordingly.’ They both agreed it was frightful, then made several bizarre remarks: ‘Make rain and fine weather—one man—the Council—by the nose’—bits of absurd sentences that got the better of my drowsiness when the uncle said, ‘The climate may do away with this difficulty for you."
split_sentence(test)
[[1]]
[1] "‘He has asked the Administration to be sent there,’ said the other, ‘with the idea of showing what he could do; and I was instructed accordingly.’ They both agreed it was frightful, then made several bizarre remarks: ‘Make rain and fine weather—one man—the Council—by the nose’—bits of absurd sentences that got the better of my drowsiness when the uncle said, ‘The climate may do away with this difficulty for you."
get_sentences(test)
[1] "‘He has asked the Administration to be sent there,’ said the other, ‘with the idea of showing what he could do; and I was instructed accordingly.’"
[2] "They both agreed it was frightful, then made several bizarre remarks: ‘Make rain and fine weather—one man—the Council—by the nose’—bits of absurd sentences that got the better of my drowsiness when the uncle said, ‘The climate may do away with this difficulty for you."
`
@mjockers Thanks for the consideration. I plan to update textshape's algorithm to handle quoted material better. After I've tested it a bit I'll let you know and you can try it out and see if it could be a viable solution.
If you have any small test sets with the desired segmentation output for quoted material, they would be useful in making the updates.
I added some handling for quoted text. Give it a whirl. The example you give suffers from an additional problem: the quotes are curly (non-ASCII) quotes. split_sentence can't handle the curly quotes, but I show how to deal with them below. If you decide to use textshape, you'll have to decide whether to handle the replacement of curly quotes in get_sentences or leave that to the user as part of their encoding work.
library(devtools)
install_github('trinker/textshape')
library(textshape)
y <- c(paste(
"\x91He has asked the Administration to be sent there,\x92 said the",
"other, \x91with the idea of showing what he could do; and I was instructed",
"accordingly.\x92 They both agreed it was frightful, then made several",
"bizarre remarks: \x91Make rain and fine weather-one man-the Council-by the",
"nose-bits of absurd sentences that got the better of my drowsiness when",
"the uncle said, \x91The climate may do away with this difficulty for you.",
"And one more, \x93How bout that!\x94 But still there is \x93another.\x94,",
"but who? No. 3 will. No. He will not!",
collapse = ' '
), "I said no. Now stop!", "I will not! Yes you will.")
Encoding(y) <- "latin1"
y
## doesn't handle the non-ascii chars
split_sentence(y)
## [[1]]
## [1] "\u0091He has asked the Administration to be sent there,\u0092 said the other, \u0091with the idea of showing what he could do; and I was instructed accordingly.\u0092 They both agreed it was frightful, then made several bizarre remarks: \u0091Make rain and fine weather-one man-the Council-by the nose-bits of absurd sentences that got the better of my drowsiness when the uncle said, \u0091The climate may do away with this difficulty for you."
## [2] "And one more, \u0093How bout that!\u0094 But still there is \u0093another.\u0094, but who?"
## [3] "No. 3 will."
## [4] "No."
## [5] "He will not!"
##
## [[2]]
## [1] "I said no." "Now stop!"
##
## [[3]]
## [1] "I will not!" "Yes you will."
replace_curly <- function(x, ...){
    replaces <- c('\x91', '\x92', '\x93', '\x94')
    Encoding(replaces) <- "latin1"
    for (i in 1:4) {
        x <- gsub(replaces[i], c("'", "'", "\"", "\"")[i], x, fixed = TRUE)
    }
    x
}
split_sentence(replace_curly(y))
## [[1]]
## [1] "'He has asked the Administration to be sent there,' said the other, 'with the idea of showing what he could do; and I was instructed accordingly.'"
## [2] "They both agreed it was frightful, then made several bizarre remarks: 'Make rain and fine weather-one man-the Council-by the nose-bits of absurd sentences that got the better of my drowsiness when the uncle said, 'The climate may do away with this difficulty for you."
## [3] "And one more, \"How bout that!\""
## [4] "But still there is \"another.\", but who?"
## [5] "No. 3 will."
## [6] "No."
## [7] "He will not!"
##
## [[2]]
## [1] "I said no." "Now stop!"
##
## [[3]]
## [1] "I will not!" "Yes you will."
A possible update of the function that includes handling of non-ascii:
#' Sentence Tokenization
#' @description
#' Parses a string into a vector of sentences.
#' @param text_of_file A Text String
#' @param fix_curly_quotes logical. If \code{TRUE} curly quotes will be
#' converted to ASCII representation before splitting.
#' @param as_vector If \code{TRUE} the result is unlisted. If \code{FALSE}
#' the result stays as a list of the original text string elements split into
#' sentences.
#' @return A Character Vector of Sentences
#' @export
#'
get_sentences <- function(text_of_file, fix_curly_quotes = TRUE, as_vector = TRUE){
    if (!is.character(text_of_file)) stop("Data must be a character vector.")
    if (isTRUE(fix_curly_quotes)) text_of_file <- replace_curly(text_of_file)
    splits <- textshape::split_sentence(text_of_file)
    if (isTRUE(as_vector)) splits <- unlist(splits)
    splits
}
## helper curly quote function
replace_curly <- function(x, ...){
    replaces <- c('\x91', '\x92', '\x93', '\x94')
    Encoding(replaces) <- "latin1"
    for (i in 1:4) {
        x <- gsub(replaces[i], c("'", "'", "\"", "\"")[i], x, fixed = TRUE)
    }
    x
}
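The helper above matches the Windows-1252 bytes \x91 to \x94 marked as latin1, which may not match curly quotes on every platform when the input is UTF-8 text, where those quotes are the code points U+2018, U+2019, U+201C and U+201D. As a rough sketch only (the name replace_curly_utf8 is hypothetical and is not part of textshape or syuzhet), a UTF-8-oriented variant could look like this:

```r
## Hypothetical UTF-8 variant of the curly quote helper (an assumption for
## illustration; not a function from textshape or syuzhet)
replace_curly_utf8 <- function(x, ...){
    ## make sure the input is marked as UTF-8 before matching
    x <- enc2utf8(x)
    ## left and right single curly quotes -> ASCII apostrophe
    x <- gsub("[\u2018\u2019]", "'", x)
    ## left and right double curly quotes -> ASCII double quote
    x <- gsub("[\u201C\u201D]", "\"", x)
    x
}

replace_curly_utf8("\u2018Make rain,\u2019 he said. \u201CHow bout that!\u201D")
```

Whether this or the byte-based helper is the better fit depends on how the text was read in, so it may be worth checking Encoding() on real input before choosing.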
@mjockers I have made updates to the textshape package that will meet your needs for sentence boundary disambiguation tasks and sent it to CRAN (version 1.3.0). This will enable syuzhet to drop the rJava dependency that comes via openNLP. You may have to install textshape 1.3.0 from source until the binaries are created, though I believe you're a Mac user and the El Capitan binaries have been built.
I've made the changes in syuzhet and bumped the version to 1.0.2 in this pull request: https://github.com/mjockers/syuzhet/pull/24. I've done the CRAN checks on R 3.4.0 Windows and everything passes.
I hope this will suit your needs and that you will consider incorporating the changes. Please let me know if there is anything else I can do to assist in the process.
install.packages("textshape", type="source")
library(devtools)
install_github('trinker/syuzhet')
library(syuzhet)
(x <- c(
"‘He has asked the Administration to be sent there,’ said the other, ‘with the idea of showing what he could do; and I was instructed accordingly.’ They both agreed it was frightful, then made several bizarre remarks: ‘Make rain and fine weather—one man—the Council—by the nose’—bits of absurd sentences that got the better of my drowsiness when the uncle said, ‘The climate may do away with this difficulty for you.",
paste0(
"Mr. Brown comes! He says hello. i give him coffee. i will ",
"go at 5 p. m. eastern time. Or somewhere in between!go there"
),
paste0(
"Marvin K. Mooney Will You Please Go Now!", "The time has come.",
"The time has come. The time is now. Just go. Go. GO!",
"I don't care how."
)))
get_sentences(x)
## [1] "'He has asked the Administration to be sent there,' said the other, 'with the idea of showing what he could do; and I was instructed accordingly.'"
## [2] "They both agreed it was frightful, then made several bizarre remarks: 'Make rain and fine weather\u0097one man\u0097the Council\u0097by the nose'\u0097bits of absurd sentences that got the better of my drowsiness when the uncle said, 'The climate may do away with this difficulty for you."
## [3] "Mr. Brown comes!"
## [4] "He says hello."
## [5] "i give him coffee."
## [6] "i will go at 5 p.m. eastern time."
## [7] "Or somewhere in between!"
## [8] "go there"
## [9] "Marvin K. Mooney Will You Please Go Now!"
## [10] "The time has come."
## [11] "The time has come."
## [12] "The time is now."
## [13] "Just go."
## [14] "Go."
## [15] "GO!"
## [16] "I don't care how."
get_sentences(x, as_vector = FALSE)
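With as_vector = FALSE the result keeps one list element per input element, which is what allows it to be stored next to the original text as a list column. The sketch below illustrates the idea in base R only; naive_split is a hypothetical stand-in that splits on whitespace after ., ! or ?, and it is an assumption for illustration, not textshape's algorithm:

```r
## Stand-in splitter (assumption): a naive regex split, NOT textshape's algorithm
naive_split <- function(x) strsplit(x, "(?<=[.!?])\\s+", perl = TRUE)

acts <- data.frame(
    act = 1:3,
    text = c("I said no. Now stop!", "I will not! Yes you will.", "Just go. Go. GO!"),
    stringsAsFactors = FALSE
)

as_list <- naive_split(acts$text)   ## one list element per input row
as_vec <- unlist(as_list)           ## flat vector; the row hierarchy is lost

## the list lines up with the rows, so it can be stored as a list column
acts$sentences <- as_list
lengths(acts$sentences)

## the flat vector is longer than the data frame, so it cannot be a column
length(as_vec) == nrow(acts)
```

The same reasoning applies inside a mutate call: a result with one element per row fits, while a longer flat vector does not.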
Merged pull request #24 from trinker/master
Hi @mjockers I currently import the syuzhet dictionary into my own lexicon package per this PR: https://github.com/mjockers/syuzhet/pull/19
There is one downside: syuzhet utilizes openNLP to split sentences. I am writing to request an alternative sentence segmentation resource, such as my own textshape package. My argument for dropping openNLP is fourfold: (A) it presents a significant setup hurdle for many users, (B) it is less accurate, (C) it is much slower than the alternatives, and (D) it strips the original element-to-sentence hierarchy.
First, openNLP presents a significant user setup hurdle. openNLP has an rJava dependency. rJava is a thorn in many users' sides, including experienced users (e.g., https://github.com/trinker/qdap/issues/232), and is difficult (or impossible) to install if you're trying to set up computing in a cloud service like Microsoft Azure. I have a network of packages that in turn rely on lexicon, which makes all of them dependent on rJava. Dropping openNLP removes the Java dependency, making syuzhet purely R based and thus easier to set up.
Second, openNLP is less accurate than the textshape alternative I am proposing. Here we see similar use:
Now we amp it up with a subset of joyces_portrait: openNLP gets n = 727 sentences vs. textshape's n = 758. That's ~30 fewer sentences detected by the openNLP algorithm. I have used several reputable text programs (script at the end) for segmentation and compared their sentence counts: (A) coreNLP n = 756, (B) textblob n = 757, (C) nltk n = 757, (D) spacy n = 757, and (E) pattern n = 758. We see that textshape is much closer to these other segmentation tools than openNLP is.
*Analyzed using: http://textanalysisonline.com
Third, openNLP is slow. Let's demo this by taking the subset of joyces_portrait and multiplying it by 100. The code below shows that the textshape approach is 21 times faster on this text at segmenting.
Finally, get_sentences strips out the element ordering. In a book this may be less important, but even then one wants to keep chapters or acts straight, and doing so is difficult. The example below shows that get_sentences returns one vector of segmented sentences. textshape returns a list of 3, one for each act in the play.
This means that get_sentences will not play nicely in a dplyr mutate statement, as the length returned is longer than the input, resulting in an error. textshape, on the other hand, returns a list column:
Proposed non-OpenNLP Sentence Segmentation Function
This would be a possible non-openNLP segmentation approach with textshape:
By switching to this function, syuzhet could drop its openNLP dependency.
Thank you for your time and consideration for dropping the openNLP dependency from syuzhet.
Additional code for comparing segmentation lengths of prominent text analysis software: