mjockers / syuzhet

An R package for the extraction of sentiment and sentiment-based plot arcs from text

Drop openNLP/rJava dependency #22

Closed trinker closed 7 years ago

trinker commented 7 years ago

Hi @mjockers I currently import the syuzhet dictionary into my own lexicon package per this PR: https://github.com/mjockers/syuzhet/pull/19

There is one downside: syuzhet utilizes openNLP to split sentences. I am writing to request an alternative sentence segmentation resource such as my own textshape package. My argument for dropping openNLP is fourfold: (A) it presents a significant setup hurdle for many users, (B) it is less accurate, (C) it is much slower than alternatives, and (D) it strips the original element-to-sentence hierarchy.

First, openNLP presents a significant user setup hurdle. openNLP has an rJava dependency, and rJava is a thorn in many users' sides, including experienced users (e.g., https://github.com/trinker/qdap/issues/232); it is also difficult (or impossible) to configure when setting up computing in a cloud service like Microsoft Azure. I have a network of packages that in turn rely on lexicon, all of which become tied to rJava. Dropping openNLP removes the Java dependency, making syuzhet pure R and thus easier to set up.

Second, openNLP is less accurate than the textshape alternative I am proposing. Here are the two used on the same simple example:

library(syuzhet)
library(textshape)
library(tidyverse)

my_example_text <- "I begin this story with a neutral statement.  
  Basically this is a very silly test.  
  You are testing the Syuzhet package using short, inane sentences.  
  I am actually very happy today. 
  I have finally finished writing this package.  
  Tomorrow I will be very sad. 
  I won't have anything left to do. 
  I might get angry and decide to do something horrible.  
  I might destroy the entire package and start from scratch.  
  Then again, I might find it satisfying to have completed my first R package. 
  Honestly this use of the Fourier transformation is really quite elegant.  
  You might even say it's beautiful!"

get_sentences(my_example_text)

textshape::split_sentence(my_example_text)

Now we amp it up with a subset of joyces_portrait: openNLP detects n = 727 sentences vs. n = 758 for textshape. That's ~30 fewer sentences detected by the openNLP algorithm. I have run several reputable text programs (script at end) on the same text and compared their sentence counts: (A) coreNLP n = 756, (B) textblob n = 757, (C) nltk n = 757, (D) spacy n = 757 & (E) pattern n = 758. We see textshape is much closer to these other segmentation tools than openNLP.

*Analyzed using: http://textanalysisonline.com

Third, openNLP is slow. Let's demonstrate this by taking the subset of joyces_portrait and repeating it 100 times. The code below shows that the textshape approach is about 21 times faster at segmenting this text.

> ## subset of joyces_portrait
> x <- readLines('https://gist.githubusercontent.com/trinker/03a75e5fe935223d87085e50d01b981e/raw/f83d4170a11e4d60040a5f9d14ef9c3a0d7c22af/example_text')
> y <- paste(rep(x, 100), collapse = ' ')
> 
> gar <- gc(); start <- Sys.time()
> a <- get_sentences(y) 
> Sys.time() - start
Time difference of 52.11476 secs
> 
> gar <- gc(); start <- Sys.time()
> b <- textshape::split_sentence(y) 
> Sys.time() - start
Time difference of 2.43971 secs
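
For reference, a more repeatable version of the same comparison (a sketch, not part of the original benchmark) could use base R's system.time() instead of the manual Sys.time() bookkeeping:

## sketch: same comparison with system.time(); assumes the gist above is reachable
x <- readLines('https://gist.githubusercontent.com/trinker/03a75e5fe935223d87085e50d01b981e/raw/f83d4170a11e4d60040a5f9d14ef9c3a0d7c22af/example_text')
y <- paste(rep(x, 100), collapse = ' ')

system.time(a <- get_sentences(y))              ## openNLP-backed syuzhet
system.time(b <- textshape::split_sentence(y))  ## textshape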

Finally, get_sentences strips out the element ordering. In a single book this may matter less, but even then one wants to keep chapters or acts straight, and that becomes difficult. The example below shows that get_sentences returns one flat vector of segmented sentences, while textshape returns a list of three elements, one for each act in the play.

z <- tibble::tibble(
  act = 1:3,
  text = c("I begin this story with a neutral statement.  
    Basically this is a very silly test.  
    You are testing the Syuzhet package using short, inane sentences.  
    I am actually very happy today. 
    I have finally finished writing this package.",  
    "Tomorrow I will be very sad. 
    I won't have anything left to do. 
    I might get angry and decide to do something horrible.",   
    "I might destroy the entire package and start from scratch.  
    Then again, I might find it satisfying to have completed my first R package. 
    Honestly this use of the Fourier transformation is really quite elegant.  
    You might even say it's beautiful!"
  )
)

get_sentences(z$text)
textshape::split_sentence(z$text)

## > get_sentences(z$text)
##  [1] "I begin this story with a neutral statement."                                
##  [2] "Basically this is a very silly test."                                        
##  [3] "You are testing the Syuzhet package using short, inane sentences."           
##  [4] "I am actually very happy today."                                             
##  [5] "I have finally finished writing this package."                               
##  [6] "Tomorrow I will be very sad."                                                
##  [7] "I won't have anything left to do."                                           
##  [8] "I might get angry and decide to do something horrible."                      
##  [9] "I might destroy the entire package and start from scratch."                  
## [10] "Then again, I might find it satisfying to have completed my first R package."
## [11] "Honestly this use of the Fourier transformation is really quite elegant."    
## [12] "You might even say it's beautiful!"                                          
## > textshape::split_sentence(z$text)
## [[1]]
## [1] "I begin this story with a neutral statement."                     
## [2] "Basically this is a very silly test."                             
## [3] "You are testing the Syuzhet package using short, inane sentences."
## [4] "I am actually very happy today."                                  
## [5] "I have finally finished writing this package."                    
## 
## [[2]]
## [1] "Tomorrow I will be very sad."                          
## [2] "I won't have anything left to do."                     
## [3] "I might get angry and decide to do something horrible."
## 
## [[3]]
## [1] "I might destroy the entire package and start from scratch."                  
## [2] "Then again, I might find it satisfying to have completed my first R package."
## [3] "Honestly this use of the Fourier transformation is really quite elegant."    
## [4] "You might even say it's beautiful!" 

This means that get_sentences will not play nicely in a dplyr mutate() statement: the length of the result is longer than the input, producing an error. textshape, on the other hand, returns a list column:

> z %>%
+     dplyr::mutate(sents = get_sentences(text))
Error in mutate_impl(.data, dots) : 
  wrong result size (12), expected 3 or 1
> 
> z %>%
+     dplyr::mutate(sents = textshape::split_sentence(text)) 
# A tibble: 3 × 3
    act
  <int>
1     1
2     2
3     3
# ... with 2 more variables: text <chr>, sents <list>
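
As a possible follow-up (a sketch, not in the original example), the list column can then be expanded to one row per sentence while keeping the act id, for example with tidyr::unnest():

z %>%
    dplyr::mutate(sents = textshape::split_sentence(text)) %>%
    tidyr::unnest(sents) %>%    ## one row per sentence, act id preserved
    dplyr::select(act, sents)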

Proposed non-openNLP Sentence Segmentation Function

This is a possible non-openNLP segmentation approach using textshape. By switching to this function, syuzhet could drop its openNLP dependency.

#' Sentence Tokenization
#' @description
#' Parses a string into a vector of sentences.
#' @param text_of_file A Text String
#' @param as_vector If \code{TRUE} the result is unlisted.  If \code{FALSE}
#' the result stays as a list of the original text string elements split into 
#' sentences.
#' @return A Character Vector of Sentences
#' @export
#' 
get_sentences <- function(text_of_file, as_vector = TRUE){
  if (!is.character(text_of_file)) stop("Data must be a character vector.")
  splits <- textshape::split_sentence(text_of_file)
  if (isTRUE(as_vector)) splits <- unlist(splits)
  splits
}
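
Hypothetical usage of the proposed function (assuming it replaces the current get_sentences()):

get_sentences(z$text)                     ## flat character vector, as today
get_sentences(z$text, as_vector = FALSE)  ## list preserving the element-to-sentence hierarchy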

Thank you for your time and for considering dropping the openNLP dependency from syuzhet.

Additional code for comparing segmentation lengths of prominent text analysis software

## subset of joyces_portrait
x <- readLines('https://gist.githubusercontent.com/trinker/03a75e5fe935223d87085e50d01b981e/raw/f83d4170a11e4d60040a5f9d14ef9c3a0d7c22af/example_text')

## syuzhet via openNLP n = 727
get_sentences(x) %>%
    unlist() %>% 
    length()

## textshape n = 758
textshape::split_sentence(x) %>%
    unlist() %>% 
    length() 

## coreNLP n = 756
cmd <- "java -cp \"C:/stanford-corenlp-full-2016-10-31/*\" -mx5g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators \"tokenize,ssplit\""
results <- system(cmd, input = x, intern = TRUE, ignore.stderr = TRUE)
grep('^Sentence', results, value = TRUE) %>%
    length()

## http://textanalysisonline.com/
## textblob n = 757
readLines('https://gist.githubusercontent.com/trinker/3732befb8a7b0425ae9d7efddefab94e/raw/e98d6ec1a074af41df8aa6d447947d869b58a6c4/example_text_split') %>%
    stringi::stri_split_fixed('<br><br>') %>%
    unlist() %>%
    length()

## nltk n = 757
readLines('https://gist.githubusercontent.com/trinker/aed0942a326372df88884a8e61fe3122/raw/4c315adf27ff6a50c645fc7bada0c2eb9b43c4d0/example_text_nltk') %>%
    stringi::stri_split_fixed('<br><br>') %>%
    unlist() %>%
    length()

## spacy n = 757
readLines('https://gist.githubusercontent.com/trinker/8bfdfb4dd46e6787913ed542cb82e56e/raw/baf0c0a49cb0de53fac1cacff027ebe34cfda24a/example_text_spacy') %>%     
    stringi::stri_split_fixed('<br><br>') %>%
    unlist() %>%
    length()

## pattern n = 758
readLines('https://gist.githubusercontent.com/trinker/935b942aec8ff305d19de7e0c01129e2/raw/158831e219e386681b30f220765a79cf7d597ea6/example_text_pattern') %>%     
    stringi::stri_split_fixed('<br><br>') %>%
    unlist() %>%
    length()
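
For convenience, the counts reported above can be collected into a single named vector (a sketch; the numbers are taken from the runs reported in this issue):

counts <- c(openNLP = 727, textshape = 758, coreNLP = 756, textblob = 757,
            nltk = 757, spacy = 757, pattern = 758)
sort(counts)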
mjockers commented 7 years ago

@trinker: Looks like a very good plan to me, and I agree that rJava is annoying. Do you have any info on how well textshape handles dialog (quoted material) in a text? openNLP was not great with dialog.

mjockers commented 7 years ago

@trinker: here is an example sentence that does not get split correctly with textshape but does with openNLP via get_sentences()

test <- "‘He has asked the Administration to be sent there,’ said the other, ‘with the idea of showing what he could do; and I was instructed accordingly.’ They both agreed it was frightful, then made several bizarre remarks: ‘Make rain and fine weather—one man—the Council—by the nose’—bits of absurd sentences that got the better of my drowsiness when the uncle said, ‘The climate may do away with this difficulty for you."

split_sentence(test)
## [[1]]
## [1] "‘He has asked the Administration to be sent there,’ said the other, ‘with the idea of showing what he could do; and I was instructed accordingly.’ They both agreed it was frightful, then made several bizarre remarks: ‘Make rain and fine weather—one man—the Council—by the nose’—bits of absurd sentences that got the better of my drowsiness when the uncle said, ‘The climate may do away with this difficulty for you."

get_sentences(test)
## [1] "‘He has asked the Administration to be sent there,’ said the other, ‘with the idea of showing what he could do; and I was instructed accordingly.’"
## [2] "They both agreed it was frightful, then made several bizarre remarks: ‘Make rain and fine weather—one man—the Council—by the nose’—bits of absurd sentences that got the better of my drowsiness when the uncle said, ‘The climate may do away with this difficulty for you."

trinker commented 7 years ago

@mjockers Thanks for the consideration. I plan to update textshape's algorithm to handle quoted material better. After I've tested it a bit, I'll let you know so you can try it out and see whether it is a viable solution.

If you have any small test sets with desired segmentation output for quoted material, those could be useful in making the updates.

trinker commented 7 years ago

I added some handling for quoted text. Give it a whirl. The example you give suffers from an additional problem: the quotes are curly (non-ASCII) quotes. split_sentence can't handle the curly quotes, but I show how to handle them below. If you decide to use textshape, you'll have to decide whether to handle the replacement of curly quotes in get_sentences or pass that along to the user as encoding work.

library(devtools)
install_github('trinker/textshape')
library(textshape)

y <- c(paste(
    "\x91He has asked the Administration to be sent there,\x92 said the",
    "other, \x91with the idea of showing what he could do; and I was instructed",
    "accordingly.\x92 They both agreed it was frightful, then made several",
    "bizarre remarks: \x91Make rain and fine weather-one man-the Council-by the",
    "nose-bits of absurd sentences that got the better of my drowsiness when",
    "the uncle said, \x91The climate may do away with this difficulty for you.",
    "And one more, \x93How bout that!\x94 But still there is \x93another.\x94,",
    "but who?  No. 3 will.  No.  He will not!",
    collapse = ' '
), "I said no.  Now stop!", "I will not!  Yes you will.")
Encoding(y) <- "latin1"
y

## doesn't handle the non-ascii chars
split_sentence(y)

## [[1]]
## [1] "\u0091He has asked the Administration to be sent there,\u0092 said the other, \u0091with the idea of showing what he could do; and I was instructed accordingly.\u0092 They both agreed it was frightful, then made several bizarre remarks: \u0091Make rain and fine weather-one man-the Council-by the nose-bits of absurd sentences that got the better of my drowsiness when the uncle said, \u0091The climate may do away with this difficulty for you."
## [2] "And one more, \u0093How bout that!\u0094 But still there is \u0093another.\u0094, but who?"                                                                                                                                                                                                                                                                                                                                                                  
## [3] "No. 3 will."                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [4] "No."                                                                                                                                                                                                                                                                                                                                                                                                                                                         
## [5] "He will not!"                                                                                                                                                                                                                                                                                                                                                                                                                                                
## 
## [[2]]
## [1] "I said no." "Now stop!" 
## 
## [[3]]
## [1] "I will not!"   "Yes you will."

## helper: replace latin1 curly quotes with their ASCII equivalents
replace_curly <- function(x, ...){
    replaces <- c('\x91', '\x92', '\x93', '\x94')
    Encoding(replaces) <- "latin1"
    for (i in 1:4) {
        x <- gsub(replaces[i], c("'", "'", "\"", "\"")[i], x, fixed = TRUE)
    }
    x
}

split_sentence(replace_curly(y))

## [[1]]
## [1] "'He has asked the Administration to be sent there,' said the other, 'with the idea of showing what he could do; and I was instructed accordingly.'"                                                                                                                         
## [2] "They both agreed it was frightful, then made several bizarre remarks: 'Make rain and fine weather-one man-the Council-by the nose-bits of absurd sentences that got the better of my drowsiness when the uncle said, 'The climate may do away with this difficulty for you."
## [3] "And one more, \"How bout that!\""                                                                                                                                                                                                                                           
## [4] "But still there is \"another.\", but who?"                                                                                                                                                                                                                                  
## [5] "No. 3 will."                                                                                                                                                                                                                                                                
## [6] "No."                                                                                                                                                                                                                                                                        
## [7] "He will not!"                                                                                                                                                                                                                                                               
## 
## [[2]]
## [1] "I said no." "Now stop!" 
## 
## [[3]]
## [1] "I will not!"   "Yes you will."

A possible update of the function that includes handling of non-ASCII curly quotes:

#' Sentence Tokenization
#' @description
#' Parses a string into a vector of sentences.
#' @param text_of_file A Text String
#' @param fix_curly_quotes logical.  If \code{TRUE} curly quotes will be 
#' converted to ASCII representation before splitting.
#' @param as_vector If \code{TRUE} the result is unlisted.  If \code{FALSE}
#' the result stays as a list of the original text string elements split into 
#' sentences.
#' @return A Character Vector of Sentences
#' @export
#' 
get_sentences <- function(text_of_file, fix_curly_quotes = TRUE, as_vector = TRUE){

  if (!is.character(text_of_file)) stop("Data must be a character vector.")
  if (isTRUE(fix_curly_quotes)) text_of_file <- replace_curly(text_of_file)

  splits <- textshape::split_sentence(text_of_file)
  if (isTRUE(as_vector)) splits <- unlist(splits)
  splits
}

## helper curly quote function
replace_curly <- function(x, ...){
    replaces <- c('\x91', '\x92', '\x93', '\x94')
    Encoding(replaces) <- "latin1"
    for (i in 1:4) {
        x <- gsub(replaces[i], c("'", "'", "\"", "\"")[i], x, fixed = TRUE)
    }
    x
}
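
Hypothetical usage of the updated function, reusing the y object defined above:

get_sentences(y)                            ## curly quotes replaced, flat vector of sentences
get_sentences(y, fix_curly_quotes = FALSE)  ## leave quote/encoding handling to the user
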
trinker commented 7 years ago

@mjockers I have made updates to the textshape package that should meet your needs for sentence boundary disambiguation and sent it to CRAN (version 1.3.0). This will enable syuzhet to drop the rJava dependency it picks up via openNLP. You may have to install textshape 1.3.0 from source until the binaries are created, though I believe you're a Mac user and the El Capitan binaries have been built.

I've made the changes in syuzhet and bumped the version to 1.0.2 in this pull request: https://github.com/mjockers/syuzhet/pull/24. I've run the CRAN checks on R 3.4.0 (Windows) and everything passes.

I hope this will suit your needs and that you will consider incorporating the changes. Please let me know if there is anything else I can do to assist in the process.

Test out the pull request

install.packages("textshape", type="source")
library(devtools)
install_github('trinker/syuzhet')
library(syuzhet)

(x <- c(
"‘He has asked the Administration to be sent there,’ said the other, ‘with the idea of showing what he could do; and I was instructed accordingly.’ They both agreed it was frightful, then made several bizarre remarks: ‘Make rain and fine weather—one man—the Council—by the nose’—bits of absurd sentences that got the better of my drowsiness when the uncle said, ‘The climate may do away with this difficulty for you.",
paste0(
    "Mr. Brown comes! He says hello. i give him coffee.  i will ",
    "go at 5 p. m. eastern time.  Or somewhere in between!go there"
),
paste0(
    "Marvin K. Mooney Will You Please Go Now!", "The time has come.",
    "The time has come. The time is now. Just go. Go. GO!",
    "I don't care how."
)))

get_sentences(x)

##  [1] "'He has asked the Administration to be sent there,' said the other, 'with the idea of showing what he could do; and I was instructed accordingly.'"                                                                                                                                              
##  [2] "They both agreed it was frightful, then made several bizarre remarks: 'Make rain and fine weather\u0097one man\u0097the Council\u0097by the nose'\u0097bits of absurd sentences that got the better of my drowsiness when the uncle said, 'The climate may do away with this difficulty for you."
##  [3] "Mr. Brown comes!"                                                                                                                                                                                                                                                                                
##  [4] "He says hello."                                                                                                                                                                                                                                                                                  
##  [5] "i give him coffee."                                                                                                                                                                                                                                                                              
##  [6] "i will go at 5 p.m. eastern time."                                                                                                                                                                                                                                                               
##  [7] "Or somewhere in between!"                                                                                                                                                                                                                                                                        
##  [8] "go there"                                                                                                                                                                                                                                                                                        
##  [9] "Marvin K. Mooney Will You Please Go Now!"                                                                                                                                                                                                                                                        
## [10] "The time has come."                                                                                                                                                                                                                                                                              
## [11] "The time has come."                                                                                                                                                                                                                                                                              
## [12] "The time is now."                                                                                                                                                                                                                                                                                
## [13] "Just go."                                                                                                                                                                                                                                                                                        
## [14] "Go."                                                                                                                                                                                                                                                                                             
## [15] "GO!"                                                                                                                                                                                                                                                                                             
## [16] "I don't care how."          

get_sentences(x, as_vector = FALSE)
mjockers commented 7 years ago

Merged pull request #24 from trinker/master