trinker / sentimentr

Dictionary based sentiment analysis that considers valence shifters

Possible cleaning function for reimport #86

Open trinker opened 5 years ago

trinker commented 5 years ago

This would belong in textclean, but the target is abbreviated forms like fan vs. fanatic, which score differently:

> sentiment(c("He's a nice guy", "can be a jerk. I'm not a fan."))
   element_id sentence_id word_count sentiment
1:          1           1          4      0.25
2:          2           1          4     -0.25
3:          2           2          4      0.00
> sentiment(c("He's a nice guy", "can be a jerk. I'm not a fanatic."))
   element_id sentence_id word_count sentiment
1:          1           1          4      0.25
2:          2           1          4     -0.25
3:          2           2          4      0.25
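
To see where the difference comes from, one can look both forms up in the default polarity table (terms absent from the keyed data.table come back as NA):

## how 'fan' vs. 'fanatic' score in the default Jockers-Rinker table
lexicon::hash_sentiment_jockers_rinker[c('fan', 'fanatic')]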

The shortened form could be replaced with the longer one:

WIP

## pronoun cues ('he's', 'you're', 'I'm', ...); inner groups are
## non-capturing so the backreferences in fix_fan() stay stable
pronouns <- c("s?he(?: i|')s", "(?:you|they|we)(?: a|')re", "I(?: a|')m")
pro_replacements <- paste0('(', paste(pronouns, collapse = '|'), ')')

fix_fan <- function(x, ...){
    ## allow up to 20 characters between the pronoun cue and 'fan'
    gsub(
        paste0(pro_replacements, "(.{1,20}?\\b[Ff]an)(s?)\\b"),
        '\\1\\2atic\\3',
        x, perl = TRUE, ignore.case = TRUE
    )
}

fix_fan("He's the biggest fan I know.")
trinker commented 5 years ago

Would be in textclean but re-exported by sentimentr.
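
A minimal sketch of the roxygen2 re-export pattern that would support this, assuming a hypothetical textclean::replace_shortenings() as the function's eventual home:

## in sentimentr, e.g. R/reexports.R (file and function name are placeholders)
#' @importFrom textclean replace_shortenings
#' @export
textclean::replace_shortenings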

trinker commented 5 years ago
inputs <- c(
    "He's the biggest fan I know.",
    "I am a huge fan of his.",
    "I know she has lots of fans in his club",
    "I was cold and turned on the fan",
    "An air conditioner is better than 2 fans at cooling.",
    "I'm a really gigantic and humble fan of the book."
)

fix_fan <- function(x, pronoun.distance = 20, ...){

    ## replace 'fan'/'fans' with 'fanatic'/'fanatics' when a pronoun cue
    ## ("he's", "she has", "you're", "I'm", ...) occurs within
    ## `pronoun.distance` characters before it
    gsub(
        paste0("((?:s?he(?: i| ha|')s|(?:you|they|we)(?: a|')re|I(?: a|')m).{1,", pronoun.distance, "})\\b(fan)(s?)\\b"),
        '\\1\\2atic\\3',
        x,
        ignore.case = TRUE
    )

}

fix_fan2 <- function(x, pronoun.distance = 20, ...){

    ## stringi variant of fix_fan(); ICU replacement syntax uses $1, not \1
    stringi::stri_replace_all_regex(
        x,
        paste0("((?:s?he(?: i| ha|')s|(?:you|they|we)(?: a|')re|I(?: a|')m).{1,", pronoun.distance, "})\\b(fan)(s?)\\b"),
        '$1$2atic$3',
        opts_regex = stringi::stri_opts_regex(case_insensitive = TRUE)
    )

}

fix_fan(inputs)
fix_fan(inputs, 30)
fix_fan2(inputs)
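
For reference, with the patterns above the first three inputs should come back with fanatic/fanatics substituted, the fourth and fifth should be untouched (no pronoun cue precedes fan), and the last should only change when pronoun.distance is raised to 30, since fan sits more than 20 characters after "I'm".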
trinker commented 5 years ago

Other examples include:

tibble::tribble(
  ~short,   ~long,
  "fan",    "fanatic",
  "emo",    "emotionally disturbed"
)
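
A rough sketch of generalizing fix_fan() to run off such a table follows; replace_shortenings() is a placeholder name, not existing textclean API, and the plural handling from fix_fan() is omitted for brevity:

replace_shortenings <- function(x, key = shortenings, pronoun.distance = 20, ...){
    ## apply each short -> long substitution when it follows a pronoun cue
    for (i in seq_len(nrow(key))) {
        x <- gsub(
            paste0(
                "((?:s?he(?: i| ha|')s|(?:you|they|we)(?: a|')re|I(?: a|')m)",
                ".{1,", pronoun.distance, "})\\b", key$short[i], "\\b"
            ),
            paste0('\\1', key$long[i]),
            x,
            perl = TRUE,
            ignore.case = TRUE
        )
    }
    x
}

## the table above, assigned for use as the default key
shortenings <- tibble::tribble(
  ~short,   ~long,
  "fan",    "fanatic",
  "emo",    "emotionally disturbed"
)

replace_shortenings("I'm not a fan and you're too emo about it.")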
trinker commented 5 years ago

Note these are called shortenings:

https://en.oxforddictionaries.com/spelling/shortenings

and more formally: https://en.wikipedia.org/wiki/Clipping_(morphology)

trinker commented 5 years ago
## check whether these clipped forms are already scored in the default polarity table
y <- c('tazer', 'emo', 'typo', 'quake', 'scram')
lexicon::hash_sentiment_jockers_rinker[y]

Consider adding these to the polarity table directly.
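
A rough sketch of what that might look like with sentimentr::as_key(); the polarity values below are illustrative placeholders, not vetted lexicon scores:

## append the clipped forms to a copy of the default polarity table,
## assuming they are absent from it (the lookup above checks this)
additions <- data.frame(
    x = c('fan', 'emo'),
    y = c(-0.5, -0.25),   # placeholder polarities
    stringsAsFactors = FALSE
)

my_key <- sentimentr::as_key(
    rbind(as.data.frame(lexicon::hash_sentiment_jockers_rinker), additions)
)

sentimentr::sentiment("I'm not a fan.", polarity_dt = my_key)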