mjhendrickson / rtweet-sentiment-analysis

Sentiment Analysis of Tweets via rtweet
MIT License

Select R Library for Sentiment Analysis #2

Closed mjhendrickson closed 4 years ago

mjhendrickson commented 4 years ago

Determine the best library, or libraries, to use for sentiment analysis.

syuzhet is utilized in the walkthrough by Michael Kearney (rtweet creator) https://mkearney.github.io/blog/2017/06/01/intro-to-rtweet/

tidytext as outlined in Text Mining with R by Julia Silge and David Robinson. https://www.tidytextmining.com/

SentimentAnalysis https://cran.r-project.org/web/packages/SentimentAnalysis/vignettes/SentimentAnalysis.html

sentimentr https://cran.r-project.org/web/packages/sentimentr/readme/README.html

saotd https://cran.r-project.org/web/packages/saotd/vignettes/saotd.html

mjhendrickson commented 4 years ago

First up, evaluate syuzhet per the walkthrough by Michael Kearney. https://mkearney.github.io/blog/2017/06/01/intro-to-rtweet/

Main takeaway - there are some useful elements; however, the walkthrough no longer holds up because the tokenize argument no longer works. I did not find a suitable alternative that would let me continue with the walkthrough.

Modified the example to use #rstats related tweets.

Learned:

  1. plain_tweets() is useful for extracting plain text from the specified text fields.
  2. The tokenize argument to plain_tweets() mentioned in the walkthrough does not work.
    1. Need another tokenization method. Tokenization ideas:
      1. https://masalmon.eu/2019/01/01/r-goals/
      2. https://twitter.com/juliasilge/status/1001553030011961345
      3. https://cran.r-project.org/web/packages/tokenizers/vignettes/introduction-to-tokenizers.html
    2. tokenizers::tokenize_tweets() may work.
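
As a fallback while a package-based tokenizer is being chosen, tokenization can be roughed out in base R. This is a minimal sketch, not the method from the walkthrough: `tokenize_tweet_text()` is a hypothetical helper, the sample tweets are made up, and the regular expression is a simplifying assumption that keeps hashtags and mentions intact but splits contractions such as "don't".

```r
# Hypothetical helper: a rough, base-R tweet tokenizer.
# It is not part of rtweet or tokenizers.
tokenize_tweet_text <- function(text) {
  text <- tolower(text)
  # Drop URLs first so they are not split into fragments.
  text <- gsub("https?://\\S+", " ", text, perl = TRUE)
  # Keep runs of letters, digits, and underscores, optionally prefixed by # or @.
  regmatches(text, gregexpr("[#@]?[a-z0-9_]+", text, perl = TRUE))
}

# Made-up tweets for illustration only.
tweets <- c("Loving #rstats today! https://example.com/post",
            "@someone tidy data is great")

# A list with one character vector of tokens per tweet.
tokens <- tokenize_tweet_text(tweets)
print(tokens)
```

The result is a list rather than a data frame, so it would still need reshaping before it can be joined against a sentiment lexicon.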

mjhendrickson commented 4 years ago

Next: evaluate tidytext as outlined in Text Mining with R by Julia Silge and David Robinson (https://www.tidytextmining.com/), specifically the section on Twitter at https://www.tidytextmining.com/twitter.html

Main takeaway - this is a rich, highly usable resource that will be a wonderful reference moving forward in this project. However, the Twitter case study does not cover sentiment, so the rest of the book still needs to be explored.

Learned:

  1. unnest_tokens() to tokenize text and unnest the tokens into one row per token. This utilizes the tokenizers package. Stop words are removed in a separate step against the stop_words dataset.
  2. nest() to nest dataframes within dataframes.
  3. map() to run functions across elements within a nested dataframe.
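
The three functions above can be sketched together as follows. This is a minimal illustration, assuming the dplyr, tidyr, purrr, and tidytext packages are installed; the tweets and column names (`screen_name`, `text`) are made up and are not taken from the book or from rtweet output.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(tidytext)

# Made-up tweets for illustration only.
tweets <- tibble(
  screen_name = c("user_a", "user_a", "user_b"),
  text = c("Learning the tidyverse is fun",
           "Text mining with R",
           "The plots in this book are great")
)

# One row per word; unnest_tokens() lowercases and strips punctuation by default.
# Stop-word removal is a separate step, done here with anti_join().
tidy_tweets <- tweets %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

# One row per user, with that user's tokens in a list-column named `data`.
nested <- tidy_tweets %>%
  group_by(screen_name) %>%
  nest()

# map_int() applies a function to each nested data frame.
word_counts <- nested %>%
  mutate(n_words = map_int(data, nrow)) %>%
  ungroup()

print(word_counts)
```

Counting rows is only a placeholder; any function that takes a data frame, such as a model fit, could be mapped over the `data` column in the same way.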

mjhendrickson commented 4 years ago

Next: evaluate SentimentAnalysis https://cran.r-project.org/web/packages/SentimentAnalysis/vignettes/SentimentAnalysis.html https://github.com/sfeuerriegel/SentimentAnalysis

Main takeaway - overall this is a strong package and has many great features for sentiment analysis. However, there are some possible limitations due to the construction and content of tweets.

It is possible to create your own dictionary, which could be useful for handling tweet-specific vocabulary. While interesting, that may fall outside the scope of this analysis.

Learned:

  1. There are multiple dictionaries for analyzing sentiment:
    1. Harvard-IV dictionary
    2. Henry’s Financial dictionary (Henry 2008)
    3. Loughran-McDonald Financial dictionary (Loughran and McDonald 2011)
    4. QDAP dictionary from the package qdapDictionaries
  2. Sentiment can be compared and analyzed as binary classes (positive/negative), as continuous dictionary scores, or as a combination of the two.
  3. You can create custom dictionaries.
  4. The library can generate a dictionary using LASSO regularization, selecting the words that are significant predictors of a response variable.
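
The difference between continuous dictionary scores and binary classes can be sketched as below. This assumes the SentimentAnalysis package is installed and uses the functions shown in its vignette (`analyzeSentiment()` and `convertToBinaryResponse()`); the two documents are illustrative only and are not tweets from this project.

```r
library(SentimentAnalysis)

# Illustrative documents only.
docs <- c("That book was excellent.",
          "The service in this restaurant was miserable.")

# One row per document, one column per dictionary-based measure.
sentiment <- analyzeSentiment(docs)

# Continuous score from the QDAP dictionary.
print(sentiment$SentimentQDAP)

# The same result collapsed into binary classes (positive / negative).
binary <- convertToBinaryResponse(sentiment)
print(binary$SentimentQDAP)
```

Short, informal text such as tweets may match few dictionary entries, which is one of the possible limitations noted above.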

mjhendrickson commented 4 years ago

Next: evaluate sentimentr https://cran.r-project.org/web/packages/sentimentr/readme/README.html

Main takeaway - this looks like a fantastic package with many great ways to visualize sentiment. I'm curious whether these methods are available outside of the package. It is unclear whether the package adequately supports Twitter text without a custom dictionary. Building one may be complicated because I do not have standard valence scores for Twitter-specific words, though such scores may be available in other packages. This route seems more trouble than it is worth if other packages already have good lexicon values.

Learned:

  1. Built to address limitations in qdap and syuzhet, aiming for a balance of accuracy and speed. Attempts to take valence shifters into account (negators, amplifiers, de-amplifiers, adversative conjunctions).
  2. Two main functions with many helper functions:
    1. sentiment()
    2. sentiment_by()
  3. Vignette shows many great ways to plot sentiment.
  4. Easy to make and update dictionaries.
  5. Helpful comparison in the vignette between other packages: Comparing sentimentr, syuzhet, meanr, and Stanford.
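
The effect of valence shifters can be seen with a few short sentences. This is a minimal sketch, assuming the sentimentr package is installed and using its default lexicon; the example sentences are made up for illustration.

```r
library(sentimentr)

# Made-up sentences: a plain statement, a negated one, and an amplified one.
txt <- c("I like it.",
         "I do not like it.",
         "I really like it.")

# sentiment() scores each sentence separately.
by_sentence <- sentiment(get_sentences(txt))
print(by_sentence)

# sentiment_by() aggregates the sentence scores per text element.
by_element <- sentiment_by(get_sentences(txt))
print(by_element)
```

With the default lexicon, the negated sentence should score below zero while the plain one scores above it, which a simple word-count approach would miss.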

mjhendrickson commented 4 years ago

Next: evaluate saotd https://cran.r-project.org/web/packages/saotd/vignettes/saotd.html

Main takeaway - this is an excellent package geared toward Twitter data, which also draws some elements from tidytext to shape the data frame. There is high utility here with data manipulation, analysis, and visualization.

Learned:

  1. Do not need rtweet directly to pull data. Can use the saotd function tweet_acquire. Seems possible to still use rtweet if preferred to gather more information, but may need more cleaning.
  2. Vignette draws from tidytext, which was evaluated above. tweet_tidy uses tidytext to tokenize each tweet, producing one row per word: every field of the original tweet record is copied onto each row, and the word is appended as a new column at the end. "The cleaning process removes: “@”, “#” and “RT” symbols, Weblinks, Punctuation, Emojis, and Stop Words like (“the”, “of”, etc.)."
  3. Explored unigrams, bigrams, trigrams - iterations of the n-gram, which is a contiguous sequence of n items from the given text (here, Tweets).
  4. Sentiment analysis is iterative and requires looking at the n-grams to see whether words should be combined into single entities (such as with merge_terms) or whether misspellings need to be corrected.
  5. saotd provides bigram_network and word_corr_network to visualize the bigram network and word correlations.
  6. Sentiment is calculated with posneg_words to get positive and negative sentiment by word. Words can easily be filtered out if they skew the sentiment up or down.
  7. Focus seems heavy on hashtags as opposed to true sentiment.
  8. Sentiment scores can be tracked over time to show trends.
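
To make the n-gram idea concrete, here is a minimal base-R sketch. It does not call saotd: `word_ngrams()` is a hypothetical helper, the tweet text is made up, and the `gsub()` step only imitates the idea of merging terms rather than reproducing how merge_terms is implemented.

```r
# Hypothetical helper: build word n-grams from one piece of text (base R only).
word_ngrams <- function(text, n) {
  # Split on anything that is not a letter, digit, underscore, #, @, or apostrophe.
  words <- strsplit(tolower(text), "[^a-z0-9_#@']+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  # Paste each window of n consecutive words into one string.
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Made-up tweet text for illustration only.
tweet <- "Machine learning in R is fun"

unigrams <- word_ngrams(tweet, 1)
bigrams  <- word_ngrams(tweet, 2)
trigrams <- word_ngrams(tweet, 3)
print(bigrams)

# Merging a frequent bigram into a single entity before tokenizing again.
merged <- gsub("machine learning", "machine_learning", tolower(tweet), fixed = TRUE)
merged_unigrams <- word_ngrams(merged, 1)
print(merged_unigrams)
```

Inspecting the bigrams first is what reveals which pairs, such as "machine learning" here, are worth merging before sentiment is scored word by word.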

mjhendrickson commented 4 years ago

After reviewing each of the packages (syuzhet, tidytext, SentimentAnalysis, sentimentr, saotd), I will begin this analysis with saotd, as it was created specifically for Twitter data and utilizes some elements from tidytext. As I continue the analysis, I may branch out into tidytext or other packages as suitable.

mjhendrickson commented 4 years ago

Reconsider package given limitations of saotd. https://github.com/mjhendrickson/rtweet-sentiment-analysis/issues/4