trinker / sentimentr

Dictionary based sentiment analysis that considers valence shifters

Improve sentiment_by documentation c() vs list() usage #71

Closed trinker closed 6 years ago

trinker commented 6 years ago

Email Question

My name is Kota. Thank you for your hard work creating great packages for text mining. I am currently playing with the sentimentr package, and I have a question for you. I saw the examples on your GitHub page and in the CRAN manual, and I am puzzled by the by argument. You used both list and c in the by argument, and depending on which one I use, I get correct or incorrect results. Please see the following code. There are three cases. In the first case, I used with(). As you demonstrated in the CRAN manual, list in the by argument works, but c does not. In the second case, I created an object with get_sentences() first and then used sentiment_by(). In this case, c works in the by argument, but list does not. In the final case, I used the pipe operation. list in the by argument is OK, but c is not.

Seeing the results, I cannot generalize how sentiment_by() works. My only guess is that, depending on which operation users choose, sentiment_by() treats the data set as a different class. Is the function sometimes using data.table and at other times using data.frame or get_sentences_data_frame?

### sentiment_by() with by argument

# With Presidential debates data 2012
# This data set contains one sentence per row. 

# Case 1: with
# list --- OK
# c --- bad

with(presidential_debates_2012, 
     sentiment_by(get_sentences(dialogue), by = list(person, time))
    )

# This is wrong. Why?
with(presidential_debates_2012, 
     sentiment_by(get_sentences(dialogue), by = c("person", "time"))
    )

# Case 2: Create an object with get_sentences() and use it in sentiment_by()
# list --- bad
# c --- ok

# If I create foo first and use it in sentiment_by(), I get the expected outcome.
foo <- presidential_debates_2012 %>%
       get_sentences()

sentiment_by(foo, by = c("person", "time"))

# But list() in by does not work. Why???
sentiment_by(foo, by = list(person, time))

# Case 3: pipe operation
# list --- OK
# c --- bad

presidential_debates_2012 %>%
    get_sentences() %$%
    sentiment_by(dialogue, list(person, time))

# This is a wrong output. Grouping is not happening as I wished.

presidential_debates_2012 %>%
    get_sentences() %$%
    sentiment_by(dialogue, by = c("person", "time"))

I have one more thing. The following code does not work. We need to use %$% after get_sentences().

presidential_debates_2012 %>%
    get_sentences() %>%
    sentiment_by(dialogue, by = list(person, time))

# Error in sentiment_by.get_sentences_data_frame(., dialogue, list(person,  : 
# object 'dialogue' not found

Is this because the object passed is a get_sentences_data_frame and sentiment_by() cannot see its column names? I am not good at understanding this kind of technical point. If you can help me understand this, that would be very helpful.

Response

This can be confusing. The magrittr %>% often makes things easier, but it can also make code harder to reason about. The pipe works really nicely when the first argument to a function is a data set and the other arguments specify the columns to operate on. All the packages in the tidyverse operate this way. The sentimentr package does not. Its first argument is text.var, which usually expects a character vector, not a data.frame, in contrast to what dplyr functions expect. This is why you need with() or, in a chain, the %$% operator. Basically, with() and %$% evaluate the sentiment_by() call inside the data set passed to them, so bare names such as dialogue, person, and time are looked up as its columns.

There is one exception to this. When you pass a get_sentences() object as the first argument (text.var) of sentiment_by(), it calls the sentiment_by.get_sentences_data_frame method, which requires text.var to be a get_sentences_data_frame object. Because this object is a data.frame, the method knows it can access the columns of the get_sentences_data_frame object directly; it just needs the names of the columns to grab.

To illustrate this point, note that all three of these approaches give the same result:

## method 1
presidential_debates_2012 %>%
    get_sentences() %>%
    sentiment_by(by = c('person', 'time'))

## method 2
presidential_debates_2012 %>%
    get_sentences() %$%
    sentiment_by(., by = c('person', 'time'))

## method 3
presidential_debates_2012 %>%
    get_sentences() %$%
    sentiment_by(dialogue, by = list(person, time))
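
The reason method 3 takes list(person, time) while methods 1 and 2 take c('person', 'time') can be reproduced without sentimentr. The following is a minimal base R sketch on a made-up data frame called toy (a stand-in, not the package's data). It shows what each expression evaluates to inside with(); %$% exposes the columns in the same way.

```r
# Hypothetical toy data standing in for presidential_debates_2012
toy <- data.frame(
    person = c("A", "A", "B"),
    time   = c("t1", "t2", "t1"),
    stringsAsFactors = FALSE
)

# Inside with(), bare names are looked up as columns of `toy`,
# so this is a list of two grouping vectors, one value per row
by_list <- with(toy, list(person, time))
length(by_list)
lengths(by_list)

# Quoted names are plain strings; with() does not turn them into columns,
# so this is a character vector of length 2, unrelated to the 3 rows
by_chr <- with(toy, c("person", "time"))
length(by_chr)

# The strings only become useful when a function also receives the data
# and can look the columns up by name
toy[by_chr]
```

So a function that only receives a text vector needs the grouping vectors themselves (the list() form), while a function that receives the whole data frame can work from column names (the c() form).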

Also realize that a get_sentences_data_frame object has a column of class get_sentences_character, and that class has its own sentiment_by method in sentimentr.

presidential_debates_2012 %>%
    get_sentences() %>%
    class()

presidential_debates_2012 %>%
    get_sentences() %>%
    lapply(class)

When you use with() OR %$%, you're not actually passing the get_sentences_data_frame object to sentiment_by(), and hence the sentiment_by.get_sentences_data_frame method isn't called. Rather, sentiment_by() is evaluated in the environment/data of the get_sentences_data_frame object. You can force the object passed this way to be treated as a get_sentences_data_frame object, and thus call the sentiment_by.get_sentences_data_frame method, by using the . placeholder as I've done in method 2 above. Otherwise you pass the name of the text column, which is actually of class get_sentences_character, and it calls its own method. In this case the by argument expects vectors or a list of vectors, and since it's being evaluated within the data set you can use list().
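
The dispatch behaviour described above can be mimicked with a toy S3 generic. This is only a sketch: score_by, toy_sentences_df, and toy_sentences_chr are hypothetical names invented for illustration and are not sentimentr's functions or classes. It shows that passing the whole object picks the data frame method, while with() hands over only the classed text column and so picks the character method.

```r
# Hypothetical generic and methods; these are NOT sentimentr's own code
score_by <- function(x, ...) UseMethod("score_by")

# Method for the whole data frame: `by` would hold column names as strings
score_by.toy_sentences_df <- function(x, by = NULL, ...) {
    "data frame method"
}

# Method for the text column alone: `by` would hold the grouping vectors
score_by.toy_sentences_chr <- function(x, by = NULL, ...) {
    "character method"
}

# Build a classed data frame that carries a classed text column
txt <- structure(
    c("I like it.", "I hate it."),
    class = c("toy_sentences_chr", "character")
)
dat <- data.frame(person = c("A", "B"), stringsAsFactors = FALSE)
dat$dialogue <- txt
class(dat) <- c("toy_sentences_df", "data.frame")

# Passing the object itself dispatches on the data frame class
r1 <- score_by(dat, by = "person")

# with() evaluates the call inside `dat`, so only the column is passed
# and dispatch happens on the column's class instead
r2 <- with(dat, score_by(dialogue, by = list(person)))
```

Here r1 comes from the data frame method and r2 from the character method, which is the same split that decides whether by should be column names or the vectors themselves.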

I think it's difficult to reason about with() OR %$% AND chaining. To understand what's going on you must understand what the magrittr functions %>% and %$% do and how they interact with the methods of the sentimentr package. The sentimentr package was designed outside of the tidyverse...if dplyr existed, it was at about the same time. Since then dplyr has gained popularity, and I've tried to accommodate those who like piping as best I can while maintaining backward compatibility. A change that would eliminate all confusion for those familiar with the tidyverse packages would break backward compatibility for sentimentr and a lot of people's code, so this isn't an option.