Shantanu1497 opened 5 years ago
Writing this to you after running sentimentR on over 75K answers, solving ~200 sentences by hand and absolutely loving every R package you've ever authored/contributed to. Thanks for all the work!
There's one small addition I'd like to make to sentimentr (I've created a PR on trinker/sentimentr), and I'll present my side of the case to you. I've run the algorithm on a lot of conversational text data, which means the data is usually some sort of dialogue. A lot of users place commas and pause words right before valence shifters, which causes differences in the sentiment calculation. I examined the Kaggle Movie Reviews dataset since it's plugged into your package: roughly 33% of its sentences (the dataset has ~7K in total) contain commas, and ~2% of those change sentiment with the addition I've made to the code.
Since Kaggle Movie Reviews isn't a conversational text dataset, I strongly believe these numbers can go up, further increasing the accuracy and scope of sentimentr in certain use cases (also because I've only done this for commas so far; I could include all pauses if that makes sense to you) without worsening other parts, since the code only acts when pauses appear before valence shifters and leaves them alone otherwise. [Currently in trial mode, so it runs when neutral.nonverb.like = T; haven't created a new param for it yet. Will do if you think it's worth an addition.]
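To make the motivation concrete, here's a minimal illustration (mine, not the PR code) of the behaviour I'm describing; if I understand sentimentr's clustering correctly, pause punctuation such as a comma can bound the polarized context cluster, so the two versions below may score differently:
library(sentimentr)
# Same sentence with and without a comma before the adversative conjunction "but".
with_comma    <- "The food was great, but the service was terrible."
without_comma <- "The food was great but the service was terrible."
sentiment(with_comma)$sentiment
sentiment(without_comma)$sentiment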
Attaching some snippets of results and code so you can see the difference yourself; would love to hear back from you.
I'm also working on an algorithm, similar to Stanford's SentProp, for lexicon generation that can help automatically build valence shifters and polar words; I would love to hear your thoughts on whether that's feasible once I've progressed more in that space.
Thanks a lot, Tyler! Your work is kickass and I'm really thankful! :)
P.S We're connected on LinkedIn: https://www.linkedin.com/in/shantanu--kumar/
Thanks @Shantanu1497 This sounds promising. I want to understand more.
You said:
A lot of users place commas and pause words right before valence shifters, which causes differences in the sentiment calculation.
Can you explain from a linguistics standpoint what this means? Not so much about the algo, but about what pause words/commas before valence shifters do?
haven't created a new param for it yet. Will do if you think it's worth an addition.
Yes, otherwise we're overloading one param with 2 different meanings.
solving ~200 sentences by hand
I'd want to do testing of this algo once it's more complete and I have a better understanding of the behavior you're trying to model. I may pay some Mechanical Turk workers to create a test data set. I also wonder whether the behavior you're trying to capture is domain specific or generalizable across English speech contexts. Thoughts?
I'm also working on an algorithm, similar to Stanford's SentProp, for lexicon generation that can help automatically build valence shifters and polar words; I would love to hear your thoughts on whether that's feasible once I've progressed more in that space.
I love this idea. I had played with some prototyping along these lines but didn't have much success and haven't found the development time to pursue it further.
@Shantanu1497 Also I'm thinking...
Does it change sentiment direction or just magnitude typically? It's more useful if it correctly changes direction.
Also I'm wondering about the impact on speed. Once I understand the behavior you're trying to capture (I tried to read through quickly, but many of the object/variable names you used are generic and it's more difficult to parse semantically) there may be ways to keep the speed impact minimal. It might also be something that is handled by the textclean package and reexported by sentimentr.
Thanks a lot for responding, @trinker - I wrote the variables quickly just to test it out myself. I'll clean up the code a bit, fix imports and package dependencies, and create a separate parameter. I'll drop a comment for you to review once I'm done making the changes.
From a linguistics standpoint (references: https://jakubmarian.com/comma-before-whereas-while-and-although/ and https://www.businessinsider.in/THE-COMMA-SUTRA-13-Rules-For-Using-Commas-Without-Looking-Like-An-Idiot/articleshow/22667230.cms), commas and pauses are routinely used before contrasting two clauses, comparing adjectives and offsetting negations.
In our case, hash_valence_shifters[hash_valence_shifters$y %in% c(1, 4), ]
(adversative conjunctions and negators) directly fits the bill from the linguistics standpoint, and the other two uses are slightly/indirectly incorporated - but 😄 from an algorithmic standpoint we would need all valence shifters to be treated this way.
The rule of thumb is: When you contrast two things, use a comma. “Whereas” is typically used to contrast two things:
I am very tall, whereas my wife is quite short. (correct)
I am very tall whereas my wife is quite short. (incorrect)
Use a comma before any coordinating conjunction (and, but, for, or, nor, so, yet) that links two independent clauses.
Example: "I went running, and I saw a duck."
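To make the proposed preprocessing concrete, here's a rough sketch of the kind of rewrite I mean (illustration only - this is not the actual PR code, and drop_comma_before_shifter is just a made-up name):
library(lexicon)
# Negators (y == 1) and adversative conjunctions (y == 4) from lexicon::hash_valence_shifters.
shifters <- lexicon::hash_valence_shifters$x[lexicon::hash_valence_shifters$y %in% c(1, 4)]
# Drop a comma (plus trailing whitespace) when the very next word is one of those shifters.
drop_comma_before_shifter <- function(text) {
  pattern <- paste0(",\\s+(?=(", paste(shifters, collapse = "|"), ")\\b)")
  gsub(pattern, " ", text, perl = TRUE)
}
drop_comma_before_shifter("The plot was clever, but the acting was flat.")
## "The plot was clever but the acting was flat."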
Secondly, the behaviour is generalisable across the English language (particularly where the text uses perfect grammar), and the second addition, where I handle spelling errors, also covers edge cases where the text isn't grammatically perfect. I will post an additional comment soon showing how and why I think this, backed with data from the NYTimes Articles dataset from your package (since I've already done Kaggle Movie Reviews, which is domain specific and not the best example of grammar usage, I tried this dataset because it's about as good as grammar gets and should help support the generalisability of the change across the English language).
Testing shouldn't be a problem at all; I believe it could be split between a domain-specific dataset (maybe Twitter) and a dataset with good grammar usage, to test both extremes of input text.
Thirdly (which I will expand on in another comment, as mentioned above), usefulness could be left to the user to decide. I'll try to describe two use cases; I'd love to be corrected if I'm wrong.
While converting unbounded scores to a discrete range (as I had to), magnitude played a huge role. For example, Stanford converts scores to a 5-point range. If the user maps values between 0 and +0.4 to positive and +0.4 and beyond to very positive, the change impacts the assigned sentiment classes hugely (see the sketch after these two cases). [The user is the primary beneficiary of this case.]
As you noted, direction changes do happen, but in smaller proportions (I'll post proportions and cases in the next comment). This is definitely the most useful case we have, and improving on those smaller proportions while leaving the rest of the scores untouched sounds like something that might interest users too. [sentimentr is the primary beneficiary of this case.]
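To make the first use case concrete, here's a small sketch of the discretisation I mean; the cut points and labels are hypothetical, not something sentimentr or Stanford prescribes:
library(sentimentr)
# Map unbounded average sentiment scores onto discrete classes using made-up thresholds.
scores <- sentiment_by(c(
  "I absolutely loved it.",
  "It was fine.",
  "I hated every minute of it."
))$ave_sentiment
cut(scores,
    breaks = c(-Inf, -0.4, 0, 0.4, Inf),
    labels = c("very negative", "negative", "positive", "very positive"))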
Speed impact is something I haven't looked at yet, but handling through textclean sounds like a great idea. Would love your inputs on this.
Running this on the NYTimes Articles dataset, I found that a substantial amount of text is affected by this change: 3.38% of the entire dataset.
Out of the affected text, 23% of that 3.38% (which amounts to ~0.8% of the entire dataset) reverses polarity. To me, this looks like a sizeable number in both absolute and relative terms.
The median polarity score change (absolute values) comes out to 0.146 and the median % change in score comes to a whopping 81%.
I could drop the code in here for you to replicate - but if you'd want a quick run-through, these are the indices from the nyt_articles dataset within sentimentr that currently reverse polarity.
c(67, 532, 686, 788, 1021, 1042, 1080, 1137, 1349, 1425, 1836, 2033, 2060, 2177, 2260, 2372, 2414, 2437, 2862, 2999, 3016, 3025, 3084, 3243, 3318, 3491, 3623, 3665, 3923, 3998, 4006, 4186, 4432, 4510, 4554, 4665, 4747, 4867, 4875, 4917, 5162)
I don't think I've fully answered the "why" of your linguistics question about pauses and commas - I've probably only answered the "what". I want to drill down on the ~0.8% that reversed polarity and the larger affected 3.38% to pin down the exact effect and be better equipped to answer that soon.
Also, I still have a strong feeling this percentage would go up once I include pauses other than commas and if we switch domains to text with more dialogue; I'll verify that once I'm done writing up the code here.
Hope this helps!
Hey @trinker, the code looks ready for review. I'd love it if you could go through it and let me know if anything else is required while I go through the affected answers and work out the linguistics side.
Running into an issue, not too sure what's causing it - would love some help.
The function run_preprocess removes commas both before and after valence shifters, as I mentioned above, and it does so on my local machine.
As the reference picture shows, it's a common mistake to use commas after these clauses, and it happens quite frequently. When I try to correct it using comma_handler = T, the preprocessing somehow doesn't run. But replicating that behaviour by running run_preprocess and then sentiment works for me locally. Super confused. Is there something in the flow that I've missed?
The first two lines of the code show that the param works fine - but I'm confused as to why the function won't get past the first if() statement in the preprocess function when it does run locally.
@Shantanu1497 Thank you. I plan to have some dev time to look at all of this towards the end of December.
Awesome, @trinker! I'll be done checking out other domains of text by then too and post those findings here. Merry Christmas!
@Shantanu1497 You said:
Out of the affected text, 23% of that 3.38% (which amounts to ~0.8% of the entire dataset) reverses polarity. To me, this looks like a sizeable number in both absolute and relative terms.
This gives us a nice place to start some early testing (even though I haven't tried the new approach, we can do some testing because you've run it and know that the polarity of the current approach is flipped on these).
x <- c(67, 532, 686, 788, 1021, 1042, 1080, 1137, 1349, 1425, 1836, 2033, 2060, 2177, 2260, 2372,
       2414, 2437, 2862, 2999, 3016, 3025, 3084, 3243, 3318, 3491, 3623, 3665, 3923, 3998, 4006, 4186,
       4432, 4510, 4554, 4665, 4747, 4867, 4875, 4917, 5162)
library(sentimentr)
library(dplyr)
library(magrittr)
library(readr)
out <- nyt_articles %>%
    dplyr::slice(x) %>%
    mutate(
        sentimentr = sentimentr::sentiment_by(text)$ave_sentiment,
        # wrong = TRUE where sentimentr's sign disagrees with the human-coded sentiment sign
        wrong = sign(sentiment) != sign(sentimentr)
    )
mean(out$wrong)
## [1] 0.5121951
readr::write_csv(out, 'sentimentr_testing.csv')
This tells me that about half the time, on this subsample, the current algorithm is incorrect compared to the human scores (I haven't actually checked that the human scores reflect the direction most people would give the text). This means the new algorithm would be wrong about as often (I was hoping the original algorithm was wrong on these most of the time, but 50% makes it a toss-up to simply switch the sign). I'd point out that this is of course specific to just this sample, which is from a very specific domain, so we'd want to conduct similar tests across other known coded samples. It also makes me wonder if the updated algo you're proposing is almost an improvement but needs to capture something else - an if/else that says if THIS condition then NEW algo ELSE OLD algo - but I'm not sure what the condition is?
Checked out a dataset by Cornell; movie dialogues and conversations. https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
Also, the paper it is distributed with. https://www.cs.cornell.edu/~cristian/Chameleons_in_imagined_conversations.html
Out of a total of ~300K sentences from the movie corpus, 108K fit the conditions to run the experiment on (valid UTF-8, word count > 1, and containing at least one valence shifter).
The total number of sentences affected by the change is 5195, which is 4.78% of the corpus.
The total number of sentences that reverse polarity is 1719, which is 33.1% of the affected sentences and 1.58% of the entire corpus.
Drilling down on the reversed polarities (still going through them): you're absolutely right that this needs to capture a certain behaviour of text usage. Looking at these ~1700 responses, the patterns that come out strongly are sarcasm and, secondly, rhetoric. I also think it will only be a substantial improvement WITH an if/else that captures a behaviour. I'm currently looking at co-occurrences of valence shifters, types of polar words, and whether any pattern signals certain behaviour. The Cornell paper was a great help.
This is the code to reproduce what I'm looking at. Ignore certain patterns in the text that escape the regex - this captures the bulk. I've uploaded the .txt file I'm using along with the generated .csv after this, if you wish to skip straight through.
Link to .txt and generated .csv
library(stringr)
library(sentimentr)
library(lexicon)
library(utf8)
library(dplyr)
library(qdap)
library(readr)  # for read_lines()
qa_dataset <- read_lines('movie_lines.txt')
qa_text <- gsub("L[0-9]* [+]*[$][+]* u[0-9]* [+]*[$][+]* m[0-9]* [+]*[$][+]* [a-zA-Z]*?. ?[a-zA-Z]* [+]*[$][+]* ","",qa_dataset) #Removing unnecessary symbols present in text.
qa_text <- qa_text[which(utf8::utf8_valid(qa_text))] #Keeping only valid UTF-8 format
indices <- vector()
# Keep only dialogues that contain at least one valence shifter and have more than one word.
for (i in 1:length(qa_text)) {
  if (any(str_detect(qa_text[i], hash_valence_shifters$x)) && word_count(qa_text[i]) > 1) indices <- append(i, indices)
}
^ Not the best approach, but chosen for simplicity of explanation: taking all dialogues in the dataset that use valence shifters and are more than one word long.
qa_text_valence <- qa_text[indices]
preprocessed_sentences <- unlist(lapply(qa_text_valence,run_preprocess))
sentiments <- sentiment(get_sentences(preprocessed_sentences)) %>%
    group_by(element_id) %>%
    summarise(sent_mean = mean(sentiment))
sentiments_nonprocess <- sentiment(get_sentences(qa_text_valence)) %>%
    group_by(element_id) %>%
    summarise(sent_mean = mean(sentiment))
final_df <- inner_join(sentiments,sentiments_nonprocess,by='element_id')
reversed_indices <- which((final_df$sent_mean.x < 0 & final_df$sent_mean.y >= 0) | (final_df$sent_mean.y < 0 & final_df$sent_mean.x >= 0))
df_reversed <- round(final_df[reversed_indices,c(2,3)],4)
df_reversed$text <- qa_text_valence[reversed_indices]
names(df_reversed) <- c('Preprocess Run','Original sentimentr','Text')
write.csv(df_reversed,'reversed_cornell_movies.csv')
Also observed in the text today (as also pointed out in a currently open issue): there's a particular trend that is low-hanging fruit for the IF condition - any text where words like "ain't", "isn't", "aren't" appear right after a comma should be handled by the original algorithm.
In that context these aren't really negators; used after a comma they carry more of an affirmation/reassurance-seeking tone toward the sentiment already expressed.
UPDATE: Reference: https://learnenglish.britishcouncil.org/intermediate-grammar/question-tags
Question tags: negative tags go with positive sentences and vice versa. In our context, when the word after the comma is one of the question-tag words, we can be fairly sure there's a positive polar word before the comma and flag the original algorithm to handle it.
UPDATE 2: Questions are already being taken into account by the algorithm via the regex "\\?\\s*$". Taking care of question tags therefore isn't an additional task so much as a build on top of the pre-existing logic.
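For reference, that regex just flags text ending in a question mark (with optional trailing whitespace):
grepl("\\?\\s*$", c("The painting is beautiful, isn't it?", "They aren't useful at all."))
## [1]  TRUE FALSE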
Three examples:
They aren't useful at all.
(Aren't being used as a negator here)
They're lovely people, aren't they?
(Different usage here)
The painting is beautiful, isn't it?
(Different usage here)
The trend is that there's usually either an adjective or a noun before these words (mostly an adjective in the data I'm looking at). And if we're looking at domains where the English is carefully structured and follows the rules of language and linguistics, things get a lot simpler: the word right after the comma should be one of "ain't", "isn't", "aren't". (I'll add more words to the list after looking at more data.)
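As a quick sketch of that check (tag_after_comma is a made-up helper for illustration, not part of the PR or sentimentr), looking only at the first word after a comma:
library(stringr)
# TRUE when the first word after a comma is one of the tag-question auxiliaries listed so far.
tag_after_comma <- function(text, tags = c("ain't", "isn't", "aren't")) {
  nxt <- str_match(tolower(text), ",\\s*([a-z']+)")[, 2]
  !is.na(nxt) & nxt %in% tags
}
tag_after_comma("The painting is beautiful, isn't it?")  # TRUE
tag_after_comma("They aren't useful at all.")            # FALSE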
To build for scale and generalise the NEW solution, we could assume condition number 2 (a quick build onto the current NEW algorithm, since it already captures the next word after a comma) holds almost all the time - which might well be what we find after looking at more domains and data.
The aim of this entire exercise I'm currently planning could be:
On a different tangent, I'm thinking about how a comma-adding mechanism could also be looked at. It might be a simple inverse of this entire exercise, or we might figure out the know-how a lot faster after this.
Just thinking out loud, hope this makes sense and has some structure to it.
UPDATE 3: One of the first conditions looks something like: if the input contains a question tag, run it through the OLD algorithm. Almost done building this. It helps in fictional settings (imagined conversations) where text is generated in a controlled environment, and it reduces error rates quite a bit in the movie dialogues corpus - replicable to domains where question-tag usage is high.
@trinker, these are two functions that detect the presence of question tags. The logic behind them: wherever a question mark is found, we take the three words before it and check whether any of them is a negator. If so, the run_preprocess function doesn't run on those indices.
Why I picked a 3-word window: the negator in a question-tag text is most likely at position 2, counting back from the question mark, although it sometimes falls on the first or third word as well.
Examples:
- They're great people, aren't they?
- They're great people, are they not?
- They're great people, don't you think?
Function 1
indices_question_tag_dataset <- function(text){
  # Grab up to three words preceding a sentence-final question mark.
  splitted <- str_split(stringi::stri_extract_first(text, regex = "[A-Za-z',]* [A-Za-z',]* [A-Za-z',]* ?\\?\\s*"), ' ')
  # Strip apostrophes, commas and question marks from the captured words.
  unlisted <- lapply(splitted, function(x) gsub("[',?]", '', x))
  # TRUE where any of those words is a negator (hash_valence_shifters$y == 1).
  log <- lapply(unlisted, function(x) any(x %in% hash_valence_shifters$x[hash_valence_shifters$y == 1]))
  indices <- which(unlist(log))
  return(indices)
}
Function 2
is_question_tag <- function(text){
  # Same extraction as above, but returns a logical vector (one value per element of text).
  splitted <- str_split(stringi::stri_extract_first(text, regex = "[A-Za-z',]* [A-Za-z',]* [A-Za-z',]* ?\\?\\s*"), ' ')
  unlisted <- lapply(splitted, function(x) gsub("[',?]", '', x))
  log <- lapply(unlisted, function(x) any(x %in% hash_valence_shifters$x[hash_valence_shifters$y == 1]))
  return(unlist(log))
}
The screenshot below demonstrates how the results vary - the difference isn't large here, but on the dataset I'm trying to get human sentiment scores for (link above) there's a substantial difference, since a huge chunk of the reversed text contains question tags (470 out of 1719, ~28%).
This is the first of n conditions we could use; it handles domains where fictional and imagined conversations take place, and it reduces the overall mean(out$wrong) by excluding question tags from the comma preprocessing. There may be more conditions we come across while switching domains that we can fold into the IF.
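To show how I picture the wiring (an illustration from my side, not the actual PR code; run_preprocess is the PR's preprocessing function and is_question_tag is defined above):
# Apply the comma preprocessing only to texts NOT flagged as question tags;
# flagged texts pass through unchanged so the original algorithm handles them.
preprocess_unless_question_tag <- function(texts) {
  keep_original <- is_question_tag(texts)
  out <- texts
  out[!keep_original] <- unlist(lapply(texts[!keep_original], run_preprocess))
  out
}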
@Shantanu1497 I think adding commas makes sense before any coordinating conjunction, as you mentioned above. Have you found a way to insert these commas? @trinker
@fahadshery For now I'm looking at removal in certain cases, assuming insertion happens on the user's end since that presumes grammatically correct usage. Adding commas could be taken up in another phase if the current path I'm on makes sense to @trinker. Finding labelled sentiment datasets across domains is super tough - realised this recently.
Hey @Shantanu1497 I'm coming back to this. For me the comma negator does change the valence magnitude but not the direction. To me this gain in accuracy isn't worth the loss in speed. With sentimentr I have first tried to get the average correct direction rate up above all, then the speed. I only add to the code base to change direction if it will up the accuracy significantly without harming speed significantly. I'm always less concerned about magnitude correctness. In order to make a change in the magnitude code base the speed would have to be essentially unaffected. Push back.
Also these PRs have become pretty massive. I'm looking to uncouple them and look at specific functionality in a piecemeal way. Hoping you're not too busy to discuss here a bit.
Hey Tyler,
To me this gain in accuracy isn't worth the loss in speed.
Absolutely in sync with this - the functions here essentially affect magnitude, except for the question-tag ones, which need a little more refining and better handling of certain use cases. As an alternative, and keeping in mind the motivation behind sentimentr, I could close this PR and we could drill down on the question-tag functionality specifically (something that wouldn't affect speed). Up to you, and I'm open to any other suggestions or directions you're considering.
Mailed Case and Changes description to tyler.rinker@gmail.com