trinker / sentimentr

Dictionary based sentiment analysis that considers valence shifters

questions regarding repeated commas #83

Closed cschwem2er closed 6 years ago

cschwem2er commented 6 years ago

Hi,

thanks for the awesome package, which I'm currently using to analyze YouTube comments. As you surely know, social media data often does not contain very clean, grammatically correct text. Many of the millions of comments I'm analyzing look like this:

',,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, blah'

'i mean no disrespect bigenosoo1,,,,, nictgranz, man , you are very angery person,, dude relax,,, get a girlfriend,,'

These comments contain a lot of repeated commas. Such comments receive very high or very low sentiment scores (with sentimentr version 2.3.1). I guess this is not intended because, for instance, the two comments above received far more negative sentiment scores than the following ones:

'GO FUCK YOURSELF YOU ARROGANT PRICK GO FUCK YOURSELF YOU ARROGANT PRICK GO FUCK YOURSELF YOU ARROGANT PRICK'

'But that's wrong you fucking retard.'

Why is this the case? And do you suggest that users clean up things like repeated commas before using your package? Maybe this could also be handled by the algorithm itself, without additional preprocessing.

trinker commented 6 years ago

Handling this inside the algorithm would slow the detection down for the sake of a specific use case. My suggestion is to do some cleaning first, gsub(',+', ',', x) maybe?
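
As a minimal sketch of that cleaning step, the snippet below collapses runs of commas before scoring. The helper name `clean_commas` is hypothetical (it is not part of sentimentr), and stripping leading commas is an extra assumption beyond the `gsub` suggestion above. The scoring call only runs if sentimentr is installed.

```r
# Hypothetical helper, not part of sentimentr: tidy repeated commas
# before the text is handed to the sentiment functions.
clean_commas <- function(x) {
  x <- gsub(",+", ",", x)              # collapse ",,,,," into a single ","
  x <- gsub("^[,[:space:]]+", "", x)   # assumption: drop leading commas/whitespace
  x
}

comments <- c(
  ",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, blah",
  "dude relax,,, get a girlfriend,,"
)

cleaned <- clean_commas(comments)
print(cleaned)

# Score the cleaned text only when sentimentr is available.
if (requireNamespace("sentimentr", quietly = TRUE)) {
  print(sentimentr::sentiment_by(sentimentr::get_sentences(cleaned)))
}
```

Because `gsub` is vectorised, the same helper can be applied to a whole column of comments in one call. Commas separated by spaces (such as `, ,`) are left untouched and would need an additional pattern.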