thisandagain / sentiment

AFINN-based sentiment analysis for Node.js.
MIT License

remove single quotes around words while preserving apostrophes #159

Open ejdweck opened 5 years ago

ejdweck commented 5 years ago

I was using the sentiment library and noticed that when I ran the analysis on headlines containing single quotes, the quoted words were not tokenized properly.

For example, for the news headline from cnn.com that reads:

Abrams: Trump is 'wrong,' I am qualified to be Georgia's governor

Here the token 'wrong' should be normalized to wrong, with the surrounding single quotes stripped.

In its current state, the library strips double quotes during tokenization but not single quotes. My guess is that this is deliberate, to preserve apostrophes: if you add \' to the .replace regex, all single quotes are removed, apostrophes included.
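
To illustrate the behavior being requested, here is a minimal sketch of a tokenizer that strips single quotes only at word boundaries, so apostrophes inside a word survive. The `tokenize` function and its regexes are hypothetical stand-ins written for this issue, not the library's actual tokenizer code:

```javascript
// Hypothetical tokenizer sketch; NOT the sentiment library's actual implementation.
function tokenize(input) {
    return input
        .toLowerCase()
        // Replace common punctuation (including double quotes) with spaces.
        .replace(/[.,\/#!?$%\^&\*;:{}=_`"~()]/g, ' ')
        // Remove single quotes only at the start or end of a word.
        // A quote between two letters (an apostrophe) is left alone.
        .replace(/(^|\s)'+|'+(?=\s|$)/g, '$1')
        // Collapse runs of whitespace and split into tokens.
        .replace(/\s+/g, ' ')
        .trim()
        .split(' ');
}

const headline = "Abrams: Trump is 'wrong', I am qualified to be Georgia's governor";
console.log(tokenize(headline));
// tokens include 'wrong' without quotes and "georgia's" with its apostrophe intact
```

One known limitation of this approach: a trailing apostrophe that is part of the word, as in the plural possessive dogs', is also stripped, since it is indistinguishable from a closing quote at this level.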

Here is some code to reproduce the issue:

const Sentiment = require('sentiment');
const sentiment = new Sentiment();

// The same headline with no quotes, single quotes, and double quotes around "wrong".
const noQuotes = "Abrams: Trump is wrong, I am qualified to be Georgia's governor";
const singleQuotes = "Abrams: Trump is 'wrong', I am qualified to be Georgia's governor";
const doubleQuotes = "Abrams: Trump is \"wrong,\" I am qualified to be Georgia's governor";

const noQuotesResult = sentiment.analyze(noQuotes);
const doubleQuotesResult = sentiment.analyze(doubleQuotes);
const singleQuotesResult = sentiment.analyze(singleQuotes);

console.log(noQuotesResult);
console.log(doubleQuotesResult);
console.log(singleQuotesResult);

Output (no quotes, double quotes, single quotes, in that order):

{ score: -2,
  comparative: -0.18181818181818182,
  tokens:
   [ 'abrams',
     'trump',
     'is',
     'wrong',
     'i',
     'am',
     'qualified',
     'to',
     'be',
     'georgia\'s',
     'governor' ],
  words: [ 'wrong' ],
  positive: [],
  negative: [ 'wrong' ] }
{ score: -2,
  comparative: -0.18181818181818182,
  tokens:
   [ 'abrams',
     'trump',
     'is',
     'wrong',
     'i',
     'am',
     'qualified',
     'to',
     'be',
     'georgia\'s',
     'governor' ],
  words: [ 'wrong' ],
  positive: [],
  negative: [ 'wrong' ] }
{ score: 0,
  comparative: 0,
  tokens:
   [ 'abrams',
     'trump',
     'is',
     '\'wrong\'',
     'i',
     'am',
     'qualified',
     'to',
     'be',
     'georgia\'s',
     'governor' ],
  words: [],
  positive: [],
  negative: [] }