#116 Allow token processing "middleware"

nsantini commented 6 years ago

I added node-spellchecker to check for typos, and also "levenshtein" to find the closest spell correction to the original word.

I also modified the "negation" feature to look backwards until a negation word or a new afinn word is found, to cover cases like "not too bad".

nsantini commented 6 years ago

Ok

moved the negation backwards strategy to the language model
added unit test for it

nsantini commented 6 years ago

Finally, made spell checking optional, and unit tested it

elyas-bhy commented 6 years ago

Thanks for the quick update @nsantini, great work! A few more things:

You seem to have added the distance/spellchecking feature exclusively to the english language (the only officially supported language for now, but other languages can be dynamically registered, see Adding new languages. What would it take to make this feature accessible to all languages? Is there a way to set a locale for the spellchecker? If so, we can move the distance/spellchecking modules up a level, and make them available to all languages.
Could you please add the necessary documentation to the README file? Documenting the spellcheck flag would be quite helpful.
Tests look good to me. If you manage to get around the first point of this post, consider adding another test to make sure that your changes are also supported for a new language.

We're almost there!

nsantini commented 6 years ago

The current implementation uses https://www.npmjs.com/package/spellchecker , which are native bindings to NSSpellChecker, Hunspell, or the Windows 8 Spell Check API, depending on your platform. Windows 7 and below as well as Linux will rely on Hunspell. So it depends on what language you have in your system. Also it depends on those subsystems to support multiple languages. So, in a way, the spellchecking implementation supports multiple languages. I added the method signature to the language definition, so other language could support different implementations of the spellchecker. So the question would be, do we want all language spellcheckers to be implemented in one way depending on the used library, or allow each language to use what they want?

elyas-bhy commented 6 years ago

Does this mean that, for example, if my OS is set to use the spanish locale, then the spellchecker module would always try to spellcheck with spanish as the target language, regardless of the actual language of the input? If this is the case, I feel like this would lead to a very inconsistent behaviour.

Is there a way to specify to the spellchecker which language to use? /cc @thisandagain, thoughts?

nsantini commented 6 years ago

There is a mechanism to add a dictionary to the spellchecker, I believe thats how a language gets "set"

nsantini commented 6 years ago

I have refactored the spell checking functionality to use a different library nspell that can support multiple languages. Also updated README with examples of how to use spell checking

elyas-bhy commented 6 years ago

This is all great work, we're almost there!

If we officially add support to one or more new languages, we would have to duplicate the code for the spellchecking feature (since it is currently located in languages/en). Do you think you can move the spellchecking / distance logic up a couple of levels and make it part of the core module (lib)? This would make the feature available to all languages. At the language-level, I think it is best if we only provide the appropriate dictionary, and delegate the rest of the processing to the core module. It would make adding new languages much easier.

For example:

// languages/fr/index.js
module.exports = {
    labels: require('./labels.json'),
    dictionary: require('dictionary-fr'),  // specify the language dictionary here for spellchecking
    scoringStrategy: require('./scoring-strategy')
};

I have noticed a drastic performance hit when comparing before/after this PR: Before:

sentiment (Latest) - Short  x 561,830 ops/sec ±1.76% (92 runs sampled)
sentiment (Latest) - Long   x 2,689 ops/sec ±1.41% (87 runs sampled)
Sentimental (1.0.1) - Short x 314,373 ops/sec ±1.50% (89 runs sampled)
Sentimental (1.0.1) - Long  x 1,171 ops/sec ±2.00% (88 runs sampled)

After:

sentiment (Latest) - Short  x 239,996 ops/sec ±1.64% (91 runs sampled)
sentiment (Latest) - Long   x 0.21 ops/sec ±3.55% (5 runs sampled)
Sentimental (1.0.1) - Short x 236,762 ops/sec ±5.24% (82 runs sampled)
Sentimental (1.0.1) - Long  x 895 ops/sec ±2.54% (80 runs sampled)

Ideally there should be barely any impact when the spellchecking feature is disabled (which it should be, by default). Could you please investigate this issue?

I have also noticed a slight decrease in code coverage (npm run test:coverage). Could you add a few more tests to cover those cases?

And finally, do you mind documenting this feature in the API Reference section of the README file as well?

Thank you for your patience @nsantini and for bearing with me!

thisandagain commented 6 years ago

In addition to the performance implications of this change necessitating the feature be "off" by default, I think the validation tests also strongly suggest that this should be optional:

Before

Amazon accuracy: 0.7202797202797203
IMDB accuracy: 0.7642357642357642
Yelp accuracy: 0.6943056943056943

After

Amazon accuracy: 0.7302697302697303
IMDB accuracy: 0.7532467532467533
Yelp accuracy: 0.6943056943056943

Nice improvement (~1%) on the Amazon validation set, but the IMDB accuracy dropped ~1.1% and Yelp stayed stable. This may suggest that the change is indeed helping with less formal speech, but is actually causing false positives in a more formal corpus (or vs versa), but more investigation would be required.

nsantini commented 6 years ago

Ok, moved the spell checking to the library and left loading the dictionary to the language. Added extra unit test, only two lines on the spell checking are not covered, but they are there for safekeeping of an edge case that shouldnt happen. Looking into performance and accuracy:

sentiment (Latest) - Short x 493,203 ops/sec ±1.14% (90 runs sampled) sentiment (Latest) - Long x 1,452 ops/sec ±3.65% (82 runs sampled) Sentimental (1.0.1) - Short x 306,074 ops/sec ±2.98% (86 runs sampled) Sentimental (1.0.1) - Long x 1,283 ops/sec ±2.45% (87 runs sampled)

Amazon accuracy: 0.7302697302697303 IMDB accuracy: 0.7532467532467533 Yelp accuracy: 0.6943056943056943

The changes to use spell checking, off by default, should not have affected accuracy, since I haven't changed how is tested.

nsantini commented 6 years ago

Hi, latest changes:

README references the documentation of nspell and how to add dictionaries

elyas-bhy commented 6 years ago

LGTM. @thisandagain?

nsantini commented 6 years ago

@thisandagain any more comments on this?

Regarding your comment about looking for alternative string distance algorithms, I would suggest leave this PR as it is (since the current implementation is performant and correct), and open a new issue to look for alternatives

nsantini commented 6 years ago

@thisandagain any calls regarding this PR?

nsantini commented 6 years ago

@elyas-bhy @thisandagain any updates?

elyas-bhy commented 6 years ago

Everything looks good on my side. Still waiting for @thisandagain to approve the changes.

nsantini commented 6 years ago

hi @thisandagain , just checking if you got a change to review the changes

elyas-bhy commented 6 years ago

Ping @thisandagain.

nsantini commented 5 years ago

Should I close this PR? @thisandagain @elyas-bhy

elyas-bhy commented 5 years ago

I'm really looking forward to get this merged, but need @thisandagain to approve and merge this PR.

elyas-bhy commented 4 years ago

Ping @thisandagain

thisandagain / sentiment

#116 Allow token processing "middleware" #144

Before

After