Open nsantini opened 6 years ago
Ok
Finally, made spell checking optional, and unit tested it
Thanks for the quick update @nsantini, great work! A few more things:
You seem to have added the distance/spellchecking feature exclusively to the english language (the only officially supported language for now, but other languages can be dynamically registered, see Adding new languages. What would it take to make this feature accessible to all languages? Is there a way to set a locale for the spellchecker? If so, we can move the distance/spellchecking modules up a level, and make them available to all languages.
Could you please add the necessary documentation to the README file? Documenting the spellcheck flag would be quite helpful.
Tests look good to me. If you manage to get around the first point of this post, consider adding another test to make sure that your changes are also supported for a new language.
We're almost there!
The current implementation uses https://www.npmjs.com/package/spellchecker , which are native bindings to NSSpellChecker, Hunspell, or the Windows 8 Spell Check API, depending on your platform. Windows 7 and below as well as Linux will rely on Hunspell. So it depends on what language you have in your system. Also it depends on those subsystems to support multiple languages. So, in a way, the spellchecking implementation supports multiple languages. I added the method signature to the language definition, so other language could support different implementations of the spellchecker. So the question would be, do we want all language spellcheckers to be implemented in one way depending on the used library, or allow each language to use what they want?
Does this mean that, for example, if my OS is set to use the spanish locale, then the spellchecker module would always try to spellcheck with spanish as the target language, regardless of the actual language of the input? If this is the case, I feel like this would lead to a very inconsistent behaviour.
Is there a way to specify to the spellchecker which language to use? /cc @thisandagain, thoughts?
There is a mechanism to add a dictionary to the spellchecker, I believe thats how a language gets "set"
I have refactored the spell checking functionality to use a different library nspell that can support multiple languages. Also updated README with examples of how to use spell checking
This is all great work, we're almost there!
languages/en
).
Do you think you can move the spellchecking / distance logic up a couple of levels and make it part of the core module (lib
)? This would make the feature available to all languages.
At the language-level, I think it is best if we only provide the appropriate dictionary, and delegate the rest of the processing to the core module. It would make adding new languages much easier.For example:
// languages/fr/index.js
module.exports = {
labels: require('./labels.json'),
dictionary: require('dictionary-fr'), // specify the language dictionary here for spellchecking
scoringStrategy: require('./scoring-strategy')
};
sentiment (Latest) - Short x 561,830 ops/sec ±1.76% (92 runs sampled)
sentiment (Latest) - Long x 2,689 ops/sec ±1.41% (87 runs sampled)
Sentimental (1.0.1) - Short x 314,373 ops/sec ±1.50% (89 runs sampled)
Sentimental (1.0.1) - Long x 1,171 ops/sec ±2.00% (88 runs sampled)
After:
sentiment (Latest) - Short x 239,996 ops/sec ±1.64% (91 runs sampled)
sentiment (Latest) - Long x 0.21 ops/sec ±3.55% (5 runs sampled)
Sentimental (1.0.1) - Short x 236,762 ops/sec ±5.24% (82 runs sampled)
Sentimental (1.0.1) - Long x 895 ops/sec ±2.54% (80 runs sampled)
Ideally there should be barely any impact when the spellchecking feature is disabled (which it should be, by default). Could you please investigate this issue?
npm run test:coverage
). Could you add a few more tests to cover those cases?Thank you for your patience @nsantini and for bearing with me!
In addition to the performance implications of this change necessitating the feature be "off" by default, I think the validation tests also strongly suggest that this should be optional:
Amazon accuracy: 0.7202797202797203
IMDB accuracy: 0.7642357642357642
Yelp accuracy: 0.6943056943056943
Amazon accuracy: 0.7302697302697303
IMDB accuracy: 0.7532467532467533
Yelp accuracy: 0.6943056943056943
Nice improvement (~1%) on the Amazon validation set, but the IMDB accuracy dropped ~1.1% and Yelp stayed stable. This may suggest that the change is indeed helping with less formal speech, but is actually causing false positives in a more formal corpus (or vs versa), but more investigation would be required.
Ok, moved the spell checking to the library and left loading the dictionary to the language. Added extra unit test, only two lines on the spell checking are not covered, but they are there for safekeeping of an edge case that shouldnt happen. Looking into performance and accuracy:
sentiment (Latest) - Short x 493,203 ops/sec ±1.14% (90 runs sampled) sentiment (Latest) - Long x 1,452 ops/sec ±3.65% (82 runs sampled) Sentimental (1.0.1) - Short x 306,074 ops/sec ±2.98% (86 runs sampled) Sentimental (1.0.1) - Long x 1,283 ops/sec ±2.45% (87 runs sampled)
Amazon accuracy: 0.7302697302697303 IMDB accuracy: 0.7532467532467533 Yelp accuracy: 0.6943056943056943
The changes to use spell checking, off by default, should not have affected accuracy, since I haven't changed how is tested.
Hi, latest changes:
nspell
and how to add dictionariesLGTM. @thisandagain?
@thisandagain any more comments on this?
Regarding your comment about looking for alternative string distance algorithms, I would suggest leave this PR as it is (since the current implementation is performant and correct), and open a new issue to look for alternatives
@thisandagain any calls regarding this PR?
@elyas-bhy @thisandagain any updates?
Everything looks good on my side. Still waiting for @thisandagain to approve the changes.
hi @thisandagain , just checking if you got a change to review the changes
Ping @thisandagain.
Should I close this PR? @thisandagain @elyas-bhy
I'm really looking forward to get this merged, but need @thisandagain to approve and merge this PR.
Ping @thisandagain
I added node-spellchecker to check for typos, and also "levenshtein" to find the closest spell correction to the original word.
I also modified the "negation" feature to look backwards until a negation word or a new afinn word is found, to cover cases like "not too bad".