mrmarbles / wordsworth

Basic spell-checker / spelling correcter module for Nodejs.

Synchronous initialization #1

Closed erelsgl closed 11 years ago

erelsgl commented 11 years ago

Currently, the initialization is asynchronous, so, if I want to use wordsworth in my application, I have to put my entire application into the callback function...

Is there a way to initialize the speller synchronously?

mrmarbles commented 11 years ago

Wordsworth relies on other libraries to read and parse the seed and training files, and those reads are asynchronous, so initialization is currently not synchronous. However, Node offers a handful of asynchronous flow-control options. Here are a couple of quick-and-dirty examples of how you might accomplish your objective:

  1. Use the series() method of the async library (https://github.com/caolan/async) to ensure initialization completes without embedding your entire application in the callback. I have not tested the following code, but I think it would look roughly like this:
var async = require('async'),
  sp = require('wordsworth').getInstance();

async.series({
  one: function(callback) {
    sp.initialize('../data/en_US/seed.txt', '../data/en_US/training.txt', function() {
      // signal async.series that this step has finished
      callback();
    });
  }
}, function() {
  // this callback is called when wordsworth is initialized and trained
  // put app code in here
});
  2. Use an EventEmitter, which is native to Node.js in the "events" module. You could do something like this:
var events = require('events'),
  em = new events.EventEmitter(),
  sp = require('wordsworth').getInstance();

em.on('initialized', function() {
  // place your app code here - or some reference to it
});

sp.initialize('../data/en_US/seed.txt', '../data/en_US/training.txt', function() {
  em.emit('initialized');
});

Again, these are only two possibilities to get you moving in the right direction. There are also other libraries and modules that manage flow control in the form of Promises and Futures, which you might want to check out.
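
As one illustration of the Promise-based route mentioned above, the callback-style initialize(seedPath, trainingPath, callback) call can be wrapped in a Promise. This is a minimal sketch, not part of wordsworth: initializeAsync is a hypothetical helper name, and fakeSpeller is a stand-in object so the snippet runs without the library or its data files. It also assumes the callback receives no error argument, as in the examples above.

```javascript
// Hypothetical helper (not part of wordsworth): wraps any object exposing
// initialize(seedPath, trainingPath, callback) in a Promise.
function initializeAsync(speller, seedPath, trainingPath) {
  return new Promise(function (resolve) {
    speller.initialize(seedPath, trainingPath, function () {
      resolve(speller);
    });
  });
}

// Stand-in for require('wordsworth').getInstance(), so the sketch is runnable
// on its own. It only simulates an asynchronous initialization.
var fakeSpeller = {
  ready: false,
  initialize: function (seedPath, trainingPath, callback) {
    var self = this;
    setImmediate(function () {
      self.ready = true;
      callback();
    });
  }
};

initializeAsync(fakeSpeller, '../data/en_US/seed.txt', '../data/en_US/training.txt')
  .then(function (sp) {
    // app code that depends on the initialized speller goes here
    console.log('initialized:', sp.ready);
  });
```

Note that the dependent code still runs inside the then() callback, so this changes how the flow is expressed rather than making initialization synchronous.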

erelsgl commented 11 years ago

Thanks, but these schemes still require putting the entire main program in a function.

What do you think about the following function, which accepts arrays instead of files:

SpellChecker.prototype.initializeSync = function(seedWords, trainSentences) {
    var self = this;
    seedWords.forEach(function(word) {
        self.understand(word);
        self.train(word);
    });
    trainSentences.forEach(function(sentence) {
        self.train(sentence);
    });
};

I think it is useful to have an initialization from arrays anyway.

erelsgl commented 11 years ago

This function can be used in the following way:

    spellchecker.initializeSync(
            fs.readFileSync(path.join(base,'seed.txt'),'utf-8').split("\n"),
            fs.readFileSync(path.join(base,'training.txt'),'utf-8').split("\n")
    );
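
One caveat with split("\n") is that a trailing newline yields an empty last entry, and files with Windows line endings keep a "\r" on each word. A small helper can clean this up before the arrays are passed to initializeSync. This is only a sketch: readLinesSync is a hypothetical name, and the temporary file stands in for seed.txt so the snippet runs on its own.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// Hypothetical helper: reads a text file synchronously and returns its
// non-empty lines, with surrounding whitespace (including '\r') trimmed.
function readLinesSync(file) {
  return fs.readFileSync(file, 'utf-8')
    .split(/\r?\n/)
    .map(function (line) { return line.trim(); })
    .filter(function (line) { return line.length > 0; });
}

// Demonstration with a temporary file standing in for seed.txt.
var demoFile = path.join(os.tmpdir(), 'wordsworth-seed-demo.txt');
fs.writeFileSync(demoFile, 'apple\r\nbanana\n\ncherry\n');
console.log(readLinesSync(demoFile)); // [ 'apple', 'banana', 'cherry' ]
fs.unlinkSync(demoFile);
```

With such a helper, the call above would become spellchecker.initializeSync(readLinesSync(seedPath), readLinesSync(trainingPath)), where seedPath and trainingPath are placeholders for the two file paths.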

erelsgl commented 11 years ago

[QUESTION: While I am at it, why is it useful to separate the "seeds" from the "training"? Shouldn't all words in the training text be considered "known"?]

mrmarbles commented 11 years ago

The initializeSync method you've specified above looks like it would certainly do the trick! I would be happy to experiment with it, but I may not be able to get any solid work done on it right away. If you feel so inclined, feel free to fork the repository and implement (and test) it on your end. If it ends up working the way you like, submit a pull request and I'll merge it into the main line. I can see it being a useful addition to the wordsworth API.

Currently the differences between seeding and training the model are subtle. After some thought, I agree that, generally speaking, words in the training text should be "known" to the internal dictionary. However, I wouldn't expect a reasonably sized training text to contain every word in the given language, and covering those words is the purpose of the seed. Of course, the counter-argument, to further your original point, is that if the training text covers enough of the language to produce a somewhat reliable probability model, then it may not matter in terms of usability that some words in the language are not "known".
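
To make the distinction concrete, here is a toy sketch. It is not wordsworth's actual implementation: ToySpeller, exists and frequency are hypothetical names, and understand and train only borrow their names from the snippet earlier in this thread. It keeps a set of "known" words (fed by the seed) separate from word-frequency counts (fed by the training text).

```javascript
// Toy model for illustration only; not wordsworth's internals.
function ToySpeller() {
  this.known = new Set();   // dictionary of accepted words
  this.counts = new Map();  // word -> occurrences seen in training text
}

// Seeding: mark a word as known without saying anything about frequency.
ToySpeller.prototype.understand = function (word) {
  this.known.add(word.toLowerCase());
};

// Training: count every word in a sentence to estimate how common it is.
ToySpeller.prototype.train = function (sentence) {
  var words = sentence.toLowerCase().match(/[a-z']+/g) || [];
  words.forEach(function (w) {
    this.counts.set(w, (this.counts.get(w) || 0) + 1);
  }, this);
};

ToySpeller.prototype.exists = function (word) {
  return this.known.has(word.toLowerCase());
};

ToySpeller.prototype.frequency = function (word) {
  return this.counts.get(word.toLowerCase()) || 0;
};

var toy = new ToySpeller();
['cat', 'hat', 'zebra'].forEach(function (w) { toy.understand(w); });
toy.train('The cat sat on the hat. The cat slept.');

console.log(toy.exists('zebra'), toy.frequency('zebra')); // true 0
console.log(toy.exists('sat'), toy.frequency('sat'));     // false 1
console.log(toy.frequency('cat'));                         // 2
```

Here "zebra" is known but has no frequency data, while "sat" has frequency data but is not known. Whether train should also mark words as known is exactly the design question raised above; the sketch keeps the two apart only to show the difference.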

erelsgl commented 11 years ago

OK, I will fork and add it.

mrmarbles commented 11 years ago

The erelsgl-master branch has been integrated.