erelsgl closed 11 years ago
Wordsworth relies on other libraries to read and parse the seed and training files, and those reads are themselves asynchronous operations, so initialization is currently not synchronous. However, there are a handful of asynchronous flow-control options available in Node. Here are a couple of very quick and dirty examples of how you might accomplish your objective:
// Option 1: async.series
var async = require('async'),
    sp = require('wordsworth').getInstance();
async.series({
    one: function(callback) {
        sp.initialize('../data/en_US/seed.txt', '../data/en_US/training.txt', function() {
            callback();
        });
    }
}, function() {
    // this callback is called when wordsworth is initialized and trained
    // put app code in here
});
// Option 2: signal readiness with an EventEmitter
var events = require('events'),
    em = new events.EventEmitter(),
    sp = require('wordsworth').getInstance();
em.on('initialized', function() {
    // place your app code here - or some reference to it
});
sp.initialize('../data/en_US/seed.txt', '../data/en_US/training.txt', function() {
    em.emit('initialized');
});
Again, these are only two possibilities to get you moving in the right direction. There are also other libraries and modules that manage flow control in the form of Promises and Futures, which you might also want to check out.
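For what it's worth, the Promise route could look roughly like the sketch below. `initializeAsPromise` is a hypothetical helper, not part of wordsworth, and it assumes only what the examples above show: that `initialize(seedPath, trainingPath, callback)` calls its callback with no arguments once training has finished.

```javascript
// Hypothetical helper (not part of wordsworth): wraps a callback-style
// initialize(seedPath, trainingPath, callback) call in a Promise.
function initializeAsPromise(checker, seedPath, trainingPath) {
    return new Promise(function(resolve) {
        checker.initialize(seedPath, trainingPath, function() {
            // Resolve with the checker so it can be used in .then()
            resolve(checker);
        });
    });
}

// Usage (assumes the wordsworth module is installed):
// var sp = require('wordsworth').getInstance();
// initializeAsPromise(sp, '../data/en_US/seed.txt', '../data/en_US/training.txt')
//     .then(function(checker) { /* app code here */ });
```

Note that the app code still runs inside a function passed to `.then()`; a Promise flattens the nesting but does not make initialization synchronous.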
Thanks, but these schemes still require putting the entire main program inside a function.
What do you think about the following function, which accepts arrays instead of files:
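SpellChecker.prototype.initializeSync = function(seedWords, trainSentences) {
    var self = this;
    // every seed word is added to the dictionary and also trained on
    seedWords.forEach(function(word) {
        self.understand(word);
        self.train(word);
    });
    // training sentences only feed the probability model
    trainSentences.forEach(function(sentence) {
        self.train(sentence);
    });
};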
I think it is useful to have an initialization from arrays anyway.
This function can be used in the following way:
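var fs = require('fs'),
    path = require('path');
// base is assumed to be the directory containing seed.txt and training.txt
spellchecker.initializeSync(
    fs.readFileSync(path.join(base, 'seed.txt'), 'utf-8').split("\n"),
    fs.readFileSync(path.join(base, 'training.txt'), 'utf-8').split("\n")
);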
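One caveat with a plain `split("\n")`: a file that ends with a newline yields a trailing empty string, and a file with Windows line endings leaves a `\r` on every line. Whether that matters to `understand` and `train` is not something I have checked, but a small helper avoids the question. `readLines` below is a hypothetical name, sketched only to show the idea:

```javascript
var fs = require('fs');

// Hypothetical helper: read a text file synchronously and return its
// non-empty lines, tolerating both \n and \r\n line endings.
function readLines(filePath) {
    return fs.readFileSync(filePath, 'utf-8')
        .split(/\r?\n/)
        .map(function(line) { return line.trim(); })
        .filter(function(line) { return line.length > 0; });
}
```

The arrays it returns could then be passed to `initializeSync` in place of the raw `split("\n")` results.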
[QUESTION: While I am at it, why is it good to separate the "seeds" from the "training"? Shouldn't all words in the training text be considered "known"?]
The initializeSync method you've specified above looks like it would certainly do the trick! I would be happy to experiment with it, but I may not be able to get any solid work done on it right away. If you feel so inclined, feel free to fork the repository and implement (and test) it on your end. If it ends up working the way you like, submit a pull request and I'll merge it into the main line. I can see it being a useful addition to the wordsworth API.
Currently the differences between seeding and training the model are subtle, though after some thought I consider your statement to be generally true: words in the training text should be "known" to the internal dictionary. However, I wouldn't expect a reasonably sized set of training text to contain every word of the given language, and covering that gap is the purpose of the seed. Of course the counter-argument, to further your original point, is that if the training text expresses enough of the language to produce a somewhat reliable probability model, then it may not matter in terms of usability that not every word of the language is "known".
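To make the distinction concrete, here is a toy model. It is for illustration only and is not wordsworth's implementation: `ToyModel`, `known` and `counts` are made-up names, and the real `understand` and `train` may well behave differently. In this sketch, seeding marks a word as known without any frequency information, while training both counts words and marks them as known.

```javascript
// Toy model for illustration only; not wordsworth's implementation.
function ToyModel() {
    this.known = Object.create(null);   // dictionary: words accepted as spelled correctly
    this.counts = Object.create(null);  // frequency model: how often each word was seen
}

// Seeding: mark a word as known without saying how common it is.
ToyModel.prototype.understand = function(word) {
    this.known[word] = true;
};

// Training: count each word of a sentence and treat it as known too.
ToyModel.prototype.train = function(sentence) {
    var self = this;
    sentence.toLowerCase().split(/\s+/).forEach(function(word) {
        if (!word) { return; }
        self.known[word] = true;
        self.counts[word] = (self.counts[word] || 0) + 1;
    });
};

var model = new ToyModel();
model.understand('zebra');            // seeded, never seen in training text
model.train('the cat saw the dog');   // trained words become known as well
```

Here "zebra" is accepted as a valid word even though the training text never mentions it, which is the gap the seed is meant to cover.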
OK, I will fork and add
erelsgl-master branch has been integrated.
Currently, the initialization is asynchronous, so, if I want to use wordsworth in my application, I have to put my entire application into the callback function...
Is there a way to initialize the speller synchronously?