Closed kolarski closed 10 years ago
You're right! I have the same problem with Persian: it returns only the numbers, not the words. @kolarski, did you find a way to solve that?
There is word-regex. If that doesn't support Cyrillic, we should probably submit a PR there.
I'm not familiar with regular expressions. Also, Persian is completely different from Cyrillic... Does your npm package "nbayes" support UTF-8?
@Hamidreza-Sedigh doesn't look like it. Although it should!
@derhuerst Can you tell me how to use word-regex with bayes?
@Hamidreza-Sedigh
Apparently, bayes allows you to pass in a custom tokenizer, which basically splits the input into words. The default tokenizer doesn't seem to treat Persian properly.
As a temporary solution, you could pass a custom tokenizer into bayes. It could use word-regex under the hood.
As a permanent solution, I guess @ttezel would be happy to accept a PR that makes Bayesian language analysis easy for more than just English. 😉
@Hamidreza-Sedigh
I had a PR which added Cyrillic support to defaultTokenizer: https://github.com/ttezel/bayes/pull/2 , but in retrospect word-regex seems a much better option because it supports many other languages.
Can you verify that word-regex works for Persian using a custom tokenizer? Something like:
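var bayes = require('bayes');
var regex = require('word-regex')();
var classifier = bayes({
  // match() returns null when nothing matches, so fall back to an empty array
  tokenizer: function (text) { return text.match(regex) || []; }
});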
> Can you verify word-regex works for Persian using custom tokenizer.
You seem a lot more familiar with Persian. Also, I don't really have time right now.
Try to run the examples from the docs, but with Persian text, and check if they make sense.
That was referring to Hamidreza-Sedigh. I'm also not familiar with Persian.
Oops, you're right.
@derhuerst @kolarski thanks a lot for your help.
I couldn't solve it with the regex (apparently it only supports English, CJK and Cyrillic), but I finally got it working by adding this code:
var classifier = bayes({ tokenizer: function(text) { return text.split(' '); } });
The only problem is that it returns an empty string as a word:
vocabulary: { '': true, /* other words */ }
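The empty string most likely shows up because split(' ') yields '' for leading, trailing, or repeated spaces. A minimal sketch of a tokenizer that avoids this, assuming the input is whitespace-delimited; the name whitespaceTokenizer is hypothetical and not part of bayes:

```javascript
// Hypothetical helper, not part of bayes: split on any run of whitespace
// and drop empty strings so '' never ends up in the vocabulary.
function whitespaceTokenizer(text) {
  return text.split(/\s+/).filter(function (token) {
    return token.length > 0;
  });
}

// Possible usage, assuming the tokenizer option discussed in this thread:
// var classifier = bayes({ tokenizer: whitespaceTokenizer });
```

This still treats punctuation as part of a word, so it is only a stopgap until a proper word regex covers Persian.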
@derhuerst Maybe you're right about opening a PR to support all languages.
Are there any good docs for this package, for example on how to use the options or how to remove stopwords?
@Hamidreza-Sedigh
> I couldn't solve it with the regex (apparently it only supports English, CJK and Cyrillic)
Do you feel comfortable adding Persian support to word-regex? I'm confident that they would appreciate this!
> @derhuerst Are there any good docs for this package, for example on how to use the options or how to remove stopwords?
I'd suggest making a PR against word-regex. bayes can then use word-regex to tokenise the input.
For an explanation on how a Naive Bayes classifier works, see this video: https://www.youtube.com/watch?v=DdYSMwEWbd4
Sadly, bayes does not support UTF-8. The problem lies here:
does not seem to work for UTF-8
Here is an example with a Cyrillic language (like Russian):
This returns:
Instead should return something like this:
I was looking for a fix, but ended up here: http://stackoverflow.com/questions/280712/javascript-unicode-regexes
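For context on why only numbers survive: in JavaScript, \w matches just [A-Za-z0-9_], so any tokenizer that strips "non-word" characters with it erases Cyrillic or Persian letters. Engines with ES2018 Unicode property escapes (for example Node.js 10 or later) offer \p{L} with the u flag instead. The sketch below compares the two approaches. Both function names are hypothetical, asciiTokenizer is an illustration rather than a copy of the bayes source, and including the zero-width non-joiner (U+200C) for Persian is an assumption:

```javascript
// Illustrative ASCII-only tokenizer (not copied from bayes): everything that
// is not \w or whitespace is replaced by a space before splitting.
function asciiTokenizer(text) {
  return text.replace(/[^\w\s]/g, ' ').split(/\s+/).filter(Boolean);
}

// Hypothetical Unicode-aware alternative: match runs of letters (\p{L}),
// combining marks (\p{M}) and digits (\p{N}) in any script. U+200C is
// included on the assumption that it should stay inside Persian words.
function unicodeTokenizer(text) {
  // match() returns null when nothing matches, so fall back to an empty array
  return text.match(/[\p{L}\p{M}\p{N}\u200C]+/gu) || [];
}

console.log(asciiTokenizer('Привет мир 42'));   // only the number survives
console.log(unicodeTokenizer('Привет мир 42')); // words and the number
```

Either function could be passed as the custom tokenizer, which keeps the workaround outside the library until a fix lands upstream.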