UTF-8 support - Githubissues

kolarski commented 10 years ago

Sadly, the library does not support UTF-8 text. The problem lies here:

getWords : function(doc) {
  if (_(doc).isArray()) {
    return doc;
  }
  var words = doc.split(/\W+/);
  return _(words).uniq();
}

doc.split(/\W+/);

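does not seem to work for non-ASCII text, because in JavaScript \W treats every character outside [A-Za-z0-9_] as a non-word character.

Here is an example with Cyrillic text (as used for Russian):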

"Надежда за обич еп.36 Тест".split(/\W+/);

This returns:

[ "", "36", "" ]

Instead, it should return something like this:

[ "Надежда", "за", "обич", "еп", "36", "Тест"]

I was looking for a fix, but ended up here: http://stackoverflow.com/questions/280712/javascript-unicode-regexes
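
For reference, a minimal sketch of a Unicode-aware replacement for the split above. It assumes a JavaScript engine that supports ES2018 Unicode property escapes (the `u` flag with `\p{...}`), which did not exist when this issue was opened. The name `unicodeWords` is hypothetical and not part of bayes.

```javascript
// Unicode-aware word extraction (sketch, not the library's implementation).
// \p{L} = any letter, \p{M} = combining mark, \p{N} = any number.
var WORD = /[\p{L}\p{M}\p{N}_]+/gu;

// Hypothetical helper: returns the unique words of a string, like getWords.
function unicodeWords(text) {
  var words = text.match(WORD) || []; // match() returns null when nothing matches
  return Array.from(new Set(words));  // drop duplicates, keep first-seen order
}

console.log(unicodeWords("Надежда за обич еп.36 Тест"));
// → [ 'Надежда', 'за', 'обич', 'еп', '36', 'Тест' ]
```

Matching words instead of splitting on non-words also avoids the empty strings that split() produces at the start and end of the input. Scripts that use joiner characters inside words may still need extra handling.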

Hamidreza-Sedigh commented 7 years ago

You're right! I have the same problem with Persian: it returns only the numbers, not the words. @kolarski, did you find a way to solve that?

derhuerst commented 7 years ago

There is word-regex. If that doesn't support Cyrillic, we should probably submit a PR there.

Hamidreza-Sedigh commented 7 years ago

I'm not familiar with regular expressions. Also, Persian is completely different from Cyrillic... Does your npm package "nbayes" support UTF-8?

derhuerst commented 7 years ago

@Hamidreza-Sedigh doesn't look like it. Although it should!

Hamidreza-Sedigh commented 7 years ago

@derhuerst Can you tell me how to use word-regex with bayes ?

derhuerst commented 7 years ago

@Hamidreza-Sedigh

Apparently, bayes allows you to pass in a custom tokenizer, which splits the input into words. The default tokenizer doesn't seem to handle Persian properly.

As a temporary solution, you could pass a custom tokenizer into bayes. It could use word-regex under the hood.

As a permanent solution, I guess @ttezel would be happy to accept a PR that makes Bayesian language analysis easy for languages other than English. 😉

kolarski commented 7 years ago

@Hamidreza-Sedigh I had a PR that added Cyrillic support to defaultTokenizer: https://github.com/ttezel/bayes/pull/2, but in retrospect word-regex seems a much better option because it supports many other languages.

Can you verify that word-regex works for Persian using a custom tokenizer? Something like:

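var bayes = require('bayes');
var regex = require('word-regex')();
var classifier = bayes({
    // match() returns null when nothing matches, so fall back to an empty array
    tokenizer: function (text) { return text.match(regex) || []; }
});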

derhuerst commented 7 years ago

> Can you verify word-regex works for Persian using custom tokenizer.

You seem a lot more familiar with Persian. Also, I don't really have time right now.

Try to run the examples from the docs, but with Persian text, and check if they make sense.

kolarski commented 7 years ago

That was addressed to Hamidreza-Sedigh. I'm not familiar with Persian either.

derhuerst commented 7 years ago

Oops, you're right.

Hamidreza-Sedigh commented 7 years ago

@derhuerst @kolarski Thanks a lot for your help. I couldn't solve it with regex (apparently it only supports English, CJK and Cyrillic), but I finally got it working by adding this code:

var classifier = bayes({
    tokenizer: function (text) { return text.split(' '); }
});

The only problem is that it returns an empty string as a word:

vocabulary: {
    '': true,
    // other words
}

@derhuerst Maybe you're right about submitting a PR to support all languages. Are there any good docs for this package, for example on how to use the options or how to remove stopwords?
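
The empty-string entry most likely comes from leading, trailing, or repeated spaces, since split(' ') emits '' between adjacent separators. A minimal sketch of a whitespace tokenizer that drops empty tokens and, optionally, stopwords follows. The names `whitespaceTokenizer` and `STOPWORDS` and the list contents are hypothetical; bayes itself is only known from this thread to accept a `tokenizer` option.

```javascript
// Hypothetical stopword list, for illustration only; replace with your own.
var STOPWORDS = new Set(['the', 'a', 'of']);

// Splits on runs of whitespace and removes empty tokens and stopwords.
function whitespaceTokenizer(text) {
  return text
    .split(/\s+/) // one split per run of spaces, tabs or newlines
    .filter(function (word) {
      return word !== '' && !STOPWORDS.has(word);
    });
}

console.log(whitespaceTokenizer('  the  quick brown  fox '));
// → [ 'quick', 'brown', 'fox' ]

// Assumed usage, based on the tokenizer option discussed above:
// var classifier = bayes({ tokenizer: whitespaceTokenizer });
```

Splitting on whitespace leaves punctuation attached to words, so "fox." and "fox" would count as different tokens.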

derhuerst commented 7 years ago

@Hamidreza-Sedigh

> I couldn't solve it with regex (apparently it's only support English, CJK and Cyrillic)

Do you feel comfortable adding Persian support to word-regex? I'm confident that they would appreciate this!

> @derhuerst Is there any good Docs for this package like how to use options or how to remove stopwords and other things

I'd suggest making a PR against word-regex. bayes can then use word-regex to tokenise the input.

For an explanation on how a Naive Bayes classifier works, see this video: https://www.youtube.com/watch?v=DdYSMwEWbd4

ttezel / bayes

UTF-8 support #1