harthur / classifier

Bayesian classifier with Redis backend
MIT License

UTF-8 support #5

Closed kolarski closed 10 years ago

kolarski commented 10 years ago

Sadly, the classifier does not support UTF-8 (non-ASCII) text. The problem lies here:

getWords : function(doc) {
  if (_(doc).isArray()) {
    return doc;
  }
  var words = doc.split(/\W+/);
  return _(words).uniq();
}
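The call doc.split(/\W+/) does not work for non-ASCII text: in JavaScript, \W matches every character outside [A-Za-z0-9_], so letters from other scripts are treated as separators.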

Here is an example in a Cyrillic-script language (like Russian):

"Надежда за обич еп.36 Тест".split(/\W+/);

This returns:

[ "", "36", "" ]

Instead, it should return something like this:

[ "Надежда", "за", "обич", "еп", "36", "Тест"]

I was looking for a fix, but ended up here: http://stackoverflow.com/questions/280712/javascript-unicode-regexes
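One possible direction, sketched under assumptions: engines that support ES2018 Unicode property escapes (the `u` flag with `\p{...}`) can split on anything that is not a letter, combining mark, digit, or underscore in any script. The standalone `getWords` below is a hypothetical replacement, not the library's actual fix. It uses `Array.isArray` and `Set` in place of the Underscore calls and, unlike the original, drops empty strings.

```javascript
// Hypothetical Unicode-aware tokenizer; assumes an engine with ES2018
// Unicode property escapes (for example, a recent Node.js release).
function getWords(doc) {
  // Already tokenized input is passed through unchanged, as in the original.
  if (Array.isArray(doc)) {
    return doc;
  }
  // Split on runs of characters that are not letters (\p{L}),
  // combining marks (\p{M}), numbers (\p{N}), or underscore.
  const words = doc
    .split(/[^\p{L}\p{M}\p{N}_]+/u)
    .filter(Boolean); // drop empty strings from leading/trailing separators
  // Deduplicate while preserving first-seen order.
  return [...new Set(words)];
}

console.log(getWords("Надежда за обич еп.36 Тест"));
// [ 'Надежда', 'за', 'обич', 'еп', '36', 'Тест' ]
```

On engines without property escapes, the same idea would need an explicit character range or a library that expands Unicode categories into ranges, along the lines discussed in the linked Stack Overflow question.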