UTF-8 support - Githubissues

Sadly does not support UTF-8. The problem lies here:

getWords : function(doc) {
    if (_(doc).isArray()) {
      return doc;
    }
    var words = doc.split(/\W+/);
    return _(words).uniq();
  }

doc.split(/\W+/);

does not seem to work for UTF-8

Here is an example with Cyrilic language (like Russian):

"Надежда за обич еп.36 Тест".split(/\W+/);

This returns:

[ "", "36", "" ]

Instead should return something like this:

[ "Надежда", "за", "обич", "еп", "36", "Тест"]

harthur / classifier