harthur / classifier

Bayesian classifier with Redis backend
MIT License

Add Cyrillic Support #6

kolarski closed 7 months ago

kolarski commented 10 years ago

Sadly, it does not support non-ASCII text such as Cyrillic. The problem lies here:

getWords : function(doc) {
  if (_(doc).isArray()) {
    return doc;
  }
  var words = doc.split(/\W+/);
  return _(words).uniq();
}
doc.split(/\W+/) does not work for non-ASCII text, because in JavaScript \W matches every character outside [A-Za-z0-9_].

Here is an example in a language written in Cyrillic (like Russian):

"Надежда за обич еп.36 Тест".split(/\W+/);

This returns:

[ "", "36", "" ]

Instead, it should return something like this:

[ "Надежда", "за", "обич", "еп", "36", "Тест"]

A fix is provided below:

Replace

/\W+/

with

/[^a-zA-ZА-Яа-я0-9_]+/

for Cyrillic support.
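
For illustration, here is a minimal sketch of how the patched tokenizer might look as a standalone function. The name getWordsCyrillic is hypothetical, and the underscore calls from the original are swapped for plain Array methods so the snippet runs on its own. Two things are assumptions beyond the proposed fix: empty strings produced by split() are dropped, and Ё/ё are added explicitly, since they sit outside the А-я range (U+0410 to U+044F).

```javascript
// Hypothetical standalone version of getWords using the proposed character class.
// \u0410-\u044F covers Cyrillic А-Я and а-я; \u0401 and \u0451 are Ё and ё,
// which fall outside that range (an addition, not part of the original fix).
var SEPARATORS = /[^a-zA-Z0-9_\u0410-\u044F\u0401\u0451]+/;

function getWordsCyrillic(doc) {
  if (Array.isArray(doc)) {
    return doc;
  }
  var words = doc.split(SEPARATORS).filter(function (word) {
    return word.length > 0; // split() yields "" at leading/trailing separators
  });
  // Keep only the first occurrence of each word, like _.uniq in the original.
  return words.filter(function (word, index) {
    return words.indexOf(word) === index;
  });
}

console.log(getWordsCyrillic("Надежда за обич еп.36 Тест"));
```

Writing the ranges as \u escapes avoids any confusion between the Latin "A" and the visually identical Cyrillic "А" when the pattern is copied around.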

tomayac commented 10 years ago

@kolarski While this fixes your concrete problem, it would be far more scalable to switch to xregexp: https://github.com/slevithan/xregexp#unicode, where you have proper "letter" classes.
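
To show what "letter" classes buy you, here is a rough sketch that uses the Unicode property escapes built into JavaScript regular expressions since ES2018, rather than XRegExp itself. This is a swapped-in technique, chosen only so the snippet has no dependencies; it was not available when this thread was written. The function name tokenizeUnicode is hypothetical.

```javascript
// \p{L} matches a letter and \p{N} a number in any script;
// the "u" flag is required for property escapes to work.
const NON_WORD = /[^\p{L}\p{N}_]+/u;

function tokenizeUnicode(text) {
  // Drop the empty strings split() produces at leading/trailing separators.
  return text.split(NON_WORD).filter((word) => word.length > 0);
}

console.log(tokenizeUnicode("Надежда за обич еп.36 Тест"));
console.log(tokenizeUnicode("東京, Москва 2024"));
```

With a script-agnostic class like this, no per-alphabet ranges need to be maintained.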

kolarski commented 10 years ago

Agreed, that is probably the best solution.

harthur commented 10 years ago

Thanks for the pull request.

I'm no longer actively maintaining this repo. Try natural's Bayesian classifier for an alternative.