Closed kolarski closed 10 years ago
Sadly, it does not support UTF-8 text. The problem lies here:
getWords : function(doc) {
  if (_(doc).isArray()) { return doc; }
  var words = doc.split(/\W+/);
  return _(words).uniq();
}
doc.split(/\W+/);
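does not work for non-ASCII text: in JavaScript, \W matches every character outside [A-Za-z0-9_], so Cyrillic letters are treated as separators.
Here is an example in a Cyrillic-script language (like Russian):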
"Надежда за обич еп.36 Тест".split(/\W+/);
This returns:
[ "", "36", "" ]
Instead, it should return something like this:
[ "Надежда", "за", "обич", "еп", "36", "Тест"]
I was looking for a fix, but ended up here: http://stackoverflow.com/questions/280712/javascript-unicode-regexes
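One possible direction, as a minimal sketch rather than the library's actual fix: engines that support Unicode property escapes (ES2018 and later) can split on anything that is not a letter or digit in any script. The function name `getWordsUnicode` is hypothetical, and the sketch uses a plain `Set` in place of the underscore `uniq` call so it runs on its own. It does not treat combining marks (`\p{M}`) as word characters, which may matter for some scripts.

```javascript
// Hypothetical Unicode-aware variant of getWords (name is illustrative only).
// Assumes an engine with Unicode property escapes (the "u" flag plus \p{...}).
function getWordsUnicode(doc) {
  if (Array.isArray(doc)) {
    return doc;
  }
  // Split on runs of characters that are not letters, digits, or "_" in any script,
  // then drop the empty strings produced by leading or trailing separators.
  var words = doc.split(/[^\p{L}\p{N}_]+/u).filter(function (w) {
    return w.length > 0;
  });
  // Deduplicate while keeping first-seen order.
  return Array.from(new Set(words));
}

console.log(getWordsUnicode("Надежда за обич еп.36 Тест"));
// [ 'Надежда', 'за', 'обич', 'еп', '36', 'Тест' ]
```

On older engines without the `u` flag this pattern throws a syntax error, so a fallback such as an explicit character-range class would be needed there.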