Does not work with non-ascii chars

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. Try with any word with non-ascii chars (é,ç,ñ)

What is the expected output? What do you see instead?

When removing the punctuation, non-ascii chars are replaced by spaces.

I've managed to modify it so I can use it for Catalan, but it would be a good idea to make it UTF-8 compliant.

Replace

var self = this, node = this.$domObj.get(0).nodeName, tagExp = '<[^>]+>', 
puncExp = '^\\W|[\\W]+\\W|\\W$|\\n|\\t|\\s{2,}';

With

var lletres = '[^\\wàéèíóòúÀÉÈÍÓÒÚçÇ·ŀ]';
var self = this, node = this.$domObj.get(0).nodeName, tagExp = '<[^>]+>', 
puncExp = '^'+lletres+'|'+lletres+'+'+lletres+'|'+lletres+'$|\\n|\\t|\\s{2,}';
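
For readers hitting the same problem today: the character list above only covers the letters it names. A minimal sketch of a more general approach, assuming a JavaScript engine with ES2018 Unicode property escapes (not available when this issue was filed). `stripPunctuation` is a hypothetical helper, not part of the plugin:

```javascript
// Hypothetical helper, not part of jquery.spellchecker.js.
// Replaces runs of non-word characters with a single space while keeping
// letters, combining marks and digits from any script.
function stripPunctuation(text) {
  // \p{L} = any letter, \p{M} = combining marks, \p{N} = any digit.
  // The middle dot (U+00B7) is kept explicitly because Catalan uses it
  // inside words (col·lecció) and Unicode classifies it as punctuation.
  const nonWord = /[^\p{L}\p{M}\p{N}_\u00b7\s]+/gu;
  return text.replace(nonWord, ' ').replace(/\s{2,}/g, ' ').trim();
}

// ASCII-only \W treats the accented letter as punctuation:
console.log('món'.replace(/\W/g, ' '));      // "m n"
// The Unicode-aware version keeps it:
console.log(stripPunctuation('hola, món!')); // "hola món"
```

Whether to keep apostrophes or hyphens inside words is a per-language decision; this sketch strips both.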

Original issue reported on code.google.com by xavi.ivars on 10 Feb 2010 at 5:48

GoogleCodeExporter commented 8 years ago

Last comment was about jquery.spellchecker.js, at line 62

Original comment by xavi.ivars on 10 Feb 2010 at 5:53

GoogleCodeExporter commented 8 years ago

I also have problems in checkspelling.php, at line 49, related to UTF-8

I can solve them by replacing

$words[] = utf8_encode(html_entity_decode($word));

with

$words[] = html_entity_decode($word);
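
A likely explanation, assuming the words reaching this line are already UTF-8: PHP's utf8_encode() converts ISO-8859-1 to UTF-8, so applying it to UTF-8 input encodes the text a second time. The effect can be sketched in JavaScript (Node.js, using `Buffer`); `latin1ToUtf8` is a hypothetical stand-in for that conversion, not code from the plugin:

```javascript
// Hypothetical stand-in for a Latin-1 -> UTF-8 conversion.
// It reads every input byte as one ISO-8859-1 character and
// re-encodes the resulting string as UTF-8.
function latin1ToUtf8(bytes) {
  return Buffer.from(bytes.toString('latin1'), 'utf8');
}

const alreadyUtf8 = Buffer.from('é', 'utf8');  // 2 bytes: c3 a9
const doubled = latin1ToUtf8(alreadyUtf8);     // 4 bytes: c3 83 c2 a9

console.log(alreadyUtf8.toString('hex'));      // "c3a9"
console.log(doubled.toString('hex'));          // "c383c2a9"
console.log(doubled.toString('utf8'));         // "Ã©"
```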

Original comment by xavi.ivars on 10 Feb 2010 at 5:56

GoogleCodeExporter commented 8 years ago

Maybe using XRegExp for regular expressions instead of the native RegExp object would be a good idea.

http://xregexp.com/

Original comment by xavi.ivars on 10 Feb 2010 at 6:43

GoogleCodeExporter commented 8 years ago

Thank you for letting me know about this (somewhat serious) issue. I need to get a better understanding of how JavaScript regex matches word characters to fix this. Thanks for your code examples and the link; I am currently looking into this.

Original comment by willis...@gmail.com on 10 Feb 2010 at 8:10

Changed state: Accepted

GoogleCodeExporter commented 8 years ago

I've changed how I'm matching non-word characters, and I think it's matching most unicode characters. Please can you let me know if this change works for you.

http://code.google.com/p/jquery-spellchecker/source/detail?r=89

Original comment by willis...@gmail.com on 10 Feb 2010 at 9:22

GoogleCodeExporter commented 8 years ago

Highlighting the words is broken though, I'm working on that now.

Original comment by willis...@gmail.com on 10 Feb 2010 at 9:36

GoogleCodeExporter commented 8 years ago

With revision 89 non-word characters work fine, thanks ;)

There is another problem, with word boundaries (\b): when a word starts/ends with a non-ascii character, RegExp(\b+word+\b) does not work properly.
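
The cause is that \b is defined in terms of the ASCII-only \w class, so an accented letter at the edge of a word counts as a non-word character. One possible workaround, sketched here under the assumption of an ES2018 engine (lookbehind and Unicode property escapes), is to replace \b with explicit lookarounds. `wholeWordRegExp` is a hypothetical helper, not the fix that was committed to the plugin:

```javascript
// Hypothetical helper: build a whole-word matcher that does not rely on \b.
function wholeWordRegExp(word) {
  // Escape regex metacharacters in the word being searched for.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Anything that counts as part of a word: letters, marks, digits, underscore.
  const wordChar = '[\\p{L}\\p{M}\\p{N}_]';
  // Match only when the word is not preceded or followed by a word character.
  return new RegExp(`(?<!${wordChar})${escaped}(?!${wordChar})`, 'gu');
}

// \b misses the standalone word and matches inside a longer one:
console.log(/\bés\b/.test('això és tot'));   // false
console.log(/\bés\b/.test('després'));       // true

// The lookaround version gets both cases right:
console.log(wholeWordRegExp('és').test('això és tot'));  // true
console.log(wholeWordRegExp('és').test('després'));      // false
```

The helper returns a new regular expression on every call, so the `g` flag's `lastIndex` state does not leak between the tests above.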

Original comment by xavi.ivars on 10 Feb 2010 at 10:09

GoogleCodeExporter commented 8 years ago

I've made a change to fix this, although google is not returning the correct data for the example word 'wàéèíóòúÀÉÈÍÓÒÚ'

It works fine with pspell, I'll need to spend more time looking at this issue. 

If you want to test, please use the latest revision, as I committed some debug code by mistake.

Original comment by willis...@gmail.com on 10 Feb 2010 at 11:04

Added labels: Component-Logic, OpSys-All, Priority-High, Usability
Removed labels: Priority-Medium

GoogleCodeExporter commented 8 years ago

You can test using the textarea demo: http://spellchecker.jquery.badsyntax.co.uk/#textarea-example

Using pspell it returns the matched bad word; google returns something different.

Original comment by willis...@gmail.com on 10 Feb 2010 at 11:17

GoogleCodeExporter commented 8 years ago

Thanks! I'm using pspell, so no problems for me with google bug ;)

Anyway if you think I could help you with anything, please tell me ;)

Original comment by xavi.ivars on 10 Feb 2010 at 11:18

GoogleCodeExporter commented 8 years ago

Thanks dude, you've already been a big help! I'll keep this issue ticket open until I'm confident the changes I've made work with a variety of different unicode word characters (not sure the regex range is sufficient at this time).

Original comment by willis...@gmail.com on 11 Feb 2010 at 9:46

GoogleCodeExporter commented 8 years ago

Have you managed to fix the bug with Google?

Could it be related to "strlen" in line 111 of checkspelling.php?

Original comment by xavi.ivars on 11 Feb 2010 at 11:00

GoogleCodeExporter commented 8 years ago

It makes sense that that's where the problem is, but I've been unable to come up with a solution; mb_strlen() also doesn't do it. I'll need to spend more time on this, I'm not experienced with encodings :/

Original comment by willis...@gmail.com on 12 Feb 2010 at 12:08

GoogleCodeExporter commented 8 years ago

At rev142 and using pspell, all the words having french accents (éè etc) would get marked as having a mistake and all the suggestions with accents would show up as null in the popup.

To fix this, I modified the two following lines

40:  exit(json_encode(array_map('utf8_encode', pspell_suggest($pspell_link,
utf8_decode(urldecode($suggest))))));

45:  foreach($text = explode(' ', utf8_decode(urldecode($text))) as $word) {

It seems to be working.

Original comment by fmail...@gmail.com on 14 Apr 2010 at 9:32

GoogleCodeExporter commented 8 years ago

I was having issues with utf-8 characters and pspell. I modified the following 
two lines in checkspelling.php:

36: $pspell_link = pspell_new_personal($this->pspell_personal_dictionary, 
$this->lang,"","","utf-8");

57: @pspell_add_to_personal($pspell_link, 
utf8_decode(strtolower($addtodictionary))) or die('You can\'t add a word to the 
dictionary that contains any punctuation.');

Original comment by gustaf...@gmail.com on 20 Aug 2010 at 4:14

GoogleCodeExporter commented 8 years ago

One more fix. I also had to modify lines 39-41:

if (isset($suggest)) {
  $suggestions = pspell_suggest($pspell_link, urldecode($suggest));
  foreach($suggestions as $k => $val)
    $suggestions[$k] = htmlentities($val,ENT_NOQUOTES,'UTF-8');
  exit(json_encode($suggestions));  
}

Original comment by gustaf...@gmail.com on 20 Aug 2010 at 4:36

GoogleCodeExporter commented 8 years ago

Re the google issue, it seems that substr doesn't handle unicode characters (it counts bytes, not characters), so the badword box gets the wrong offset for the substr and displays a funny selection.

$word = substr($text, $word[1], $word[2]);

I seem to have got around this with:

$word = substr(utf8_decode($text), $word[1], $word[2]);

While this may cause some encoding issue in the actual $text variable, it seems to allow for correct calculation of substr.

Alternatively this works as well:

$word = mb_substr($text, $word[1], $word[2], 'utf8');

but as mbstring is a non-default PHP extension, it may not be installed everywhere, and without it the mbstring functions are unavailable.
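
To illustrate the mismatch, here is a minimal JavaScript sketch (Node.js, using `Buffer`), assuming the offsets returned by the Google service count characters rather than UTF-8 bytes, which the mb_substr result above suggests. `sliceByBytes` and `sliceByChars` are hypothetical helpers that mimic a byte-based substr and a character-based mb_substr:

```javascript
// Byte-based slice: offset and length count UTF-8 bytes.
function sliceByBytes(text, offset, length) {
  const bytes = Buffer.from(text, 'utf8');
  return bytes.subarray(offset, offset + length).toString('utf8');
}

// Character-based slice: offset and length count Unicode code points.
function sliceByChars(text, offset, length) {
  return Array.from(text).slice(offset, offset + length).join('');
}

const text = 'café tset';
// "tset" starts at character 5, but at byte 6, because "é" takes two bytes.
console.log(sliceByChars(text, 5, 4)); // "tset"
console.log(sliceByBytes(text, 5, 4)); // " tse"
console.log(sliceByBytes(text, 6, 4)); // "tset"
```

Every multi-byte character before the misspelled word shifts a byte-based selection further to the left, which matches the "funny selection" described above.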

Original comment by andrew.m...@evolvingmedia.co.uk on 10 May 2011 at 12:45

GoogleCodeExporter commented 8 years ago

Hi, 
I am facing a problem with '&' (ampersand): when my text contains the character '&', the Google spell check does not work, but it works fine when I remove the '&' from the text.

Spell checking does not work for the text below:
Madison Short Sleeve Organic Babydoll Tee Shirt &  Victory Pump

Spell checking works fine after removing the '&':
Madison Short Sleeve Organic Babydoll Tee Shirt  Victory Pump

Please let me know how I can fix this issue.

Thanks
Srinivas Gade
Email:gsrinivas2186@gmail.com

Original comment by gsriniva...@gmail.com on 22 Nov 2011 at 1:18

GoogleCodeExporter commented 8 years ago

I've been unable to replicate any of the issues mentioned in this ticket using 
the new version of the plugin, which can be found here: 
https://github.com/badsyntax/jquery-spellchecker

Please can you test here: http://jquery-spellchecker.badsyntax.co/textarea.html

If you find any issues please create a ticket in the new issue tracker here: 
https://github.com/badsyntax/jquery-spellchecker/issues

Original comment by willis...@gmail.com on 20 Oct 2012 at 4:47

Changed state: Done

shankarsh1 / jquery-spellchecker

Does not work with non-ascii chars #5