Pod::Spell perl module
https://metacpan.org/pod/Pod::Spell

Unicode words improperly encoded #11

Closed by dagolden 10 years ago

dagolden commented 11 years ago

This is a continuation of discussion in issue #9.

What I found is that Test::Spelling gives Pod::Spell a file handle opened to an in-memory string, but does not set the :utf8 layer on it. Thus, when Pod::Spell prints a Unicode character to the handle, I think it goes in as whatever Perl's internal representation is. That might be a byte from 0x80 to 0xff, or it could be UTF-8. My copy of aspell doesn't bother to report the weirdly encoded word.
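A minimal sketch of that setup (the file name and variable names are illustrative, not Test::Spelling's actual code):

use strict;
use warnings;
use Pod::Spell;

open my $in, '<', 'lib/Foo.pm' or die $!;    # hypothetical module with POD

my $scratch = '';
open my $out, '>', \$scratch or die $!;      # in-memory handle, no :utf8 layer

Pod::Spell->new->parse_from_filehandle( $in, $out );
close $out;

# Without binmode( $out, ':utf8' ), a non-ASCII word lands in $scratch as
# whatever Perl's internal representation was: latin-1 octets or UTF-8
# bytes, depending on how the source string happened to be stored.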

When I hotpatched Test::Spelling to set :utf8, then I got the same spelling error on virtù that was mentioned in #9. So, on the one hand, we need to add virtù as a stopword, but the bigger problem is that we need to fix what we do with encodings.

I'm not sure if we should UTF-8 encode output ourselves (or set :utf8 on the handle ourselves) or if Test::Spelling should be fixed. I'm not sure what the spellcheckers are expecting on STDIN and whether or not that depends on the current locale of the user.

So... I'm not sure what the answer is.

As a bit of good news, I actually looked at unique "words" found by Pod::Spell on its own docs and have some tweaks for the word munging to fix up some things.

dagolden commented 11 years ago

@pghmcfc, I couldn't replicate your error on v5.14.4. What is your locale set to? I'm wondering if that has an effect on things.

xenoterracide commented 11 years ago

Note: there's a virt stopword that I believe was added simply for virtù. I've also been digging for the encoding bugs mentioned elsewhere...

xenoterracide commented 11 years ago

https://github.com/sartak/test-spelling/pull/3 may have relevant encoding information (part of the reason I adopted this module was the hope of eventually fixing this).

dagolden commented 11 years ago

Aspell has an --encoding option, but defaults to locale. Possibly others do as well. If we set :utf8 on the output handle and document that, and if Test::Spelling sets the right encoding option to UTF-8, that might even handle people with non-UTF-8 locales.

Another moving part is IPC::Run3 pumping that string to STDIN. I have no idea if that affects encoding. I think that if we set :utf8 on the handle, then the string just winds up with octets and gets passed to STDIN as octets and that's what we want.
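For reference, that call looks roughly like this (a sketch using IPC::Run3's documented interface; the word list and command line are illustrative):

use strict;
use warnings;
use Encode qw(encode_utf8);
use IPC::Run3 qw(run3);

# Stand-in for Pod::Spell's output, already encoded to UTF-8 octets:
my $words = encode_utf8("virt\x{f9}\n");    # "virtù"
my ( $misspelled, $errors );

# run3 sends the scalar's bytes straight to the child's STDIN, so UTF-8
# octets should arrive unmodified, which is what we want here.
run3 [ 'aspell', '--encoding=utf-8', 'list' ], \$words, \$misspelled, \$errors;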

Highlighting @sartak and @karenetheridge to consider Test::Spelling implications.

dagolden commented 11 years ago

I've created a branch topic/utf8 with a test that demonstrates the problem. Interestingly, it only really failed when I added Japanese, which changes the internal representation spewed to the output handle.
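The gist of the test is something like this (a hedged reconstruction, not the branch's actual test file):

use strict;
use warnings;
use utf8;
use Test::More tests => 1;
use Encode qw(encode_utf8 decode_utf8);
use Pod::Spell;

# POD with both a latin-1-range word and wide (Japanese) characters
my $pod = encode_utf8("=pod\n\nvirtù 日本語\n\n=cut\n");
open my $in, '<:utf8', \$pod or die $!;

my $scratch = '';
open my $out, '>', \$scratch or die $!;
binmode $out, ':utf8';    # the proposed fix

Pod::Spell->new->parse_from_filehandle( $in, $out );
close $out;

like decode_utf8($scratch), qr/virtù/, 'wide characters survive the round trip';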

I suspect that just calling binmode to set :utf8 on the output handle will "fix" this, but then Test::Spelling needs to tell spell checkers to expect UTF-8 on STDIN and ignore locale.

I'm going out for a while, so I'll check back later.

karenetheridge commented 11 years ago

For aspell: aspell --encoding=utf-8 -c <file>

ispell can only handle latin1 (the documentation explicitly states that non-European character sets aren't supported: http://mintaka.sdsu.edu/GF/bibliog/ispell/multi.html), so if ispell is configured as the preferred checker, the best we can do is raise a warning if there are characters outside that range in the stream.

hunspell seems to be in the same boat as ispell -- http://sourceforge.net/projects/hunspell/files/Hunspell/Documentation/

pghmcfc commented 11 years ago

hunspell has a -i option to specify the input encoding; doesn't that help?
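Something like this, presumably (an illustrative invocation against a hypothetical latin-1 file), would declare the input encoding rather than trusting the locale:

hunspell -i ISO8859-1 -l < words.latin1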

dagolden commented 11 years ago

Perhaps Pod::Spell needs an option (to pass on to Pod::Wordlist) for whether to strip words containing characters above 0xff.

Then the handle binmode can be left to Test::Spelling (or whatever else uses it). Depending on the spellchecker available, it can set the "no_wide_chars" setting and set the binmode on the handle to ":utf8" or ":encoding(latin1)".
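Roughly, on the consumer's side (a sketch of the idea, not actual Test::Spelling code; the policy table and file name are illustrative):

use strict;
use warnings;
use Pod::Spell;

# Illustrative per-checker policy: UTF-8-capable checkers get :utf8;
# latin1-only checkers get :encoding(latin1) plus no_wide_chars, so
# words with characters above 0xff never reach them.
my %policy = (
    aspell   => { layer => ':utf8',             no_wide_chars => 0 },
    hunspell => { layer => ':utf8',             no_wide_chars => 0 },
    ispell   => { layer => ':encoding(latin1)', no_wide_chars => 1 },
);

my $checker = 'ispell';    # whichever spellchecker was found
my $scratch = '';
open my $out, '>', \$scratch or die $!;
binmode $out, $policy{$checker}{layer};

my $parser = Pod::Spell->new(
    no_wide_chars => $policy{$checker}{no_wide_chars},
);

open my $in, '<', 'lib/Foo.pm' or die $!;    # hypothetical input
$parser->parse_from_filehandle( $in, $out );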

dagolden commented 11 years ago

I've implemented no_wide_chars and added documentation about Encoding issues: 1c2da66adbd6ee258d5f74421bff3177652694de

I think the ball is now in Test::Spelling's court to do the following:

- choose the binmode for the output handle (":utf8" or ":encoding(latin1)") based on which spellchecker it found
- tell that spellchecker what encoding to expect on STDIN (e.g. aspell's --encoding=utf-8) instead of relying on the locale
- set no_wide_chars when the checker can only handle latin1

pghmcfc commented 11 years ago

@dagolden, regarding the locale I use: by default everything is run in the "C" locale:

$ locale
LANG=C
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=

Now, if I set LANG=en_US.UTF-8, then the old versions of Fedora (prior to Fedora 17) seem to start passing the 1.09 test suite again, but the new versions, which were working in the "C" locale, do this:

Unable to find a working spellchecker:
    Unable to run 'hunspell -l': spellchecker had errors: This UTF-8 encoding can't convert to UTF-16:
ù". If you left the out that would mean
This UTF-8 encoding can't convert to UTF-16:
ù". If you left the out that would mean
This UTF-8 encoding can't convert to UTF-16:
ù". If you left the out that would mean
 at t/author-pod-spell.t line 19
# Looks like you planned 3 tests but ran 2.
# Looks like your test exited with 25 just after 2.
t/author-pod-spell.t ..........
Dubious, test returned 25 (wstat 6400, 0x1900)
Failed 1/3 subtests

The "ù" there is ISO-8859-1 encoded.

I tried running hunspell manually against data with different encodings and got:

$ cat > testwords
test
virtù
$ file testwords
testwords: UTF-8 Unicode text
$ iconv -f utf8 -t iso-8859-1 < testwords > testwords.latin
$ mv testwords testwords.utf8
$ iconv -f utf8 -t utf16 < testwords.utf8 > testwords.utf16
$ file testwords.*
testwords.latin: ISO-8859 text
testwords.utf16: Little-endian UTF-16 Unicode text
testwords.utf8:  UTF-8 Unicode text
$ LANG=en_US.UTF-8 hunspell -l < testwords.latin
This UTF-8 encoding can't convert to UTF-16:

This UTF-8 encoding can't convert to UTF-16:

virt
This UTF-8 encoding can't convert to UTF-16:

$ LANG=en_US.UTF-8 hunspell -l < testwords.utf8
virtù
$ LANG=en_US.UTF-8 hunspell -l < testwords.utf16
This UTF-8 encoding can't convert to UTF-16:
��t
This UTF-8 encoding can't convert to UTF-16:
�t
This UTF-8 encoding can't convert to UTF-16:
��t
This UTF-8 encoding can't convert to UTF-16:
�t

So it seems that the test would have worked if hunspell had been fed UTF-8 encoded data, but it looks like it was fed ISO-8859 data instead, despite the LANG setting.

dagolden commented 11 years ago

That's pretty much what I would expect with Test::Spelling not fixed the way I describe above.

Pod::Spell takes a handle in and a handle out. It's up to Test::Spelling to manage those handles correctly for the spellchecker it selects.

pghmcfc commented 11 years ago

Updating to Pod-Spell HEAD and current Test-Spelling with hunspell and en_US.UTF-8 has this working for me now.

xenoterracide commented 10 years ago

afaik, this was resolved