@pghmcfc, I couldn't replicate your error on v5.14.4. What is your locale set to? I'm wondering if that has an effect on things.
Note: there's a "virt" stopword that I believe was added simply to cover "virtù". I've also been digging into the encoding bugs mentioned elsewhere...
https://github.com/sartak/test-spelling/pull/3 -- this may have relevant encoding information (part of the reason I adopted this module was to hopefully fix this eventually).
Aspell has an --encoding option, but defaults to locale. Possibly others do as well. If we set :utf8 on the output handle and document that, and if Test::Spelling sets the right encoding option to UTF-8, that might even handle people with non-UTF-8 locales.
Another moving part is IPC::Run3 pumping that string to STDIN. I have no idea if that affects encoding. I think that if we set :utf8 on the handle, then the string just winds up with octets and gets passed to STDIN as octets and that's what we want.
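Roughly what I have in mind, as a sketch (the input path is just an example; this assumes IPC::Run3's documented run3 interface and aspell's list mode):

```perl
use strict;
use warnings;
use Pod::Spell;
use IPC::Run3 qw(run3);

# Collect Pod::Spell's output into a string, with :utf8 on the handle so
# wide characters become UTF-8 octets instead of whatever Perl's internal
# representation happens to be.
open my $in,  '<', 'lib/Pod/Spell.pm' or die $!;   # example input file
open my $out, '>:utf8', \my $words    or die $!;
Pod::Spell->new->parse_from_filehandle( $in, $out );
close $out;

# Tell aspell to expect UTF-8 regardless of the user's locale; run3 then
# feeds the octets straight to aspell's STDIN.
run3 [qw( aspell list --encoding=utf-8 )], \$words, \my $misspelled;
print $misspelled;
```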
Highlighting @sartak and @karenetheridge to consider Test::Spelling implications.
I've created a branch, topic/utf8, with a test that demonstrates the problem. Interestingly, it only really failed when I added Japanese, which changes the internal representation spewed to the output handle.
I suspect that just calling binmode to set :utf8 on the output handle will "fix" this, but then Test::Spelling needs to tell spell checkers to expect UTF-8 on STDIN and ignore locale.
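Here's the nub of it in miniature (nothing beyond core Perl; the strings are just examples):

```perl
use strict;
use warnings;
use utf8;    # the string literals below contain non-ASCII characters

open my $fh, '>', \my $buf or die $!;
print {$fh} "virtù";     # U+00F9 fits in a byte: no warning, a latin-1 octet
print {$fh} "日本語";    # code points > 0xff: "Wide character in print",
                         # and Perl's internal UTF-8 leaks out
close $fh;

# With the layer set, $buf reliably ends up holding UTF-8 octets:
open $fh, '>:utf8', \$buf or die $!;
print {$fh} "virtù 日本語";
close $fh;
```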
I'm going out for a while, so I'll check back later.
For aspell: `aspell --encoding=utf-8 -c <file>`
ispell can only handle Latin-1 (the documentation explicitly states that non-European character sets aren't supported: http://mintaka.sdsu.edu/GF/bibliog/ispell/multi.html), so if ispell is configured as the preferred checker, the best we can do is raise a warning if there are characters outside this range in the stream.
hunspell seems to be in the same boat as ispell -- http://sourceforge.net/projects/hunspell/files/Hunspell/Documentation/
hunspell has a -i option to specify the input encoding; doesn't that help?
Perhaps Pod::Spell needs an option (to pass on to Pod::Wordlist) for whether to strip words containing characters above 0xff.
Then the handle binmode can be left to Test::Spelling (or whatever else uses it). Depending on the spellchecker available, it can set the "no_wide_chars" setting and set the binmode on the handle to ":utf8" or ":encoding(latin1)".
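Something along these lines on the Test::Spelling side, as a sketch; the dispatch and function name here are illustrative, not Test::Spelling's actual internals, and it assumes a no_wide_chars option on Pod::Spell's constructor as proposed above:

```perl
use strict;
use warnings;
use Pod::Spell;

# Hypothetical dispatch: UTF-8-capable checkers get :utf8 and keep wide
# words; ispell/spell get latin1 plus no_wide_chars to strip anything
# that encoding can't carry.
my %can_utf8 = map { $_ => 1 } qw( aspell hunspell );

sub words_for_checker {
    my ( $checker, $pod_file ) = @_;
    my $utf8  = $can_utf8{$checker};
    my $layer = $utf8 ? ':utf8' : ':encoding(latin1)';
    open my $in,  '<',       $pod_file  or die $!;
    open my $out, ">$layer", \my $words or die $!;
    Pod::Spell->new( no_wide_chars => $utf8 ? 0 : 1 )
              ->parse_from_filehandle( $in, $out );
    close $out;
    return $words;
}
```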
I've implemented `no_wide_chars` and added documentation about encoding issues: 1c2da66adbd6ee258d5f74421bff3177652694de

I think the ball is now in Test::Spelling's court to do the following: set the binmode on the handle for the chosen spellchecker, and set the `no_wide_chars` flag to true for ispell and spell (?).

@dagolden, regarding the locale I use: by default everything is run in the "C" locale:
```
$ locale
LANG=C
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=
```
Now if I set LANG=en_US.UTF-8, then the old versions of Fedora (prior to Fedora 17) seem to start passing the 1.09 test suite again, but the new versions, which were working in the "C" locale, do this:
```
Unable to find a working spellchecker:
Unable to run 'hunspell -l': spellchecker had errors: This UTF-8 encoding can't convert to UTF-16:
ù". If you left the out that would mean
This UTF-8 encoding can't convert to UTF-16:
ù". If you left the out that would mean
This UTF-8 encoding can't convert to UTF-16:
ù". If you left the out that would mean
at t/author-pod-spell.t line 19
# Looks like you planned 3 tests but ran 2.
# Looks like your test exited with 25 just after 2.
t/author-pod-spell.t ..........
Dubious, test returned 25 (wstat 6400, 0x1900)
Failed 1/3 subtests
```
The "ù" there is ISO-8859-1 encoded.
I tried running hunspell manually against data with different encodings and got:
```
$ cat > testwords
test
virtù
$ file testwords
testwords: UTF-8 Unicode text
$ iconv -f utf8 -t iso-8859-1 < testwords > testwords.latin
$ mv testwords testwords.utf8
$ iconv -f utf8 -t utf16 < testwords.utf8 > testwords.utf16
$ file testwords.*
testwords.latin: ISO-8859 text
testwords.utf16: Little-endian UTF-16 Unicode text
testwords.utf8: UTF-8 Unicode text
$ LANG=en_US.UTF-8 hunspell -l < testwords.latin
This UTF-8 encoding can't convert to UTF-16:
This UTF-8 encoding can't convert to UTF-16:
virt
This UTF-8 encoding can't convert to UTF-16:
$ LANG=en_US.UTF-8 hunspell -l < testwords.utf8
virtù
$ LANG=en_US.UTF-8 hunspell -l < testwords.utf16
This UTF-8 encoding can't convert to UTF-16:
��t
This UTF-8 encoding can't convert to UTF-16:
�t
This UTF-8 encoding can't convert to UTF-16:
��t
This UTF-8 encoding can't convert to UTF-16:
�t
```
So it seems that the test would have worked if hunspell had been fed UTF-8 encoded data, but it looks like it was fed ISO-8859 data instead, despite the LANG setting.
That's pretty much what I would expect as long as Test::Spelling isn't fixed the way I described above.
Pod::Spell takes a handle in and a handle out. It's up to Test::Spelling to manage those handles correctly for the spellchecker it selects.
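To restate that contract as a sketch (the path is a stand-in, and the latin1 layer is just what an ispell-only setup might choose):

```perl
use strict;
use warnings;
use Pod::Spell;

# Pod::Spell reads POD from one handle and writes words to the other;
# whoever opens the handles decides the encoding layers.
open my $in,  '<', 'lib/Foo.pm'                or die $!;  # stand-in path
open my $out, '>:encoding(latin1)', \my $words or die $!;  # e.g. for ispell
Pod::Spell->new( no_wide_chars => 1 )->parse_from_filehandle( $in, $out );
close $out;
```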
Updating to Pod-Spell HEAD and current Test-Spelling with hunspell and en_US.UTF-8 has this working for me now.
afaik, this was resolved
This is a continuation of discussion in issue #9.
What I found is that Test::Spelling gives Pod::Spell a file handle opened to a string, but does not set the `:utf8` layer on it. Thus, when Pod::Spell prints a Unicode character to the handle, I think it goes in as whatever Perl's internal representation is. That might be a byte from 0x80 to 0xff, or it could be UTF-8. My copy of aspell doesn't bother to report the weirdly encoded word.

When I hotpatched Test::Spelling to set `:utf8`, I got the same spelling error on virtù that was mentioned in #9. So, on the one hand, we need to add virtù as a stopword, but the bigger problem is that we need to fix what we do with encodings.

I'm not sure if we should UTF-8 encode output ourselves (or set `:utf8` on the handle ourselves) or if Test::Spelling should be fixed. I'm not sure what the spellcheckers are expecting on STDIN and whether or not that depends on the current locale of the user.

So... I'm not sure what the answer is.
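For the stopword half, once the encoding plumbing is fixed, something like this with Test::Spelling's documented stopword interface ought to do (the lib path is just an example):

```perl
use strict;
use warnings;
use utf8;    # so the literal "virtù" below is read as characters
use Test::Spelling;

add_stopwords('virtù');
all_pod_files_spelling_ok('lib');
```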
As a bit of good news, I actually looked at unique "words" found by Pod::Spell on its own docs and have some tweaks for the word munging to fix up some things.