Fix support for non-ASCII text

rjbs commented 11 years ago

Dist::Zilla was hastily cobbled together, and nowhere does that show more than in its poor support for characters outside of ASCII. It has no real concept of encoding/decoding boundaries. Worse still, many of these boundaries would need to be enforced not by Dist::Zilla, but by in external libraries written to support it.

Inputs to Fix

Mixin::Linewise is used to read both Pod and configuration; it should be reading text, not bytes; rjbs/Mixin-Linewise#3
Pod::Eventual and Pod:Elemental are used to parse Pod; they will, in theory, be fixed by the fix for Linewise, but we will want something in place to think about =encoding
dzil setup should properly transcode to the produced file; assuming UTF-8 at both ends is not good enough (I guess Win32 is still not UTF-8 terminal by default?)
File objects return their content as bytes; they should keep doing so, but should also have an encoding attribute and a decoded_content method; when these are mutable, they should be kept in sync automatically and lazily (task needed)
we can (and probably should) make wide characters in content fatal
things that generate in-memory files (that's both InMemory and FromCode) will need to generate them correctly, priming content with bytes or decoded_content with text; specifically:
- Software::License needs to clarify whether it returns text or bytes; (it should be text!)
- InlineFiles needs to clarify whether it returns text or bytes; (it should be bytes!)
- files created from templates stored in source should be checked; in most cases, right now, they are bytes; if the file uses utf8, make sure it's text, etc.
we should make at least a cursory review of popular plugins that will get things wrong; for example, ContributorsFromGit, I think, will be sticking bytes into places that we'll be expecting strings
abstract extraction needs to be fixed to extract text rather than bytes
Outputs to Fix
Dist::Zilla::_write_out_file is probably okay, as ->content will still be bytes (but more correct) in the future
STDOUT/STDERR may need encoding layers applied
...but the Test chrome should be capturing strings, if easy enough
...are there any more outputs?
Miscellany
Does PPI prefer octets? Text? Or is it just broken here? Find out.

dagolden commented 11 years ago

I believe that ContributorsFromGit now UTF-8 decodes its input from qx// so, assuming that git is producing UTF-8 output, all should be well for that.

If I understand you correctly, File object content stays bytes in and bytes out, but it can be annotated with encodings so that mungers can translate from bytes to characters, do manipulation, and put it back. Possibly, we need a plugin to specify encodings when the default doesn't serve:

[Encodings]
t/data/latin1.txt  = latin1
t/data/raw.txt     = raw
t/data/arigato.txt = euc-jp

[Encodings::Match]
^t/data/jp/     = euc-jp
^t/data/images/ = raw

That should run after gathering but before munging.

I wonder if the inverse approach with File would be easier on the ecosystem: make content be characters and have encoded_content be bytes. There are only a few file sources (and fewer outside dzil core) but there are a lot more mungers, I think. If mungers can continue using content for character manipulation, they don't need to be rewritten. Only the file gatherers and other file sources need to be rewritten to store data to encoded_content instead of content and then mungers can operation on content with conversion on demand.

rjbs commented 11 years ago

I definitely agree that we need a way to mark files as not having the default encoding, and the mechanism you suggest seems good to me, at least at first blush. I had considered more or less exactly the same thing, myself, which I take (tentatively:)) as a good sign.

Swapping the defaults also sounds plausible. I think I'd originally disliked this when I'd considered it because I had the stupid thought that "it would mean that we'd have to decode eagerly to have 'content' populated," but this is bunk, for various reasons. I think it could work.

dolmen commented 11 years ago

For completeness, Config::INI should also be adapted: it is used by dzil authordeps and uses Mixin::Linewise.

I think that all those welcome changes would deserve a major version bump of Dist::Zilla.

rjbs commented 11 years ago

I believe that Config::INI's behavior will be corrected by rjbs/Mixin-Linewise#3's application.

I also agree that I see this as being v5, pushing back my other ideas. :)

dagolden commented 11 years ago

Without knowing the guts of your Pod modules, my thought on =encoding is either of:

have dzil pass the encoded content string and have Pod::Eventual start decoding line-by-line once =encoding is seen
have dzil pass the decoded content string and assume the user got the encoding right when getting data from disk and just ignore =encoding
have dzil pass the decoded content string and (as an extra argument) pass the original file encoding and have Pod::Eventual (or whatever) warn/die if it sees =encoding and doesn't match (or transcode, ick).

The more I think about it =encoding is pretty much useless unless a Pod parser is responsible for reading from a file itself and forces raw mode. That might be fine for perldoc but it's pretty useless for general purpose file and Pod manipulation tools.

dagolden commented 11 years ago

For Software::License and other stuff, Data::Section is going to need some thought. It just reads the DATA handle, which will have whatever layers Perl leaves on it from loading the file. I've already tested with/without "use utf8". With that pragma, DATA gets ":utf8". Without it, it appears to have default layers.

Thus, Data::Section will return bytes/text depending on the utf8 (or other) pragma, and if it's bytes, the encoding will depend on how someone's text editor saved it.

Maybe Data::Section should binmode the DATA handle, read raw, and then assume UTF-8 and decode to characters unless someone provides an alternate encoding in the section divider (where the filename goes). That supports UTF-8 in DATA with or without "use utf8" pragma, and if someone has some other pragma in effect, it's up to them to modify their section dividers to match.

dolmen commented 11 years ago

About Data::Section, wouldn't it be enough to document that it returns text (leaving the source code author decide how Perl should intrepret that with utf8/encoding pragmas)? I don't see how 'raw' would be helpful. Note that the documentation already says « Data::Section provides an easy way to access multiple named chunks of line-oriented data ». Are CR/LF vs LF preserved?

dagolden commented 11 years ago

@dolmen, I think it should read raw, but then decode it using a reasonable default -- and make it easy for someone to change that default. E.g.

use Data::Section -setup => { encoding => "latin1" };
...

karenetheridge commented 11 years ago

Filenames could also be encoded and need to be handled properly...

When I dzilified Acme-LookOfDisapproval, I see:

Use of uninitialized value in bitwise and (&) at /Users/ether/.perlbrew/libs/19.3@std/lib/perl5/Dist/Zilla/Plugin/GatherDir.pm line 109.

I inserted a 'say $filename' before this line (with no alteration of layers on STDOUT), and the filename in question is "lib/Acme/\340\262\240_\340\262\240.pm".

rjbs commented 11 years ago

Karen Etheridge notifications@github.com [2013-09-10T14:39:09]

Filenames could also be encoded and need to be handled properly...

There is no useful, portable way to know how to map a non-ASCII filename to a sequence of bytes to be passed to open. I'm quite tempted to just forbid non-ASCII filenames until such a method can be pointed out.

I probably won't, though, but will say: don't do that unless you control all the filesystems where you'll be deploying.

rjbs

dagolden commented 11 years ago

We should leave filename as bytes and work around them as best as possible. The errror in question appears to be from a failing stat(). We can detect that and do something "appropriate".

I'm guessing that the manipulation of the filename to get it relative did something screwy before it could stat, but I don't know what. Even if it wound up characters, I would think that perl's stat() would ignore that and just use the bytes in the internal string.

dolmen commented 11 years ago

There is at least one platform where filenames are guaranteed to be properly encoded, and where they can contain even more-than-8-bits Unicode codepoints: Windows. Unfortunately, AFAIK (and I admit I don't know much about perl's implementation), perl 5 core still uses only the 8 bits API instead of the Unicode one which means some files are not readable from Perl without using some Win32:: moduel (I wonder if perl is much used on Windows in China due to this...)

But anyway, the result of a Dist::Zilla-built distribution is a .tar.gz where filenames' encoding is not specified. So non-ASCII filenames in the distribution are not portable. (As a side note, Java has properly encoded containers: .jar files are zip files with UTF-8 filenames). Proper filename encoding may still interest perl authors of platform-specific distributions (such as Win32::* modules).

karenetheridge commented 11 years ago

related encoding issues involving dzil plugins:

https://github.com/xenoterracide/Pod-Spell/issues/11

~~https://github.com/creaktive/Test-Mojibake/issues/6~~

karenetheridge commented 11 years ago

Pod::Text looks like it detects and handles encodings.

dagolden commented 11 years ago

But I Pod::Eventual does not (yet), I believe.

There's two issues for dzil, I think:

detecting and preserving encoding of entire files
ensuring that munging doesn't mix encodings

I think if we can detect (or have a hint) about encoding when files are read, then munging can happen on decoded data (characters), and then be written back out with the necessary encoding.

An edge case I can see already is when a file is, say, ASCII, and then winds up through munging to have characters > 0x7F without either "use utf8" or "=encoding utf8". But that could just be a warning.

karenetheridge commented 11 years ago

An edge case I can see already is when a file is, say, ASCII, and then winds up through munging to have characters > 0x7F without either "use utf8" or "=encoding utf8". But that could just be a warning.

Right, and we've seen this situation happen already (e.g. META.yml). Would using :encoding(UTF-8) to verify that the resulting text is actually valid unicode be of use? If a build operation dies with an appropriate warning, then we (the user) can see what files are problematic and track down the plugins that are munging incorrectly.

Also, additional tests (release tests like Test::Mojibake, or CPANTS analyzers) can help here.

https://rt.cpan.org/Ticket/Display.html?id=89350 https://rt.cpan.org/Ticket/Display.html?id=88424

dagolden commented 11 years ago

I wasn't clear. I meant if a gathered file starts with no apparent encoding and then later gains wide characters without any indication, do we helpfully just upgrade it to UTF-8 encoding?

E.g. pure ASCII module with only ASCII in Pod is gathered into a file object in memory. Then through pod weaving, it gets wide characters in Pod. How do we make sure (a) =encoding utf8 is added to Pod and (b) that file is UTF-8 encoded when dumped into the distribution directory?

We can cover (b) by just defaulting to UTF-8 if prior encoding is not detected. Can we cover (a)? Do we care? I suspect it's really Pod::Weaver-specific so that might be fixable there. If wide characters are detected and there's no =encoding then be sure to add it.

(With the weirdness of having an in-memory representation say =encoding while not really being encoded that way. Stupid fucking pod spec, putting an encoding directive inline into the content. (sigh))

karenetheridge commented 11 years ago

I think I understood what you meant, although my response wasn't very coherent.

if all we had was ascii (up to 0x7f) before, then upgrading to utf8 is safe and the right thing to do
if we had latin1 (single-octet) chars before, I think we'd need to upgrade the original content first before inserting our unicode content, or we'll end up with mojibake.

PodWeaver can and should add an =encoding directive; anything else that munges pod or code is probably on its own, but we have existing tests for that (e.g. Test::Mojibake -- I've certainly run into this test failing for me because dzil inserted latin1 chars when I had an =encoding utf-8 directive).

dagolden commented 11 years ago

That specific bug is one I'm sure we'll cover. And we'll dedicate the fix to Olivier Mengué. :-)

karenetheridge commented 11 years ago

And we'll dedicate the fix to Olivier Mengué. :-)

And ilmari and avar, my other two unicode canaries! :)

one other thing I noticed -- there's lots of places in the core where we're writing out files manually to disk (e.g. [ManifestSkip], [NextRelease]... or reading them manually - [TemplateModule], [GatherDir::Template]). We might be able to shrink the code a bit by using Dist::Zilla::File::OnDisk for all of these operations (to get the 'encoding' property, not having to worry about manually setting io layers etc) -- after all, the $file object doesn't have to be a member of $zilla->files - we can instantiate them separately.

xenoterracide commented 11 years ago

I suspect this is related http://stackoverflow.com/questions/19440859/wrong-file-encoding-after-distzilla output file-encoding is wrong. ( reads all the way through, definitely, urgh, temporary fixes? )

karenetheridge commented 11 years ago

yes

rjbs commented 11 years ago

I am closing this ticket!

We believe we have addressed the problem in Dist::Zilla v5.

For more information, consult http://rjbs.manxome.org/rubric/entry/2021

rjbs / Dist-Zilla

Fix support for non-ASCII text #212

Inputs to Fix

Outputs to Fix

Miscellany