I believe that ContributorsFromGit now UTF-8 decodes its input from qx// so, assuming that git is producing UTF-8 output, all should be well for that.
If I understand you correctly, File object content stays bytes in and bytes out, but it can be annotated with encodings so that mungers can translate from bytes to characters, do manipulation, and put it back. Possibly, we need a plugin to specify encodings when the default doesn't serve:
[Encodings]
t/data/latin1.txt = latin1
t/data/raw.txt = raw
t/data/arigato.txt = euc-jp
[Encodings::Match]
^t/data/jp/ = euc-jp
^t/data/images/ = raw
That should run after gathering but before munging.
I wonder if the inverse approach with File would be easier on the ecosystem: make content be characters and have encoded_content be bytes. There are only a few file sources (and fewer outside dzil core), but there are a lot more mungers, I think. If mungers can continue using content for character manipulation, they don't need to be rewritten. Only the file gatherers and other file sources need to be rewritten to store data to encoded_content instead of content, and then mungers can operate on content with conversion on demand.
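A minimal sketch of that shape, assuming Moose (the class name is made up; this is not Dist::Zilla's actual code): content holds characters, encoded_content holds bytes, each is derived lazily from the other via the file's encoding, and a trigger clears the stale counterpart whenever one is set.

package Sketch::File;    # hypothetical class name
use Moose;
use Encode qw(encode decode);

has encoding => (is => 'rw', isa => 'Str', default => 'UTF-8');

# characters; built lazily from bytes if only bytes were primed
has content => (
    is      => 'rw',
    isa     => 'Str',
    lazy    => 1,
    clearer => '_clear_content',
    default => sub { decode($_[0]->encoding, $_[0]->encoded_content) },
    trigger => sub { $_[0]->_clear_encoded_content },    # keep in sync
);

# bytes; built lazily from characters if only characters were primed
has encoded_content => (
    is      => 'rw',
    isa     => 'Str',
    lazy    => 1,
    clearer => '_clear_encoded_content',
    default => sub { encode($_[0]->encoding, $_[0]->content) },
    trigger => sub { $_[0]->_clear_content },            # keep in sync
);

# a file source must prime exactly one of the two, e.g.
# Sketch::File->new(encoded_content => $bytes_from_disk);

1;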
I definitely agree that we need a way to mark files as not having the default encoding, and the mechanism you suggest seems good to me, at least at first blush. I had considered more or less exactly the same thing, myself, which I take (tentatively :)) as a good sign.
Swapping the defaults also sounds plausible. I think I'd originally disliked this when I'd considered it because I had the stupid thought that "it would mean that we'd have to decode eagerly to have 'content' populated," but this is bunk, for various reasons. I think it could work.
For completeness, Config::INI should also be adapted: it is used by dzil authordeps and uses Mixin::Linewise.
I think that all those welcome changes would deserve a major version bump of Dist::Zilla.
I believe that Config::INI's behavior will be corrected by the application of rjbs/Mixin-Linewise#3.
I also agree: I see this as being v5, pushing back my other ideas. :)
Without knowing the guts of your Pod modules, my thought on =encoding is either of:

- strip =encoding when it is seen
- croak when =encoding is present and doesn't match (or transcode, ick).

The more I think about it, =encoding is pretty much useless unless a Pod parser is responsible for reading from a file itself and forces raw mode. That might be fine for perldoc, but it's pretty useless for general-purpose file and Pod manipulation tools.
For Software::License and other stuff, Data::Section is going to need some thought. It just reads the DATA handle, which will have whatever layers Perl leaves on it from loading the file. I've already tested with/without "use utf8". With that pragma, DATA gets ":utf8". Without it, it appears to have default layers.
Thus, Data::Section will return bytes/text depending on the utf8 (or other) pragma, and if it's bytes, the encoding will depend on how someone's text editor saved it.
Maybe Data::Section should binmode the DATA handle, read raw, and then assume UTF-8 and decode to characters unless someone provides an alternate encoding in the section divider (where the filename goes). That supports UTF-8 in DATA with or without "use utf8" pragma, and if someone has some other pragma in effect, it's up to them to modify their section dividers to match.
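Roughly, something like this sketch; note that the encoding=... divider syntax here is hypothetical, not Data::Section's actual format:

use strict;
use warnings;
use Encode qw(decode);

sub read_sections {
    my ($fh) = @_;
    binmode $fh, ':raw';    # drop whatever layers perl left on DATA
    my (%raw, $name);
    while (my $line = <$fh>) {
        # hypothetical divider: __[ name ]__ or __[ name encoding=latin1 ]__
        if ($line =~ /^__\[\s*(\S+)(?:\s+encoding=(\S+))?\s*\]__/) {
            $name = $1;
            $raw{$name} = { encoding => $2 // 'UTF-8', bytes => '' };
        }
        elsif (defined $name) {
            $raw{$name}{bytes} .= $line;
        }
    }
    # decode each section's bytes into characters
    return { map { $_ => decode($raw{$_}{encoding}, $raw{$_}{bytes}) } keys %raw };
}

my $sections = read_sections(\*DATA);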
About Data::Section, wouldn't it be enough to document that it returns text (letting the source code author decide how Perl should interpret that with the utf8/encoding pragmas)? I don't see how 'raw' would be helpful. Note that the documentation already says « Data::Section provides an easy way to access multiple named chunks of line-oriented data ». Are CR/LF vs LF preserved?
@dolmen, I think it should read raw, but then decode it using a reasonable default -- and make it easy for someone to change that default. E.g.
use Data::Section -setup => { encoding => "latin1" };
...
Filenames could also be encoded and need to be handled properly...
When I dzilified Acme-LookOfDisapproval, I saw:
Use of uninitialized value in bitwise and (&) at /Users/ether/.perlbrew/libs/19.3@std/lib/perl5/Dist/Zilla/Plugin/GatherDir.pm line 109.
I inserted a 'say $filename' before this line (with no alteration of layers on STDOUT), and the filename in question is "lib/Acme/\340\262\240_\340\262\240.pm".
There is no useful, portable way to know how to map a non-ASCII filename to a sequence of bytes to be passed to open. I'm quite tempted to just forbid non-ASCII filenames until such a method can be pointed out.
I probably won't, though, but will say: don't do that unless you control all the filesystems where you'll be deploying.
We should leave filenames as bytes and work around them as best as possible. The error in question appears to be from a failing stat(). We can detect that and do something "appropriate".
I'm guessing that the manipulation of the filename to get it relative did something screwy before it could stat, but I don't know what. Even if it wound up characters, I would think that perl's stat() would ignore that and just use the bytes in the internal string.
There is at least one platform where filenames are guaranteed to be properly encoded, and where they can even contain Unicode codepoints beyond 8 bits: Windows. Unfortunately, AFAIK (and I admit I don't know much about perl's implementation), perl 5 core still uses only the 8-bit API instead of the Unicode one, which means some files are not readable from Perl without using some Win32:: module (I wonder if perl is much used on Windows in China due to this...)
But anyway, the result of a Dist::Zilla-built distribution is a .tar.gz where filenames' encoding is not specified. So non-ASCII filenames in the distribution are not portable. (As a side note, Java has properly encoded containers: .jar files are zip files with UTF-8 filenames). Proper filename encoding may still interest perl authors of platform-specific distributions (such as Win32::* modules).
related encoding issues involving dzil plugins:
Pod::Text looks like it detects and handles encodings.
But Pod::Eventual does not (yet), I believe.
There are two issues for dzil, I think:
I think if we can detect (or have a hint) about encoding when files are read, then munging can happen on decoded data (characters), and then be written back out with the necessary encoding.
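As a sketch of that round trip (Path::Tiny is used here just for raw I/O; the path and the substitution are made up):

use Encode qw(encode decode);
use Path::Tiny qw(path);

my $encoding = 'UTF-8';                          # the hint (configured or detected)
my $bytes    = path('lib/Foo.pm')->slurp_raw;    # read bytes, no IO layers
my $text     = decode($encoding, $bytes);        # munge on characters...

$text =~ s/\{\{\$VERSION\}\}/1.001/g;            # ...e.g. a template substitution

path('.build/lib/Foo.pm')->spew_raw(encode($encoding, $text));    # bytes back out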
An edge case I can see already is when a file is, say, ASCII, and then winds up through munging to have characters > 0x7F without either "use utf8" or "=encoding utf8". But that could just be a warning.
Right, and we've seen this situation happen already (e.g. META.yml). Would using :encoding(UTF-8) to verify that the resulting text is actually valid unicode be of use? If a build operation dies with an appropriate warning, then we (the user) can see what files are problematic and track down the plugins that are munging incorrectly.
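One cheap form of that check, sketched with Encode's strict FB_CROAK mode ($bytes and $name stand in for whatever the build step has in hand):

use Encode qw(decode FB_CROAK);

# strict decode: dies if $bytes is not valid UTF-8; copy first because
# FB_CROAK may consume the buffer it is given
my $copy = $bytes;
eval { decode('UTF-8', $copy, FB_CROAK); 1 }
    or die "$name is not valid UTF-8: $@";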
Also, additional tests (release tests like Test::Mojibake, or CPANTS analyzers) can help here.
https://rt.cpan.org/Ticket/Display.html?id=89350 https://rt.cpan.org/Ticket/Display.html?id=88424
I wasn't clear. I meant if a gathered file starts with no apparent encoding and then later gains wide characters without any indication, do we helpfully just upgrade it to UTF-8 encoding?
E.g. pure ASCII module with only ASCII in Pod is gathered into a file object in memory. Then through pod weaving, it gets wide characters in Pod. How do we make sure (a) =encoding utf8 is added to Pod and (b) that file is UTF-8 encoded when dumped into the distribution directory?
We can cover (b) by just defaulting to UTF-8 if prior encoding is not detected. Can we cover (a)? Do we care? I suspect it's really Pod::Weaver-specific, so that might be fixable there. If wide characters are detected and there's no =encoding, then be sure to add it.
(With the weirdness of having an in-memory representation say =encoding while not really being encoded that way. Stupid fucking pod spec, putting an encoding directive inline into the content. (sigh))
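A sketch of what (a) might look like inside a pod munger (the helper name is made up):

# hypothetical helper: add =encoding if munging introduced wide characters
sub ensure_pod_encoding {
    my ($pod) = @_;
    return $pod if $pod !~ /[^\x00-\x7F]/;    # still pure ASCII: nothing to add
    return $pod if $pod =~ /^=encoding\b/m;   # already declares an encoding
    return "=encoding utf8\n\n$pod";          # assume UTF-8 as the default
}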
I think I understood what you meant, although my response wasn't very coherent.
PodWeaver can and should add an =encoding directive; anything else that munges pod or code is probably on its own, but we have existing tests for that (e.g. Test::Mojibake -- I've certainly run into this test failing for me because dzil inserted latin1 chars when I had an =encoding utf-8 directive).
That specific bug is one I'm sure we'll cover. And we'll dedicate the fix to Olivier Mengué. :-)
And ilmari and avar, my other two unicode canaries! :)
one other thing I noticed -- there's lots of places in the core where we're writing out files manually to disk (e.g. [ManifestSkip], [NextRelease]... or reading them manually - [TemplateModule], [GatherDir::Template]). We might be able to shrink the code a bit by using Dist::Zilla::File::OnDisk for all of these operations (to get the 'encoding' property, not having to worry about manually setting io layers etc) -- after all, the $file object doesn't have to be a member of $zilla->files - we can instantiate them separately.
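For example, something like this sketch (assuming the plugin already has the path in hand; $template_path is made up):

use Dist::Zilla::File::OnDisk;

# wrap the on-disk file instead of open()ing it by hand
my $file = Dist::Zilla::File::OnDisk->new({ name => $template_path });
my $text = $file->content;    # with v5 semantics, decoded per the file's encoding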
I suspect this is related: http://stackoverflow.com/questions/19440859/wrong-file-encoding-after-distzilla ("output file-encoding is wrong"). (Read it all the way through: definitely related. Urgh. Temporary fixes?)
yes
I am closing this ticket!
We believe we have addressed the problem in Dist::Zilla v5.
For more information, consult http://rjbs.manxome.org/rubric/entry/2021
Dist::Zilla was hastily cobbled together, and nowhere does that show more than in its poor support for characters outside of ASCII. It has no real concept of encoding/decoding boundaries. Worse still, many of these boundaries would need to be enforced not by Dist::Zilla itself, but by external libraries written to support it.
Inputs to Fix

- =encoding …
- dzil setup should properly transcode to the produced file; assuming UTF-8 at both ends is not good enough (I guess Win32 is still not a UTF-8 terminal by default?)
- file objects currently store content as bytes; they should keep doing so, but should also have an encoding attribute and a decoded_content method; when these are mutable, they should be kept in sync automatically and lazily (task needed)
- requesting decoded content should be fatal when a file's encoding is raw
- the other file classes (InMemory and FromCode) will need to generate them correctly, priming content with bytes or decoded_content with text; specifically: …

Outputs to Fix

- Dist::Zilla::_write_out_file is probably okay, as ->content will still be bytes (but more correct) in the future

Miscellany