metacpan / metacpan-api

A free, open API for everything you want to know about CPAN
http://www.metacpan.org/
Other
289 stars 196 forks source link

Test Failures (encoding related) in t/server/controller/source.t #230

Closed rwstauner closed 12 years ago

rwstauner commented 12 years ago

The tests added in these commits fail for me and I'm not sure why:

https://github.com/CPAN-API/cpan-api/commit/b5fef1dfd9b7ac5b6121f113e932b43c3ec10026 https://github.com/CPAN-API/cpan-api/commit/55dd9b52b7485780c729a4465e8fc0eb72104d72

ok 30 - decode json
not ok 31 - JSONP-wrapped change-log # TODO corrupt
#   Failed (TODO) test 'JSONP-wrapped change-log'
#   at t/server/controller/source.t line 67.
Wide character in print at /home/rando/perl5/perlbrew/perls/5.14.2-st/lib/5.14.2/Test/Builder.pm line 1759.
#                   '2012-01-01  0.01  First release - codename 'M�nchen'
# '
#     doesn't match '(?^:codename 'M\x{fc}nchen')'
ok 32 - GET /source/DOY/Moose-0.01/Changes
ok 33 - code 200
ok 34 - Content-type
not ok 35 - Change-log content # TODO corrupt
#   Failed (TODO) test 'Change-log content'
#   at t/server/controller/source.t line 55.
Wide character in print at /home/rando/perl5/perlbrew/perls/5.14.2-st/lib/5.14.2/Test/Builder.pm line 1759.
#                   '2012-01-01  0.01  First release - codename 'M�nchen'
# '
#     doesn't match '(?^:codename 'M\x{fc}nchen')'
1..35

I also find this odd:

$ zcat t/var/tmp/fakecpan/authors/id/D/DO/DOY/Moose-0.01.tar.gz | file -    
gzip: t/var/tmp/fakecpan/authors/id/D/DO/DOY/Moose-0.01.tar.gz: invalid compressed data--length error
/dev/stdin: POSIX tar archive

Is anyone else experiencing this corrupted tar issue? Anyone else get these test failures?

Are the two related?

I run cpan-outdated daily so all my modules should be up to date... what version of ES are we using these days?

Any ideas?

rwstauner commented 12 years ago

This is what I get from Data::Dumper and Devel::Peek:

$VAR1 = "2012-01-01  0.01  First release - codename 'M\x{fffd}nchen'
";
SV = PV(0x7ab5780) at 0x141ca38
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x6873eb0 "2012-01-01  0.01  First release - codename 'M\357\277\275nchen'\n"\0 [UTF8 "2012-01-01  0.01  First release - codename 'M\x{fffd}nchen'\n"]
  CUR = 55
  LEN = 56

Note that the tests seem to be expecting \x{cf} instead.

rwstauner commented 12 years ago

@grantm says it's my problem (he's not experiencing the test failure or the corrupted tar)

grantm commented 12 years ago

It would be interesting to see what's in $function_args before the test calls:

ok( my $jsdata = JSON->new->allow_nonref->decode( $function_args ), 'decode json' );
rwstauner commented 12 years ago

I'll try that function_args thing when i come back.

meanwhile after much trouble I got myself a 5.10.1 installed and still have the same problem.

I am running 64bit... is anyone else successfully running 64bit?

Interesting note... on my new 5.10.1 setup the tar is no longer corrupt.

rwstauner commented 12 years ago
use Data::Dumper; print STDERR Dumper($function_args);
use Devel::Peek qw(Dump); Dump($function_args);

$VAR1 = '"2012-01-01  0.01  First release - codename \'M\\ufffdnchen\'\\n"';
SV = PV(0x7677890) at 0xfd7a40
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x7564530 "\"2012-01-01  0.01  First release - codename 'M\\ufffdnchen'\\n\""\0
  CUR = 61
  LEN = 64
grantm commented 12 years ago

So in my case the contents of $function_args are more in line with expectations:

$VAR1 = '"2012-01-01 0.01 First release - codename \'M\u00fcnchen\'"'; SV = PV(0xc28f148) at 0xc21ba98 REFCNT = 1 FLAGS = (PADMY,POK,pPOK) PV = 0xc25b620 "\"2012-01-01 0.01 First release - codename 'M\u00fcnchen'\""\0 CUR = 59 LEN = 60

Your output shows that the JSON decoding stage is working fine, the incorrect character (\ufffd instead of \u00fc) is being returned by the /source/DOY/Moose-0.01/Changes?callback=foo API end-point. In theory, that endpoint is just returning the contents of the file, so now we need to determine if the incorrect character is in the file or if the /source handler is corrupting it somehow.

If I hex-dump the file I get this:

$ xxd var/tmp/source/DOY/Moose-0.01/Moose-0.01/Changes
0000000: 3230 3132 2d30 312d 3031 2020 302e 3031  2012-01-01  0.01
0000010: 2020 4669 7273 7420 7265 6c65 6173 6520    First release 
0000020: 2d20 636f 6465 6e61 6d65 2027 4dc3 bc6e  - codename 'M..n
0000030: 6368 656e 27                             chen'

The "c3 bc" bytes on the second-to-last line are the UTF8 form of the character \x{FC}. Do you have the same in your file?

rwstauner commented 12 years ago

On the perl that produces the uncorrupted tar I get the same hex-dump (but the test fails in the same way).

The corrupted tar looks like this:

*$ xxd /tmp/corrupt/Moose-0.01/Changes
0000000: 3230 3132 2d30 312d 3031 2020 302e 3031  2012-01-01  0.01
0000010: 2020 4669 7273 7420 7265 6c65 6173 6520    First release 
0000020: 2d20 636f 6465 6e61 6d65 2027 4dfc 6e63  - codename 'M.nc
0000030: 6865 6e27 0a                             hen'.
grantm commented 12 years ago

So on the Perl with the good hex dump and the failing test, the contents of the file are being corrupted by the /source handler :-(

On the Perl that produces a corrupt tar, not only is the file not UTF8, a trailing LF is magically added to the file.

Are both Perls using the same version of CPAN::Faker?

grantm commented 12 years ago

The corruption of the data seems most likely to be happening in the JSONP view . With this debugging added ...

--- a/lib/MetaCPAN/Server/View/JSONP.pm
+++ b/lib/MetaCPAN/Server/View/JSONP.pm
@@ -12,7 +12,9 @@ sub process {
         local($/);
         $body = <$body>;
     }
+use Devel::Peek qw(Dump); Dump($body);
     $body = decode_utf8($body);
+use Devel::Peek qw(Dump); Dump($body);
     my $content_type = $c->res->content_type;
     return 1 if($content_type eq 'text/javascript');
     if($content_type ne 'application/json') {

I get this before and after output:

SV = PV(0xc9b5b20) at 0xae352f0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0xc9b4900 "2012-01-01  0.01  First release - codename 'M\303\274nchen'"\0
  CUR = 53
  LEN = 80
SV = PV(0xc9b5b20) at 0xae352f0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0xc97c4a0 "2012-01-01  0.01  First release - codename 'M\303\274nchen'"\0 [UTF8 "2012-01-01  0.01  First release - codename 'M\x{fc}nchen'"]
  CUR = 53
  LEN = 56
rwstauner commented 12 years ago
SV = PVIV(0x726c9d8) at 0x47ff410
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  IV = 0
  PV = 0x72cfe40 "2012-01-01  0.01  First release - codename 'M\374nchen'\n"\0
  CUR = 53
  LEN = 80
SV = PVIV(0x726c9d8) at 0x47ff410
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  IV = 0
  PV = 0x63258c0 "2012-01-01  0.01  First release - codename 'M\357\277\275nchen'\n"\0 [UTF8 "2012-01-01  0.01  First release - codename 'M\x{fffd}nchen'\n"]
  CUR = 55
  LEN = 56

I get that with 5.14.2's corrupted tar and 5.10.1's uncorrupted tar.

What's your perl -V?

rwstauner commented 12 years ago

and what's your lang set to?

env | grep -iE '^(lang|lc|local)'

ES version?

rwstauner commented 12 years ago

/me is grasping at straws

grantm commented 12 years ago

I don't have a LANG setting at all (no output from that env | grep command) so I'm defaulting to the POSIX / "C" locale.

In your Devel::Peek output the 'before' string is wrong. The problem is not that the decode_utf8 in the JSONP handler is corrupting the data, it's that the data is already corrupt immediately after reading the file.

Unless I misread your earlier comment, a hex dump of var/tmp/source/DOY/Moose-0.01/Moose-0.01/Changes showed that the "ü" character was represented in the file as the 2-byte utf8 sequence: "c3 bc". However when the contents of the file are slurped in by the JSONP handler the character is represented as "\374" and the string does not have the utf8 flag set. That's where things are going wrong. I don't understand how reading the file with <...> converts a 2-byte sequence into a single byte. At the time the file is read it should be being interpreted as raw bytes.

Do you have some ENV variable or Perl flag set somewhere that automatically enables UTF8 translations on all filehandles? However if you did, I'd expect the slurped-in string to have the utf8 flag set.

rwstauner commented 12 years ago

I have my LANG set to en_US.UTF-8. I do not set PERL_UNICODE in fear of running into issues like this. I've tried running the test with a few perl -C settings but that hasn't helped either

Running it with a bare env also doesn't seem to change anything.

rwstauner commented 12 years ago

5.16.1 doesn't help either

rwstauner commented 12 years ago

boo hoo i cry...

even my system perl with a local::lib full of deps has this problem (including corrupted tar)... wth?

rwstauner commented 12 years ago

@grantm, mind pasting your perl -V here? it's the last thing i can think of comparing before i give up. thanks.

grantm commented 12 years ago

This is a minimal test script which works for me and in theory will fail for you. It contains no encoding conversions.

#!/usr/bin/perl

use strict;
use warnings;

use Test::More;
use Path::Class qw(file);
use Devel::Peek;

my $changes_file = 'var/tmp/source/DOY/Moose-0.01/Moose-0.01/Changes';

my $path = file($changes_file);

my $body = $path->openr;

$body = do {
    local($/);
    $body = <$body>;
};

# Unicode "ü" = U+00FC  =  UTF8: \xC3\xBC
# Hex => Octal:  \xC3 => \303    \xBC => \274
like $body, qr/\303\274/, "File body contains UTF8 bytes"
    or Dump($body);

done_testing;
grantm commented 12 years ago

Here's my perl -V

Summary of my perl5 (revision 5 version 10 subversion 1) configuration:
  Commit id: 
  Platform:
    osname=linux, osvers=2.6.32-5-686, archname=i486-linux-gnu-thread-multi
    uname='linux callisto 2.6.32-5-686 #1 smp mon jun 13 04:13:06 utc 2011 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.1 -Dsitearch=/usr/local/lib/perl/5.10.1 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.10.1 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.4.5', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /usr/lib64
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.11.2.so, so=so, useshrplib=true, libperl=libperl.so.5.10.1
    gnulibc_version='2.11.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib -fstack-protector'

Characteristics of this binary (from libperl): 
  Compile-time options: MULTIPLICITY PERL_DONT_CREATE_GVSV
                        PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP USE_ITHREADS
                        USE_LARGE_FILES USE_PERLIO USE_REENTRANT_API
  Locally applied patches:
        DEBPKG:debian/arm_thread_stress_timeout - http://bugs.debian.org/501970 Raise the timeout of ext/threads/shared/t/stress.t to accommodate slower build hosts
        DEBPKG:debian/cpan_config_path - Set location of CPAN::Config to /etc/perl as /usr may not be writable.
        DEBPKG:debian/cpan_definstalldirs - Provide a sensible INSTALLDIRS default for modules installed from CPAN.
        DEBPKG:debian/db_file_ver - http://bugs.debian.org/340047 Remove overly restrictive DB_File version check.
        DEBPKG:debian/doc_info - Replace generic man(1) instructions with Debian-specific information.
        DEBPKG:debian/enc2xs_inc - http://bugs.debian.org/290336 Tweak enc2xs to follow symlinks and ignore missing @INC directories.
        DEBPKG:debian/errno_ver - http://bugs.debian.org/343351 Remove Errno version check due to upgrade problems with long-running processes.
        DEBPKG:debian/extutils_hacks - Various debian-specific ExtUtils changes
        DEBPKG:debian/fakeroot - Postpone LD_LIBRARY_PATH evaluation to the binary targets.
        DEBPKG:debian/instmodsh_doc - Debian policy doesn't install .packlist files for core or vendor.
        DEBPKG:debian/ld_run_path - Remove standard libs from LD_RUN_PATH as per Debian policy.
        DEBPKG:debian/libnet_config_path - Set location of libnet.cfg to /etc/perl/Net as /usr may not be writable.
        DEBPKG:debian/m68k_thread_stress - http://bugs.debian.org/495826 Disable some threads tests on m68k for now due to missing TLS.
        DEBPKG:debian/mod_paths - Tweak @INC ordering for Debian
        DEBPKG:debian/module_build_man_extensions - http://bugs.debian.org/479460 Adjust Module::Build manual page extensions for the Debian Perl policy
        DEBPKG:debian/perl_synopsis - http://bugs.debian.org/278323 Rearrange perl.pod
        DEBPKG:debian/prune_libs - http://bugs.debian.org/128355 Prune the list of libraries wanted to what we actually need.
        DEBPKG:debian/use_gdbm - Explicitly link against -lgdbm_compat in ODBM_File/NDBM_File. 
        DEBPKG:fixes/assorted_docs - http://bugs.debian.org/443733 [384f06a] Math::BigInt::CalcEmu documentation grammar fix
        DEBPKG:fixes/net_smtp_docs - http://bugs.debian.org/100195 [rt.cpan.org #36038] Document the Net::SMTP 'Port' option
        DEBPKG:fixes/processPL - http://bugs.debian.org/357264 [rt.cpan.org #17224] Always use PERLRUNINST when building perl modules.
        DEBPKG:debian/perlivp - http://bugs.debian.org/510895 Make perlivp skip include directories in /usr/local
        DEBPKG:fixes/pod2man-index-backslash - http://bugs.debian.org/521256 Escape backslashes in .IX entries
        DEBPKG:debian/disable-zlib-bundling - Disable zlib bundling in Compress::Raw::Zlib
        DEBPKG:fixes/kfreebsd_cppsymbols - http://bugs.debian.org/533098 [3b910a0] Add gcc predefined macros to $Config{cppsymbols} on GNU/kFreeBSD.
        DEBPKG:debian/cpanplus_definstalldirs - http://bugs.debian.org/533707 Configure CPANPLUS to use the site directories by default.
        DEBPKG:debian/cpanplus_config_path - Save local versions of CPANPLUS::Config::System into /etc/perl.
        DEBPKG:fixes/kfreebsd-filecopy-pipes - http://bugs.debian.org/537555 [16f708c] Fix File::Copy::copy with pipes on GNU/kFreeBSD
        DEBPKG:fixes/anon-tmpfile-dir - http://bugs.debian.org/528544 [perl #66452] Honor TMPDIR when open()ing an anonymous temporary file
        DEBPKG:fixes/abstract-sockets - http://bugs.debian.org/329291 [89904c0] Add support for Abstract namespace sockets.
        DEBPKG:fixes/hurd_cppsymbols - http://bugs.debian.org/544307 [eeb92b7] Add gcc predefined macros to $Config{cppsymbols} on GNU/Hurd.
        DEBPKG:fixes/autodie-flock - http://bugs.debian.org/543731 Allow for flock returning EAGAIN instead of EWOULDBLOCK on linux/parisc
        DEBPKG:fixes/archive-tar-instance-error - http://bugs.debian.org/539355 [rt.cpan.org #48879] Separate Archive::Tar instance error strings from each other
        DEBPKG:fixes/positive-gpos - http://bugs.debian.org/545234 [perl #69056] [c584a96] Fix \G crash on first match
        DEBPKG:debian/devel-ppport-ia64-optim - http://bugs.debian.org/548943 Work around an ICE on ia64
        DEBPKG:fixes/trie-logic-match - http://bugs.debian.org/552291 [perl #69973] [0abd0d7] Fix a DoS in Unicode processing [CVE-2009-3626]
        DEBPKG:fixes/hppa-thread-eagain - http://bugs.debian.org/554218 make the threads-shared test suite more robust, fixing failures on hppa
        DEBPKG:fixes/crash-on-undefined-destroy - http://bugs.debian.org/564074 [perl #71952] [1f15e67] Fix a NULL pointer dereference when looking for a DESTROY method
        DEBPKG:fixes/tainted-errno - http://bugs.debian.org/574129 [perl #61976] [be1cf43] fix an errno stringification bug in taint mode
        DEBPKG:fixes/safe-upgrade - http://bugs.debian.org/582978 Upgrade Safe.pm to 2.25, fixing CVE-2010-1974
        DEBPKG:fixes/tell-crash - http://bugs.debian.org/578577 [f4817f3] Fix a tell() crash on bad arguments.
        DEBPKG:fixes/format-write-crash - http://bugs.debian.org/579537 [perl #22977] [421f30e] Fix a crash in format/write
        DEBPKG:fixes/arm-alignment - http://bugs.debian.org/289884 [f1c7503] Prevent gcc from optimizing the alignment test away on armel
        DEBPKG:fixes/fcgi-test - Fix a failure in CGI/t/fast.t when FCGI is installed
        DEBPKG:fixes/hurd-ccflags - http://bugs.debian.org/587901 Make hints/gnu.sh append to $ccflags rather than overriding them
        DEBPKG:debian/squelch-locale-warnings - http://bugs.debian.org/508764 Squelch locale warnings in Debian package maintainer scripts
        DEBPKG:fixes/lc-numeric-docs - http://bugs.debian.org/379329 [perl #78452] [903eb63] LC_NUMERIC documentation fixes
        DEBPKG:fixes/lc-numeric-sprintf - http://bugs.debian.org/601549 [perl #78632] [b3fd614] Fix sprintf not to ignore LC_NUMERIC with constants
        DEBPKG:fixes/concat-stack-corruption - http://bugs.debian.org/596105 [perl #78674] [e3393f5] Fix stack pointer corruption in pp_concat() with 'use encoding'
        DEBPKG:fixes/cgi-multiline-header - http://bugs.debian.org/606995 [CVE-2010-2761 CVE-2010-4410 CVE-2010-4411] CGI.pm MIME boundary and multiline header vulnerabilities
        DEBPKG:fixes/casing-taint-cve-2011-1487 - http://bugs.debian.org/622817 [perl #87336] fix unwanted taint laundering in lc(), uc() et al.
        DEBPKG:fixes/safe-reval-rdo-cve-2010-1447 - [PATCH] Wrap by default coderefs returned by rdo and reval
        DEBPKG:patchlevel - http://bugs.debian.org/567489 List packaged patches for 5.10.1-17squeeze2 in patchlevel.h
  Built under linux
  Compiled at Jun 30 2011 19:08:06
  %ENV:
    PERL5LIB="/home/grant/perl5/lib/perl5/i486-linux-gnu-thread-multi:/home/grant/perl5/lib/perl5"
    PERL_MM_OPT="INSTALL_BASE=/home/grant/perl5"
  @INC:
    /home/grant/perl5/lib/perl5/i486-linux-gnu-thread-multi
    /home/grant/perl5/lib/perl5/i486-linux-gnu-thread-multi
    /home/grant/perl5/lib/perl5
    /etc/perl
    /usr/local/lib/perl/5.10.1
    /usr/local/share/perl/5.10.1
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.10
    /usr/share/perl/5.10
    /usr/local/lib/site_perl
    .
rwstauner commented 12 years ago

that's not a 64bit perl.

Thanks for the test script. It failed as expected. It also pointed out to me half of my problem.

I was not deleting my var/tmp/source between runs (just t/var/tmp). So, now I get the tests to pass on 5.10.1 (when the tar is not corrupted).

I think that narrows down my problem to figuring out which module is corrupting my tar.

Another mildly interesting note is that file that works does not end with a newline, but the corrupted one does.

I apologize for the wild-goose-chase portion of all of this, but thank you very much for your help.

rwstauner commented 12 years ago

I don't know why it affects 5.14.2 and not 5.10.1 (somebody smarter than me probably does), but I'm pretty sure Module::Faker should be encoding something that it isn't, so i'll submit a bug report to rjbs.

rwstauner commented 12 years ago

https://github.com/rjbs/module-faker/pull/4