shlomif / perl-XML-LibXML

The XML-LibXML CPAN Distribution for Processing XML using the libxml2 library
https://metacpan.org/release/XML-LibXML

XML::LibXML->load_xml( string => ...) fails to build a correct object depending on Perl's scalar internal representation #72

Open sblondeel opened 1 year ago

sblondeel commented 1 year ago

Hello,

I like to hang out in cafés or pubs. Hence this XSD:

File schema.xsd (ASCII):

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="place">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:enumeration value="caf&#xe9;" />
        <xsd:enumeration value="pub" />
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>
</xsd:schema>

What matters is that the word café contains a non-ASCII character.

This is a valid XML document for this XSD (at least xmllint --schema agrees on that):

<?xml version="1.0" encoding="utf-8"?><place>café</place>

where the "é" character is encoded as bytes 0xc3 0xa9:

$ hexdump -C doc-u8.xml 
00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version="1|
00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 75 74  |.0" encoding="ut|
00000020  66 2d 38 22 3f 3e 3c 70  6c 61 63 65 3e 63 61 66  |f-8"?><place>caf|
00000030  c3 a9 3c 2f 70 6c 61 63  65 3e 0a                 |..</place>.|
0000003b

This is another valid XML document with another choice of encoding in the preamble:

<?xml version="1.0" encoding="iso-8859-1"?><place>café</place>

where the "é" character is encoded as byte 0xe9:

$ hexdump -C doc-l1.xml 
00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version="1|
00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 69 73  |.0" encoding="is|
00000020  6f 2d 38 38 35 39 2d 31  22 3f 3e 3c 70 6c 61 63  |o-8859-1"?><plac|
00000030  65 3e 63 61 66 e9 3c 2f  70 6c 61 63 65 3e 0a     |e>caf.</place>.|
0000003f
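For reference, the two byte sequences seen in the hexdumps can be reproduced from a single character string with the core Encode module. This is only an illustrative sketch; the variable names are made up for the example and are not part of the reproduction program.

```perl
use strict;
use warnings;
use Encode qw( encode );

# One character string: 'c', 'a', 'f', U+00E9 (four characters).
my $word = "caf\x{e9}";

# Encoding turns characters into octets; the result depends on the encoding chosen.
my $utf8_octets   = encode('UTF-8',      $word);
my $latin1_octets = encode('ISO-8859-1', $word);

# Render each octet as two hex digits, in the style of hexdump.
my $hex_utf8   = join ' ', map { sprintf '%02x', ord } split //, $utf8_octets;
my $hex_latin1 = join ' ', map { sprintf '%02x', ord } split //, $latin1_octets;

print "UTF-8:      $hex_utf8\n";    # 63 61 66 c3 a9
print "ISO-8859-1: $hex_latin1\n";  # 63 61 66 e9
```

The same four characters yield five octets in UTF-8 and four in ISO-8859-1, which is why the encoding declaration in the XML preamble matters.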

However, this code

$document = XML::LibXML->load_xml( string => $xml_string );
print $document->serialize();

(A source of confusion is that the argument is called "string" when it should be called "octets_stream" or something similar. A string is an array of Unicode characters, whereas an XML serialization is an array of bytes with no inherent meaning, which become characters only once an encoding has been chosen.)

does not return consistent results (and even fails sometimes), considering the Cartesian product of:

Below is a Perl program that uses the XSD schema above and demonstrates this (I cannot drag'n'drop on my platform):

However, Perl's behaviour should not depend on the internal representation of scalars (see perldoc utf8).
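To illustrate what "internal representation" means here, the following minimal sketch uses only the built-in utf8:: functions (no XML::LibXML involved). The variable names are hypothetical. It shows that upgrading or downgrading a scalar changes the UTF8 flag but not the string as seen from Perl code.

```perl
use strict;
use warnings;

# The same four-character string, held in two internal representations.
my $downgraded = "caf\x{e9}";
my $upgraded   = "caf\x{e9}";
utf8::downgrade($downgraded);   # one byte per character internally, flag off
utf8::upgrade($upgraded);       # UTF-8 internally, flag on

my $flag_down = utf8::is_utf8($downgraded) ? 1 : 0;
my $flag_up   = utf8::is_utf8($upgraded)   ? 1 : 0;

# At the Perl level the two scalars compare equal and have the same length.
my $same_string = $downgraded eq $upgraded ? 1 : 0;
my $same_length = length($downgraded) == length($upgraded) ? 1 : 0;

print "flags: $flag_down / $flag_up\n";
print "equal: $same_string, same length: $same_length\n";
```

Since the two scalars are equal as far as Perl is concerned, one would expect any function receiving them to treat them identically, which is the expectation the report is built on.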

Regards,

File validate.pl (UTF-8):

#! /usr/bin/perl
use utf8;
use warnings;
use strict;
use feature 'say';
use Carp;
use Data::Dumper;
use English qw( -no_match_vars );
use File::Slurp qw( read_file );
use Term::ANSIColor qw( colored );
use XML::LibXML;
use Readonly;

Readonly my $XML_U8   => '<?xml version="1.0" encoding="utf-8"?><place>café</place>';
Readonly my $XML_L1   => "<?xml version='1.0' encoding='iso-8859-1'?><place>caf\x{e9}</place>";
Readonly my $XSD_FILE => 'schema.xsd';

sub dumper {
  my ($str) = @_;
  my $res = Dumper($str);
  $res =~ s{\$VAR1 \s+ = \s+ (.*) ;\s*$}{$1}sx;
  $res =~ s{([^\x00-\x7e])}{colored((sprintf '[0x%02x]', ord $1), 'bold red')}gsex;
  return $res;
}

sub validation_error {
  say STDERR "VALIDATION ERROR: " . Dumper($EVAL_ERROR);
}

sub validate {
  my ($xsd_file, $xml_string) = @_;

  say STDERR sprintf "\nValidating [%s]...", dumper($xml_string);

  my ($schema, $document);

  eval { $schema = XML::LibXML::Schema->new( location => $xsd_file ); 1 } or return validation_error();

  eval { $document = XML::LibXML->load_xml( string => $xml_string ); 1 } or return validation_error();
  say STDERR sprintf "XML parsed document reserialized: %s\n", dumper($document->serialize());

  eval { $schema->validate($document); 1 } or return validation_error();

  my $utf8_flag = utf8::is_utf8($xml_string) ? 1 : 0;

  say STDERR sprintf "xml_string [%s] -- is_utf8? [%s] -- validates with respect to [%s]", dumper($xml_string), $utf8_flag, $xsd_file;
  return;
}

validate($XSD_FILE, $XML_U8);

validate($XSD_FILE, $XML_L1);

my $xml_l1 = $XML_L1;
utf8::upgrade($xml_l1);
validate($XSD_FILE, $xml_l1);

my $xml_u8 = $XML_U8;
utf8::downgrade($xml_u8);
validate($XSD_FILE, $xml_u8);

exit 0;