(a source of confusion is that the argument is called "string" when it
should be called "octets_stream" or something similar. A string is an array of
Unicode characters, whereas an XML serialization is an array of bytes with no
inherent meaning, which become characters only once an encoding is chosen; that
is what an XML document serialization is)
does not return consistent results (and even fails sometimes), considering the Cartesian product of:
choice of the XML preamble's encoding (iso-8859-1 or utf-8)
choice of the internal representation of the scalar in Perl (UTF8 flag on or off, as reported by utf8::is_utf8)
The Perl program below uses the XSD schema above and demonstrates this (I cannot drag'n'drop on my platform):
$XML_U8 has utf-8 preamble and utf8 internal representation; all works well
$XML_L1 has latin-1 preamble and raw internal representation; all works well
However,
$xml_l1 has latin-1 preamble and utf8 internal representation; XML::LibXML fails to (re)serialise it as it was and therefore XML::LibXML::Schema fails to validate it
$xml_u8 has utf-8 preamble and raw internal representation; XML::LibXML cannot even load it as an XML document
Perl's behaviour should not depend on the internal representation of scalars (perldoc utf8).
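To illustrate that last point, here is a minimal sketch that is separate from validate.pl (the variable names and the internal_byte_count helper are made up for this example). It shows that utf8::upgrade leaves a string unchanged at the Perl level while changing the bytes held in the scalar's internal buffer. That a C-level library might read that buffer directly is only an assumption here, not something this sketch establishes.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length without enabling byte semantics

# Four characters; U+00E9 is stored as a single byte (UTF8 flag off).
my $raw = "caf\x{e9}";

# Same four characters, forced into Perl's internal UTF-8 representation.
my $upgraded = $raw;
utf8::upgrade($upgraded);

# Hypothetical helper: size in bytes of the scalar's internal buffer.
sub internal_byte_count {
    my ($str) = @_;
    return bytes::length($str);
}

printf "equal as strings: %s\n", ($raw eq $upgraded ? 'yes' : 'no');
printf "length in characters: %d vs %d\n", length($raw), length($upgraded);
printf "UTF8 flag: %d vs %d\n",
    (utf8::is_utf8($raw) ? 1 : 0), (utf8::is_utf8($upgraded) ? 1 : 0);
printf "internal bytes: %d vs %d\n",
    internal_byte_count($raw), internal_byte_count($upgraded);
```

The two scalars compare equal and have the same length in characters; only the flag and the internal byte count differ, which is the distinction a caller should never have to care about.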
Regards,
File validate.pl (UTF-8):
#! /usr/bin/perl
use utf8;
use warnings;
use strict;
use feature 'say';
use Carp;
use Data::Dumper;
use English qw( -no_match_vars );
use File::Slurp qw( read_file );
use Term::ANSIColor qw( colored );
use XML::LibXML;
use Readonly;
Readonly my $XML_U8 => '<?xml version="1.0" encoding="utf-8"?><place>café</place>';
Readonly my $XML_L1 => "<?xml version='1.0' encoding='iso-8859-1'?><place>caf\x{e9}</place>";
Readonly my $XSD_FILE => 'schema.xsd';
# Dump a scalar with Data::Dumper, showing every character outside 0x00-0x7e as a coloured hex code.
sub dumper {
    my ($str) = @_;
    my $res = Dumper($str);
    $res =~ s{\$VAR1 \s+ = \s+ (.*) ;\s*$}{$1}sx;
    $res =~ s{([^\x00-\x7e])}{colored((sprintf '[0x%02x]', ord $1), 'bold red')}gsex;
    return $res;
}
# Report the exception captured by the most recent eval.
sub validation_error {
    say STDERR "VALIDATION ERROR: " . Dumper($EVAL_ERROR);
}
# Parse $xml_string, print its reserialization, then validate it against the schema in $xsd_file.
sub validate {
    my ($xsd_file, $xml_string) = @_;
    say STDERR sprintf "\nValidating [%s]...", dumper($xml_string);
    my ($schema, $document);
    eval { $schema = XML::LibXML::Schema->new( location => $xsd_file ); 1 } or return validation_error();
    eval { $document = XML::LibXML->load_xml( string => $xml_string ); 1 } or return validation_error();
    say STDERR sprintf "XML parsed document reserialized: %s\n", dumper($document->serialize());
    eval { $schema->validate($document); 1 } or return validation_error();
    my $utf8_flag = utf8::is_utf8($xml_string) ? 1 : 0;
    say STDERR sprintf "xml_string [%s] -- is_utf8? [%s] -- validates with respect to [%s]", dumper($xml_string), $utf8_flag, $xsd_file;
    return;
}
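# The two cases reported as working.
validate($XSD_FILE, $XML_U8);
validate($XSD_FILE, $XML_L1);
# latin-1 preamble, upgraded to the utf8 internal representation.
my $xml_l1 = $XML_L1;
utf8::upgrade($xml_l1);
validate($XSD_FILE, $xml_l1);
# utf-8 preamble, downgraded to the raw internal representation.
my $xml_u8 = $XML_U8;
utf8::downgrade($xml_u8);
validate($XSD_FILE, $xml_u8);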
exit 0;
Hello,
I like to hang out in cafés or pubs. Hence this XSD:
File schema.xsd (ASCII):
What matters is that the word café contains a non-ASCII character.
This is a valid XML document for this XSD (at least xmllint --schema agrees on that):
where the "é" character is encoded as bytes 0xc3 0xa9:
This is another valid XML document with another choice of encoding in the preamble:
where the "é" character is encoded as byte 0xe9:
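As a small illustration of the two byte sequences mentioned above, here is a sketch using the core Encode module (the hex_octets helper and the variable names are made up for this example). It encodes the single character U+00E9 under each of the two encodings and prints the resulting octets.

```perl
use strict;
use warnings;
use Encode qw( encode );

# "\x{e9}" is the single character U+00E9 (e with acute accent).
my $char = "\x{e9}";

# Turn the character into octets under each encoding.
my $utf8_octets   = encode('UTF-8', $char);
my $latin1_octets = encode('iso-8859-1', $char);

# Hypothetical helper: render octets as space-separated hex.
sub hex_octets {
    my ($octets) = @_;
    return join ' ', map { sprintf('0x%02x', ord $_) } split //, $octets;
}

print 'utf-8:      ', hex_octets($utf8_octets),   "\n";    # 0xc3 0xa9
print 'iso-8859-1: ', hex_octets($latin1_octets), "\n";    # 0xe9
```

The same character yields two different octet sequences, so a parser can only interpret the octets correctly if they match the encoding declared in the preamble.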
However, this code