kasei / perl-iri

Perl implementation of Internationalized Resource Identifiers (IRIs)

please document difference from URI->as_iri #2

Open jonassmedegaard opened 9 years ago

jonassmedegaard commented 9 years ago

Today I noticed URI->as_iri and wonder what the benefit of this IRI module is. The code indicates it is deliberately separate from URI, and I guess it might follow the referenced RFC more strictly, where URI is perhaps more relaxed.

Rather than asking on IRC to please my own curiosity, I imagine others might benefit from such clarification too, hence opening as an issue here :-)

Here's the adaptation I did to IRI SYNOPSIS to verify that indeed URI-> can at least superficially mimic that sample use case:

use feature 'say';  # say() is not enabled by default
use URI;

my $i = URI->new('https://example.org:80/index#frag');
say $i->scheme; # 'https'
say $i->path; # '/index'

my $base = URI->new("http://www.hestebedg\x{e5}rd.dk/");
my $j = URI->new('#frag');
say $j->abs($base)->as_iri; # 'http://www.hestebedgård.dk/#frag'

kasei commented 9 years ago

I'll write up some explanation in the docs, but briefly, URI doesn't fully support unicode end-to-end. The fact that it looks right in your example is probably because your terminal is expecting utf8. $j->abs($base)->as_iri is returning a utf8 encoded byte string, not an IRI. Here's a better example:

use utf8;
use URI;
use IRI;
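use Devel::Peek;  # provides Dump(), which writes to STDERR

# Build the value as a character string and force Perl's internal UTF-8 representation.
my $value   = "http://www.hestebedg\x{e5}rd.dk/#frag";
utf8::upgrade($value);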
print STDERR "Raw value: ";
Dump($value);

my $uri = URI->new($value);
print STDERR "URI as_iri: ";
Dump($uri->as_iri);

my $iri = IRI->new($value);
print STDERR "IRI as_string: ";
Dump($iri->as_string);

which outputs:

Raw value: SV = PV(0x7ff3c2805270) at 0x7ff3c282e7b8
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x7ff3c2467520 "http://www.hestebedg\303\245rd.dk/#frag"\0 [UTF8 "http://www.hestebedg\x{e5}rd.dk/#frag"]
  CUR = 33
  LEN = 64
URI as_iri: SV = PVMG(0x7ff3c2c1d610) at 0x7ff3c2f4c180
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK)
  IV = 0
  NV = 0
  PV = 0x7ff3c25f43f0 "http://www.hestebedg\345rd.dk/#frag"\0
  CUR = 32
  LEN = 48
IRI as_string: SV = PV(0x7ff3c2eb2100) at 0x7ff3c2f06568
  REFCNT = 1
  FLAGS = (TEMP,POK,IsCOW,pPOK,UTF8)
  PV = 0x7ff3c2467520 "http://www.hestebedg\303\245rd.dk/#frag"\0 [UTF8 "http://www.hestebedg\x{e5}rd.dk/#frag"]
  CUR = 33
  LEN = 64
  COW_REFCNT = 4

Both the raw input value and the IRI->as_string method result in a unicode string, while the URI->as_iri method returns a byte sequence that was produced by utf8 encoding.
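
For anyone who wants to see the character-string versus UTF-8-octet distinction without reading a full Devel::Peek dump, here is a minimal sketch using only core Perl (Encode, utf8::is_utf8). It does not call URI or IRI at all, and the helper name describe() is made up for illustration. Note that the UTF8 flag only describes internal storage, so the length and the comparison are the more telling parts.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Character string: U+00E5 counts as a single character.
my $chars = "http://www.hestebedg\x{e5}rd.dk/#frag";
utf8::upgrade($chars);    # changes internal storage only, not the content

# Octet string: the same text encoded as UTF-8 (U+00E5 becomes 0xC3 0xA5).
my $octets = encode('UTF-8', $chars);

# Hypothetical helper: report the length and the internal UTF8 flag.
sub describe {
    my ($label, $s) = @_;
    return sprintf '%s: length=%d utf8_flag=%d',
        $label, length($s), utf8::is_utf8($s) ? 1 : 0;
}

print describe('chars',  $chars),  "\n";
print describe('octets', $octets), "\n";

# Encoded octets are a different string until they are decoded again.
my $same_before = ($octets eq $chars) ? 'yes' : 'no';
my $roundtrip   = decode('UTF-8', $octets);
my $same_after  = ($roundtrip eq $chars) ? 'yes' : 'no';

print "equal before decode: $same_before\n";
print "equal after decode:  $same_after\n";
```

If code downstream expects characters, handing it the octets is what tends to cause double encoding later, which is why a consistent return type matters.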

jonassmedegaard commented 9 years ago

Quoting Gregory Todd Williams (2014-12-24 20:40:26)

> I'll write up some explanation in the docs, but briefly, URI doesn't fully support unicode end-to-end. The fact that it looks right in your example is probably because your terminal is expecting utf8. $j->abs($base)->as_iri is returning a utf8 encoded byte string, not an IRI. Here's a better example:

Thanks for both the explanation and the example (I shall try to remember Devel::Peek next time I have Unicode trouble). And thanks for IRI! :-)

kasei commented 9 years ago

That being said, there are cases where a unicode string will pop out of as_iri. It's the inconsistency that's the problem if you need to rely on proper unicode handling for IRIs.