Open glyph opened 3 years ago
Yeah the wrapped surface area is not the clearest. to_uri
, to_iri
, and to_text
are all passthroughs. to_text
makes immediate sense to keep as a passthrough, and it's the one I use most. I'm less sure about the other two.
I'm curious, what were you planning on doing with the URL once you got it? The docstring of to_uri
says
This is useful to do in preparation for sending a :class:
URL
over a network protocol.
But why one wouldn't go straight to to_text
is less clear to me.
But why one wouldn't go straight to
to_text
is less clear to me.
Perhaps that's an argument to deprecate to_uri
(and to_iri
) altogether?
It would obliquely dodge the compatibility jam by removing the API instead of changing it in a way that's hard to reconcile.
@mahmoud to_text
converts your URI to text, for reading by a human (or embedding in a web page or a tweet or whatever), in its exact current form.
to_uri
converts your URL to an ascii-only representation, which can be used to synthesize the bytes required for an HTTP request, or for storage in a system which is going to make HTTP requests with a client that may not be as sophisticated as a browser, or as Hyperlink itself, or for anywhere you need something non-unicode-aware.
It converts to a URL object rather than text or bytes directly because you still might want to manipulate it in that state, or to inspect its differences.
I like what @wsanchez is saying here though, in the sense that we should probably have better names for these. At the time, neck-deep in protocol specifications, asURI()
was the intuitively obvious choice here, but in the harsh light of the following day's second system syndrome, it's clear that this nomenclature is obscure.
I am struggling to come up with something even marginally succinct, and I hate these names, but just as a starting point for discussion: as_ascii_only_for_machines
and as_readable_text_for_humans
?
I'm curious, what were you planning on doing with the URL once you got it? The docstring of
to_uri
saysThis is useful to do in preparation for sending a :class:
URL
over a network protocol.But why one wouldn't go straight to
to_text
is less clear to me.
It's true that I think this is a low-urgency issue precisely because 99.9% of uses of to_uri
will be followed by an immediate to_text
. (But that also implies that maybe we can get away with a compatibility exception.)
Interesting. Well, my first instinct is to add an as_ascii
or ascii
flag to to_text
and then talk about the use cases in the docs a bit. 2c, etc.
I was going to suggest as_text
and as_ascii
for @glyph's long names.
And still deprecate as_uri
. You can get the EncodedURL
back out if you need that, but it seems we think you probably don't.
I like that, too.
Just wanted to point out that we should not have both "as_text" and "to_text" :). And we need more than a text-based manipulation here, the thing the methods do (give you a URL
) is useful.
This is an interesting bikeshed, though. Another possible color: .asciify()
/ .readable()
?
Or ... .internationalize()
?
.for_display()
/ .for_wire()
Apparently the RFC calls them .ToASCII()
and .ToUnicode()
and maybe we should just follow suit: https://tools.ietf.org/html/rfc3490#section-4.1
(okay okay I'm actually done for now)
The RFC names make more sense than the status quo to me
Ooops, sorry about that. Tab order strikes again.
@glyph confused the hell out of me above. Oh, my bad. No, I don't was to add as_text
; I was mis-remembering the name of to_text
.
I mean to propose that we keep to_text
as-is, which is the equivalent of the RFC's toUnicode
, and add to_ascii
for the equivalent of toASCII
, and call it a day. Twisted-flavored names could be toText
and toASCII
, if we are still bothering with such things, which I would stop doing because it feels like busy-work.
(Which is to say that I think "text" is a fine synonym for "Unicode".)
@glyph confused the hell out of me above. Oh, my bad. No, I don't was to add
as_text
; I was mis-remembering the name ofto_text
.
Ah. Whew.
I mean to propose that we keep
to_text
as-is, which is the equivalent of the RFC'stoUnicode
, and addto_ascii
for the equivalent oftoASCII
, and call it a day.
I'm fairly certain that the equivalent of the RFC's ToUnicode
is .as_iri()
, and ToAscii
is .as_uri()
. Converting to/from actual python strings doesn't ever do character conversion.
Twisted-flavored names could be
toText
andtoASCII
, if we are still bothering with such things, which I would stop doing because it feels like busy-work.
I think the idea of the twisted-style names was for compatibility only. I would not suggest adding both on any new functionality.
(Which is to say that I think "text" is a fine synonym for "Unicode".)
Normally I'd agree but there are a number of confounding nuances here. One is that we aren't talking about literal Python types, but rather encoding notations, and also that "unicode" is specifically in contrast to "ascii" in this case... which is also text. Just different text.
Re-reading this I realized I didn't quite answer @mahmoud's question above:
I'm curious, what were you planning on doing with the URL once you got it? The docstring of
to_uri
saysThis is useful to do in preparation for sending a :class:
URL
over a network protocol.But why one wouldn't go straight to
to_text
is less clear to me.
to_text
doesn't actually work for what you would need in a protocol library. Consider the following.
When you issue a request for https://ايران.example.com/foo⇧bar/
, your (HTTP/1.1, gosh I don't even know what these fancy new protocols look like) bytes for "Please retrieve this" look like:
GET /foo%E2%87%A7bar/
Host: xn--mgba3a4fra.example.com
This requires that the HTTP protocol implementation disassemble the URL into its constituent parts, then decode various different bits of it as ASCII; the hostname, the path, the query, and not the fragment. And of course, to issue a matching DNS query and TLS certificate validation. But only the .to_iri()
version.
On the flip side, if the HTTP implementation wants to take those bytes and give you the answer to "what URL was requested", it has to do this same process in reverse. Now, arguably, the server should hand you a totally pristine URL object, as unmodified as possible from the wire; but you may want to read some decoded unicode stuff out of its constituent parts as well, so the .to_iri()
(or ToUnicode
) normalization needs to exist in the other direction.
There's a case to be made, I guess, that DecodedURL
should always operate as if it's an IRI as you access its data... it will do this at least in part.
P.S.: While I was attempting to construct an example URI to comment on this issue, I happened across this web page, which causes a lot of problems with Hyperlink due to this bug. Hyperlink's behavior is arguably even worse than IDNA's, since we let you construct the "invalid" (not really invalid) URL, but then we don't let you convert it.
I was pretty surprised to discover just now that
DecodedURL.to_uri
returns aURL
object, rather than aDecodedURL
. Given that almost all the other methods wrap the passed-through object, why does this one not?I'm not sure how we get out of the compatibility jam if we decide this is wrong, but it seems wrong to me.