DecodedURL.to_uri is inconsistent with DecodedURL.normalize, .child, etc

glyph commented 3 years ago

I was pretty surprised to discover just now that DecodedURL.to_uri returns a URL object, rather than a DecodedURL. Given that almost all the other methods wrap the passed-through object, why does this one not?

I'm not sure how we get out of the compatibility jam if we decide this is wrong, but it seems wrong to me.

mahmoud commented 3 years ago

Yeah the wrapped surface area is not the clearest. to_uri, to_iri, and to_text are all passthroughs. to_text makes immediate sense to keep as a passthrough, and it's the one I use most. I'm less sure about the other two.

I'm curious, what were you planning on doing with the URL once you got it? The docstring of to_uri says

This is useful to do in preparation for sending a :class:URL over a network protocol.

But why one wouldn't go straight to to_text is less clear to me.

wsanchez commented 3 years ago

But why one wouldn't go straight to to_text is less clear to me.

Perhaps that's an argument to deprecate to_uri (and to_iri) altogether?

It would obliquely dodge the compatibility jam by removing the API instead of changing it in a way that's hard to reconcile.

glyph commented 3 years ago

@mahmoud to_text converts your URI to text, for reading by a human (or embedding in a web page or a tweet or whatever), in its exact current form.

to_uri converts your URL to an ascii-only representation, which can be used to synthesize the bytes required for an HTTP request, or for storage in a system which is going to make HTTP requests with a client that may not be as sophisticated as a browser, or as Hyperlink itself, or for anywhere you need something non-unicode-aware.

It converts to a URL object rather than text or bytes directly because you still might want to manipulate it in that state, or to inspect its differences.

I like what @wsanchez is saying here though, in the sense that we should probably have better names for these. At the time, neck-deep in protocol specifications, asURI() was the intuitively obvious choice here, but in the harsh light of the following day's second system syndrome, it's clear that this nomenclature is obscure.

I am struggling to come up with something even marginally succinct, and I hate these names, but just as a starting point for discussion: as_ascii_only_for_machines and as_readable_text_for_humans ?

glyph commented 3 years ago

I'm curious, what were you planning on doing with the URL once you got it? The docstring of to_uri says

This is useful to do in preparation for sending a :class:URL over a network protocol.

But why one wouldn't go straight to to_text is less clear to me.

It's true that I think this is a low-urgency issue precisely because 99.9% of uses of to_uri will be followed by an immediate to_text. (But that also implies that maybe we can get away with a compatibility exception.)

mahmoud commented 3 years ago

Interesting. Well, my first instinct is to add an as_ascii or ascii flag to to_text and then talk about the use cases in the docs a bit. 2c, etc.

wsanchez commented 3 years ago

I was going to suggest as_text and as_ascii for @glyph's long names.

wsanchez commented 3 years ago

And still deprecate as_uri. You can get the EncodedURL back out if you need that, but it seems we think you probably don't.

mahmoud commented 3 years ago

I like that, too.

glyph commented 3 years ago

Just wanted to point out that we should not have both "as_text" and "to_text" :). And we need more than a text-based manipulation here, the thing the methods do (give you a URL) is useful.

This is an interesting bikeshed, though. Another possible color: .asciify() / .readable() ?

glyph commented 3 years ago

Or ... .internationalize() ?

glyph commented 3 years ago

.for_display() / .for_wire()

glyph commented 3 years ago

Apparently the RFC calls them .ToASCII() and .ToUnicode() and maybe we should just follow suit: https://tools.ietf.org/html/rfc3490#section-4.1

glyph commented 3 years ago

(okay okay I'm actually done for now)

twm commented 3 years ago

The RFC names make more sense than the status quo to me

twm commented 3 years ago

Ooops, sorry about that. Tab order strikes again.

wsanchez commented 3 years ago

@glyph confused the hell out of me above. Oh, my bad. No, I don't was to add as_text; I was mis-remembering the name of to_text.

I mean to propose that we keep to_text as-is, which is the equivalent of the RFC's toUnicode, and add to_ascii for the equivalent of toASCII, and call it a day. Twisted-flavored names could be toText and toASCII, if we are still bothering with such things, which I would stop doing because it feels like busy-work.

(Which is to say that I think "text" is a fine synonym for "Unicode".)

glyph commented 3 years ago

@glyph confused the hell out of me above. Oh, my bad. No, I don't was to add as_text; I was mis-remembering the name of to_text.

Ah. Whew.

I mean to propose that we keep to_text as-is, which is the equivalent of the RFC's toUnicode, and add to_ascii for the equivalent of toASCII, and call it a day.

I'm fairly certain that the equivalent of the RFC's ToUnicode is .as_iri(), and ToAscii is .as_uri(). Converting to/from actual python strings doesn't ever do character conversion.

Twisted-flavored names could be toText and toASCII, if we are still bothering with such things, which I would stop doing because it feels like busy-work.

I think the idea of the twisted-style names was for compatibility only. I would not suggest adding both on any new functionality.

(Which is to say that I think "text" is a fine synonym for "Unicode".)

Normally I'd agree but there are a number of confounding nuances here. One is that we aren't talking about literal Python types, but rather encoding notations, and also that "unicode" is specifically in contrast to "ascii" in this case... which is also text. Just different text.

glyph commented 3 years ago

Re-reading this I realized I didn't quite answer @mahmoud's question above:

I'm curious, what were you planning on doing with the URL once you got it? The docstring of to_uri says

This is useful to do in preparation for sending a :class:URL over a network protocol.

But why one wouldn't go straight to to_text is less clear to me.

to_text doesn't actually work for what you would need in a protocol library. Consider the following.

When you issue a request for https://ايران.example.com/foo⇧bar/, your (HTTP/1.1, gosh I don't even know what these fancy new protocols look like) bytes for "Please retrieve this" look like:

GET /foo%E2%87%A7bar/
Host: xn--mgba3a4fra.example.com

This requires that the HTTP protocol implementation disassemble the URL into its constituent parts, then decode various different bits of it as ASCII; the hostname, the path, the query, and not the fragment. And of course, to issue a matching DNS query and TLS certificate validation. But only the .to_iri() version.

On the flip side, if the HTTP implementation wants to take those bytes and give you the answer to "what URL was requested", it has to do this same process in reverse. Now, arguably, the server should hand you a totally pristine URL object, as unmodified as possible from the wire; but you may want to read some decoded unicode stuff out of its constituent parts as well, so the .to_iri() (or ToUnicode) normalization needs to exist in the other direction.

There's a case to be made, I guess, that DecodedURL should always operate as if it's an IRI as you access its data... it will do this at least in part.

P.S.: While I was attempting to construct an example URI to comment on this issue, I happened across this web page, which causes a lot of problems with Hyperlink due to this bug. Hyperlink's behavior is arguably even worse than IDNA's, since we let you construct the "invalid" (not really invalid) URL, but then we don't let you convert it.

python-hyper / hyperlink

DecodedURL.to_uri is inconsistent with DecodedURL.normalize, .child, etc #144