It seems as if `b` is wrong here, and the `%` character should have been encoded. That is, `b.asText()` should return `'%253F'`, no?
It certainly feels like you're onto something. I can kind of explain why it does what it does, but I might want to wait til I'm in a less jet lagged state. I believe this behavior is inherited from t.p.URL, so @glyph may also be a good explainer in the meantime.
Here's a go anyways: the constructor, working from parts, is meant to work equally well with URIs and IRIs. Virtually no decoding happens in the constructor. When converting to text, minimal encoding is used. Percents are certainly valid in the path, and are left alone. A question mark is reserved for the query string, and is encoded. In short (and crudely), one is a URI and one is an IRI, according to this simplified explanation in the docs.
I hope I'm right and that makes sense, but am certainly open to correction. It's definitely good to explore these APIs and maybe add some better docs/FAQs, so thank you!
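For concreteness, a hedged sketch of the behavior being described, using hyperlink's `URL` constructor and `to_text()` (the original example used `asText()`). Note that a later comment in this thread changes the constructor to reject raw delimiters, so newer versions may raise instead of encoding:

```python
# Sketch only: illustrates "no decoding on input, minimal encoding on output".
from hyperlink import URL

a = URL(path=(u'?',))    # a raw reserved character in a path segment
b = URL(path=(u'%3F',))  # an already-escaped segment

print(a.to_text())  # expected '%3F': the reserved '?' gets encoded on output
print(b.to_text())  # expected '%3F': the existing percent escape is left alone
```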
Basically the idea here is that originally Twisted had a notion of the "real" value and the "encoded" value of a URL path segment; the "real" value had all the %xx values decoded and the "encoded" version was escaped.
Unfortunately this is overly simplistic, because of course a human being could type an "encoded" URL into a browser bar, and then that is the text that they wanted to see. Such values can be "down"-converted into ASCII-only machine-readable things that can be sliced up to go onto the wire ("URI") or "up"-converted into textual human-readable things that might be in a language other than English, and/or include emojis 🐮.
The problem with characters like `#` and `?` is that, while it is possible to "up"-convert a `%3F` into a literal question mark — for example, if you wanted to encode something a user typed into a URL segment and then get it back out again in application code — it is not possible to reflect this distinction in the stringified representation of the URL itself.
So the current behavior is an accident of the implementation, but without some further thought, also kind of inevitable? I would be interested in how we might preserve both behaviors.
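To make the round-trip problem concrete, a hedged sketch (assuming hyperlink's `URL.from_text`, `.path`, and `.query`): once a literal `?` appears in the rendered text, re-parsing treats it as the start of the query string, so the escaped/unescaped distinction cannot survive stringification.

```python
# Sketch only: the escaped and unescaped forms parse into different structures.
from hyperlink import URL

escaped = URL.from_text(u'http://example.com/a%3Fb')
print(escaped.path)   # expected ('a%3Fb',): a single path segment

decoded = URL.from_text(u'http://example.com/a?b')
print(decoded.path)   # expected ('a',)
print(decoded.query)  # expected (('b', None),): the '?' changed the structure
```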
I see two issues here:

The behavior in the example and noted by @glyph above looks like a bug to me, and should raise an exception if `?` or `#` are passed in -- they are not valid path elements. None of the RFC gen-delims are valid there, and any of them should cause an exception.
I recently ran into a similar issue in w3lib's URL canonicalizer (formerly scrapy's). Their existing function standardizes a `%23` to a `#`, which turns the rest of the path AND the query string into a fragment -- substantially changing the URL.
There is room for debate on how to handle the various sub-delims that can either appear in the path or be percent-encoded -- but `?` and `#` are not valid characters for the path in a URI or IRI. It's worth reading the RFC's section on the path about 10 times in a row, because you miss something every other read (at least I do).
A use case for the sub-delims is here:

```python
a = URL(path=(u",",))
b = URL(path=(u"%2C",))
```
But what about this URL?

```python
c = URL(path=(u"a,%2C,z",))
```
You can canonicalize it for fingerprinting as:

`a%2C%2C%2Cz`

But that has a fundamentally different meaning to a server than:

`a,%2C,z`
And for fun, how should this be handled?

```python
a = URL(path=(u"%",))
b = URL(path=(u"%25",))
```
IMHO, the safest/smartest bet is to require all input to be percent-encoded FOR THE SUB-DELIMS. You can keep a display version of the 'unescaped' form, but the only way to ensure you know what a URL intends is to be explicit at construction time.
edit: added "for the sub-delims" in the last paragraph.
Hey Jonathan! These are some good points and I see where you're coming from, but I recommend reviewing this bit of hyperlink design.
To summarize here, Hyperlink is meant to represent both escaped and unescaped versions. Interpretation of the constituent parts is based on usage. If you call `.to_uri()`, you're telling hyperlink to interpret the URL as a display-oriented IRI. Same goes for `.to_iri()` and URIs.
This contrasts with boltons' urlutils which fully decodes inputs, an approach which is less powerful because it doesn't allow for mixed representations (which are quite common in practice).
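A hedged sketch of that interpretation rule, assuming hyperlink's `URL.from_text()`, `to_iri()`, and `to_uri()` (exact output may vary by version):

```python
# Sketch only: the same object can be read as a URI (and up-converted)
# or as an IRI (and down-converted).
from hyperlink import URL

u = URL.from_text(u'http://example.com/caf%C3%A9')
print(u.to_iri().to_text())  # expected 'http://example.com/café'
print(u.to_uri().to_text())  # expected 'http://example.com/caf%C3%A9'

v = URL.from_text(u'http://example.com/café')
print(v.to_uri().to_text())  # expected 'http://example.com/caf%C3%A9'
```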
There are bound to be some corner cases and I'm glad we're working through them. A lot of these scenarios don't seem to have right answers, but I'm still confident with a bit more thought we will arrive at some pleasantly intuitive behaviors :)
I understand your points. The points I'm trying to convey are not coming across and are a bit nuanced, so I'll try again (if you'll put up with me). I've had my head buried in these RFCs and edge cases for the past month, so I've got a few of these concepts dancing in my head and keeping the right words from forming.
`#` and `?` should never be valid within the `path` argument, as they could never appear unencoded in the path -- decoding them would inherently change the URI. I'm not talking about creating a URL object from a full URL string, but specifically about using the keyword `path` argument to the constructor (as in the example above). The 'human'/browser-bar representation will always have these two elements encoded as part of the path, because decoded they would signify other components of the URL structure and necessarily point to a different resource.
For example, consider this URI:

`http://example.com/path/to/foo/%23/bar/biz?bang=foo&bash=widget#topbar`

The path must always be the following, and can never be decoded in a browser bar:

`example.com/path/to/foo/%23/bar/biz`
Unescaping the `%23` will turn everything after the decoded `#` into a URI fragment:

`example.com/path/to/foo/#/bar/biz?bang=foo&bash=widget#topbar`
The same holds true for `?`. If decoded within a path, it signifies the query string. If either of those elements were encountered as "input" to the `path` argument, you'd always display the encoded version. If someone were to submit either decoded character as part of the `path`, they are most likely dealing with a bug or using the library wrong.
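A hedged sketch of that restructuring, assuming hyperlink's `URL.from_text` (the exact segment tuples may vary, but the point is that a literal `#` moves everything after it into the fragment):

```python
# Sketch only: %23 stays a path segment; a literal '#' starts the fragment.
from hyperlink import URL

encoded = URL.from_text(
    u'http://example.com/path/to/foo/%23/bar/biz?bang=foo&bash=widget#topbar')
print(encoded.path)      # expected ('path', 'to', 'foo', '%23', 'bar', 'biz')
print(encoded.fragment)  # expected 'topbar'

decoded = URL.from_text(
    u'http://example.com/path/to/foo/#/bar/biz?bang=foo&bash=widget#topbar')
print(decoded.path)      # expected ('path', 'to', 'foo', '')
print(decoded.fragment)  # expected '/bar/biz?bang=foo&bash=widget#topbar'
```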
Using the examples in the docs, these two URLs will essentially always mean the same thing -- so you can easily change context between them without concern. It doesn't matter how this is represented.

`http://example.com/café`
`http://example.com/caf%C3%A9`
However, the RFC sub-delims and the percent sign are not guaranteed to be the same...

`!$&'()*+,;=` and `%`
I don't have an answer or a proposal for those characters. That's where the bulk of edge cases and exploration will be -- as there is a lot of mixed usage in there. IMHO, the input should be explicitly marked as encoded or not -- because they can shift the 'meaning' and intent of the URL.
It looks like the library is doing the right thing by those characters and leaving them as-is / not recoding them.
I see what you mean, and yes, enhancing the constructor behavior is seeming more and more like the way to go forward. I mentioned that over here, and I'm glad someone else with their head in the RFCs is kind of independently feeling that angle, too.
I'm still bothered by the fact that the `path` argument will accept both encoded and unencoded input. It seems wrong to accept already-encoded inputs here.
I think @glyph is saying that it's desirable for `str(URL.fromText(X))` to always give you back `X` and not a normalized form of `X`. I think that's nuts, because (a) as @glyph said, it's not possible and (b) this bug is the result of trying to do the impossible.
If you want the original string, save it somewhere; this isn't a string object.
I guess what I mean is that if `URL.fromText` needs to behave like a browser would, the associated weirdness should live there, not in `__init__`.
> I'm still bothered by the fact that the path argument will accept both encoded and unencoded input. It seems wrong to accept already encoded inputs here.
The problem is that "encoded" and "unencoded" are ambiguous terms here. `http://example.com/%FF` is a perfectly valid URI, with a perfectly valid textual representation. However, it is simultaneously an un-decodable IRI which has no valid transformation and also a perfectly valid IRI that is equivalent to the URI.

There are also representations of mixed validity, like `http://example.com/%FF/%C3%A9/`, which can be rendered either as-is or as `http://example.com/%FF/é`.
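A hedged sketch of the mixed-validity case, assuming hyperlink's `to_iri()` (behavior may vary by version): `%C3%A9` decodes cleanly to `é`, while `%FF` is not valid UTF-8 and is expected to be left percent-encoded.

```python
# Sketch only: mixed-validity decoding.
from hyperlink import URL

u = URL.from_text(u'http://example.com/%FF/%C3%A9/')
print(u.to_iri().to_text())  # expected something like 'http://example.com/%FF/é/'
```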
You can also have validation failures going in the other direction, with un-representable / invalid IDNA names in the domain part.
Is part of the problem here that we're using the same object for URI and IRI? It seems they have different rules, so maybe they shouldn't be the same…
@wsanchez URIs and IRIs are both URLs. Because mixed URLs occur in practice, it's not just the rules that overlap. I think it's quite useful and powerful to pursue this hybrid approach.
> There are also representations of mixed validity, like http://example.com/%FF/%C3%A9/ which can be rendered either as-is or as http://example.com/%FF/é.
In these cases of mixed validity, an input of `http://example.com/%FF/é` implicitly states that the URL has already been decoded, because of the presence of `é` -- therefore the `%FF` was decoded from `%25FF`. While `%FF` doesn't decode to an ASCII element, it does decode to a (likely unused and unsupported) character and would be rendered with a placeholder. (e.g. Python decodes it to `\xff`, and a browser bar shows an unsupported-character placeholder)
So I would interpret these as distinct resources:
`/%FF/%C3%A9` - unknown decoding status
`/%FF/é` - elements have been decoded from `/%25FF/%C3%A9`
`/\xff/é` - elements have been decoded from `/%FF/%C3%A9`
So, putting aside the design decision to support mixed encoding states, I think I have just committed a solution that fixes this issue (as well as #8 and countless other potential problems) while staying true to URL's unified design.
In short, I added a bunch of checking for delimiters to various parameters. Some characters simply must always be encoded in some contexts, a fact made clear by the discussion above, not to mention the `maximal` descriptor for the encoding functions. The URL constructor now checks parameters for these delimiters and raises `ValueError` accordingly.
Let me know if you have any thoughts!
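A hedged sketch of what that check looks like from the caller's side, assuming a hyperlink version that includes the new delimiter validation:

```python
# Sketch only: raw reserved delimiters in a path segment should now be rejected.
from hyperlink import URL

try:
    URL(path=(u'foo?bar',))
except ValueError as exc:
    print(exc)  # expected: a reserved-delimiter error mentioning '?'
```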
How do I add a URL parameter that contains `#`? I can't do:

```python
someURL = URL()
someURL.add(u'parameter', u'#value').to_text()
```

because it fails with:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "hyperlink/_url.py", line 1363, in add
    return self.replace(query=self.query + ((name, value),))
  File "hyperlink/_url.py", line 937, in replace
    uses_netloc=_optional(uses_netloc, self.uses_netloc)
  File "hyperlink/_url.py", line 698, in __init__
    for (k, v) in query
  File "hyperlink/_url.py", line 698, in <genexpr>
    for (k, v) in query
  File "hyperlink/_url.py", line 433, in _textcheck
    % (''.join(delims), name, value))
ValueError: one or more reserved delimiters &# present in query parameter value: u'#value'
```
But with older versions I can do:

```python
someURL.add(u'parameter', u'#value').to_text()
# u'?parameter=%23value'
```
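One hedged workaround sketch, assuming the new check only rejects the raw delimiter and leaves existing percent escapes alone (not confirmed as the intended API; see the reply below):

```python
# Sketch only: pass the value already percent-encoded so no raw '#'
# reaches the query-value check.
from hyperlink import URL

someURL = URL()
print(someURL.add(u'parameter', u'%23value').to_text())
# expected: u'?parameter=%23value'
```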
@markrwilliams See #44, #11.
I can create two URLs that render as the same text but are not equal: