python-hyper / hyperlink

🔗 Immutable, Pythonic, correct URLs.
https://hyperlink.readthedocs.io/
Other
286 stars 41 forks source link

Two URLs that render as the same text but are not equal #6

Closed wsanchez closed 7 years ago

wsanchez commented 7 years ago

I can create two URLs that render as the same text but are not equal:

>>> from hyperlink import URL
>>> a = URL(path=(u"?",))
>>> b = URL(path=(u"%3F",))
>>> a.asText()
u'%3F'
>>> b.asText()
u'%3F'
>>> a.asText() == b.asText()
True
>>> a == b
False
wsanchez commented 7 years ago

It seems as if b is wrong here, and the % character should have been encoded.

That is, b.asText() should return '%253F', no?

mahmoud commented 7 years ago

It certainly feels like you're onto something. I can kind of explain why it does what it does, but I might want to wait til I'm in a less jet lagged state. I believe this behavior is inherited from t.p.URL, so @glyph may also be a good explainer in the meantime.

Here's a go anyways: the constructor, working from parts, is meant to work equally well with URIs and IRIs. Virtually no decoding happens in the constructor. When converting to text, minimal encoding is used. Percents are certainly valid in the path, and are left alone. A question mark is reserved for the query string, and is encoded. In short (and crudely), one is a URI and one is an IRI, according to this simplified explanation in the docs.

I hope I'm right and that makes sense, but am certainly open to correction. It's definitely good to explore these APIs and maybe add some better docs/FAQs, so thank you!

glyph commented 7 years ago

Basically the idea here is that originally Twisted had a notion of the "real" value and the "encoded" value of a URL path segment; the "real" value had all the %xx values decoded and the "encoded" version was escaped.

Unfortunately this is overly simplistic, because of course a human being could type an "encoded" URL into a browser bar, and then that is the text that they wanted to see. Such values can be "down"-converted into ASCII-only machine-readable things that can be sliced up to go onto the wire ("URI") or "up"-converted into textual human-readable things that might be in a language other than English, and/or include emojis 🐮.

glyph commented 7 years ago

The problem with characters like # and ? is that, while it is possible to "up"-convert a %3F into a literal question mark — for example, if you wanted to encode something a user typed into a URL segment and then get it back out again in application code — it is not possible to reflect this distinction in the stringified representation of the URL itself.

glyph commented 7 years ago

So the current behavior is an accident of the implementation, but without some further thought, also kind of inevitable? I would be interested in how we might preserve both behaviors.

jvanasco commented 7 years ago

I see two issues here:

  1. handling ?/#
  2. handling percent-encodable path elements

The behavior in the example and noted by @glyph above looks like a bug to me, and should raise an exception if ? or # are passed in -- they are not valid path elements. None of the RFC gen-delims are valid, and should cause an exception.

I recently ran into a similar issue in w3lib's url canonicalizer (formerly scrapy's). their existing function standardizes a %23 to a #, which turns the rest of the path AND the query string into a fragment -- substantially changing the URL.

There is room for debate on how to handle the various sub-delims that can either appear in the path or be percent encoded -- but ? and # are not valid characters for the path in a URI or IRI. It's worth reading the RFC's section on path about 10 times in a row, because you miss something every other read. (at least i do)

A use case for the sub-delims is here:

 a = URL(path=(u",",))
 b = URL(path=(u"%2C",))

But what happens about this URL?

 c = URL(path=(u"a,%2C,z",))

You can canonicalize it for fingerprinting as:

 a%2C%2C%2Cz

But that has a fundamentally different meaning to a server than:

 a,%2C,z

And for fun, how should this be handled?

 a = URL(path=(u"%",))
 b = URL(path=(u"%25",))

IMHO, the safest/smartest bit is to require all input to be percent-encoded FOR THE SUB-DELIMS. You can keep a display version of the 'unescaped', but the only way to ensure you know what a URL intends is to be explicit on the construction.

edit: added "for the sub-delims" in the last paragraph.

mahmoud commented 7 years ago

Hey Jonathan! These are some good points and I see where you're coming from, but I recommend reviewing this bit of hyperlink design.

To summarize here, Hyperlink is meant to represent both escaped and unescaped versions. Interpretation of the constituent parts is based on usage. If you call .to_uri(), you're telling hyperlink to interpret the URL as a display-oriented IRI. Same goes for .to_iri() and URIs.

This contrasts with boltons' urlutils which fully decodes inputs, an approach which is less powerful because it doesn't allow for mixed representations (which are quite common in practice).

There are bound to be some corner cases and I'm glad we're working through them. A lot of these scenarios don't seem to have right answers, but I'm still confident with a bit more thought we will arrive at some pleasantly intuitive behaviors :)

jvanasco commented 7 years ago

I understand your points. The points I'm trying to convey are not coming across and a bit nuanced, so I'll try again (if you'll put up with me). I've had my head buried in these RFCs and edge-cases for the past month, so I've got a few of these concepts dancing in my head and keeping the right words from forming.

  1. Assuming the "mixed use' concept, # and ? should never be valid within the path argument as they could never appear unencoded in the path -- decoding them would inherently change the URI.

I'm not talking about creating a url object from a full url string, but specifically using the keyword 'path' argument to the constructor (as in the example above). The 'human'/browser-bar representation will always have these two elements encoded as part of the path, because they would signify other components to the URL structure if decoded and necessarily point to a resource that is different.

For example, consider this URI

http://example.com/path/to/foo/%23/bar/biz?bang=foo&bash=widget#topbar

The path must always the following, and can never be decoded in a browser bar:

 example.com/path/to/foo/%23/bar/biz

Unescaping the %23 will turn everything after the decoded # into a URI fragment:

example.com/path/to/foo/#/bar/biz?bang=foo&bash=widget#topbar

The same holds true for ?. If decoded within a path, it signifies the query string. If either of those elements were encountered as "input" to the path argument, you'd always display the encoded version. If someone were to submit either decoded character as part of the path, they are most likely dealing with a bug or using the library wrong.

  1. To clarify something that did not come across well above (and I just edited to note it with emphasis) the part I think would be best as explicitly encoded (or unencoded, but still explicit) are the RFC sub-delim characters.

Using the examples in the docs, these two URLs will essentially always mean the same thing - so you can easily change context between them without concern. It doesn't matter how this is represented.

http://example.com/café'
http://example.com/caf%C3%A9'

However, the RFC sub-delims and the percent sign are not guaranteed to be the same...

!$&'()*+,;= %

I don't have an answer or a proposal for those characters. That's where the bulk of edge cases and exploration will be -- as there is a lot of mixed usage in there. IMHO, the input should be explicitly marked as encoded or not -- because they can shift the 'meaning' and intent of the URL.

It looks like the library is doing the right thing by those characters and leaving them as-is / not recoding them.

mahmoud commented 7 years ago

I see what you mean, and yes, enhancing the constructor behavior is seeming more and more like the way to go forward. I mentioned that over here, and I'm glad someone else with their head in the RFCs is kind of independently feeling that angle, too.

wsanchez commented 7 years ago

I'm still bothered by the fact that the path argument will accept both encoded and unencoded input. It seems wrong to accept already encoded inputs here.

I think @glyph is saying that it's desirable for str(URL.fromText(X)) to always give you back X and not a normalized form of X. I think that's nuts, because (a) as @glyph said, it's not possible and (b) this bug is the result of trying to do the impossible.

If you want the original string, save it somewhere; this isn't a string object.

wsanchez commented 7 years ago

I guess what I mean is that if URL.fromText needs to behave like a browser would, the associated weirdness should live there, not in __init__.

glyph commented 7 years ago

I'm still bothered by the fact that the path argument will accept both encoded and unencoded input. It seems wrong to accept already encoded inputs here.

The problem is that "encoded" and "unencoded" are ambiguous terms here. http://example.com/%FF is a perfectly valid URI, with a perfectly valid textual representation. However, it is simultaneously an un-decodable IRI which has no valid transformation and also a perfectly valid IRI that is equivalent to the URI.

There are also representations of mixed validity, like http://example.com/%FF/%C3%A9/ which can be rendered either as-is or as http://example.com/%FF/é.

glyph commented 7 years ago

You can also have validation failures going in the other direction, with un-representable / invalid IDNA names in the domain part.

wsanchez commented 7 years ago

Is part of the problem here that we're using the same object for URI and IRI? Seem they have different rules, so maybe shouldn't be the same…

mahmoud commented 7 years ago

@wsanchez URIs and IRIs are both URLs. Because mixed URLs occur in practice, it's not just the rules that overlap. I think it's quite useful and powerful to pursue this hybrid approach.

jvanasco commented 7 years ago

There are also representations of mixed validity, like http://example.com/%FF/%C3%A9/ which can be rendered either as-is or as http://example.com/%FF/é.

In these cases of mixed-validity, an input of http://example.com/%FF/é implicitly states the URL has already been decoded because of the presence of é -- therefore the %FF was decoded from %25FF. While %FF doesn't decode to an ASCII element, it does decode to a (likely-unused and unsupported) character and would be rendered with a placeholder. (e.g. Python decodes it to \xff, browser bar shows unsupported character)

So I would interpret these as distinct resources:

`/%FF/%C3%A9` - unknown decoding status
`/%FF/é` - elements have been decoded from `/%25FF/%C3%A9`
`/\xff/é` - elements have been decoded from `/%FF/%C3%A9`
mahmoud commented 7 years ago

So, putting aside the design decision to support mixed encoding states, I think I have just committed a solution that fixes this issue (as well as #8 and countless other potential problems) while staying true to URL's unified design.

In short, I add a bunch of checking for delimiters to various parameters. Some characters simply must always be encoded in some contexts, a fact that is made clear from the discussion above, not to mention the maximal descriptor for the encoding functions. The URL constructor now checks parameters for these delimiters and raises ValueError accordingly.

Let me know if you have any thoughts!

markrwilliams commented 7 years ago

How do I add a URL parameter that contains #? I can't do:

someURL = URL()
someURL.add(u'parameter', u'#value').to_text()

because it fails with:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "hyperlink/_url.py", line 1363, in add
    return self.replace(query=self.query + ((name, value),))
  File "hyperlink/_url.py", line 937, in replace
    uses_netloc=_optional(uses_netloc, self.uses_netloc)
  File "hyperlink/_url.py", line 698, in __init__
    for (k, v) in query
  File "hyperlink/_url.py", line 698, in <genexpr>
    for (k, v) in query
  File "hyperlink/_url.py", line 433, in _textcheck
    % (''.join(delims), name, value))
ValueError: one or more reserved delimiters &# present in query parameter value: u'#value'

But with older versions I can do:

someURL.add(u'parameter', u'#value').to_text()
# u'?parameter=%23value'
glyph commented 6 years ago

@markrwilliams See #44, #11 .