Open c29b6996-d572-4c17-906e-e392d8cb0cd9 opened 6 years ago
urljoining with '?' will not clear a query string:
ACTUAL:
>>> import urllib.parse
>>> urllib.parse.urljoin('http://a/b/c?d=e', '?')
'http://a/b/c?d=e'
EXPECTED: 'http://a/b/c' (optionally, with a ? at the end)
WhatWG's URL standard expects a relative URL consisting of only a ? to replace a query string:
https://url.spec.whatwg.org/#relative-state
Seen in versions 3.6 and 2.7, but probably also affects later versions.
I'm working on a patch for this and can have one up in the next week or so, once I get the CLA signed and other boxes ticked. I'm new to the Github process but hopefully it will be a good start for the discussion.
Python follows not WhatWG but RFC. https://tools.ietf.org/html/rfc3986#section-5.2.2 is proper definition for url joining algorithm.
In this case, the RFC is mismatched from the actual behaviour of browsers (as described and codified by WhatWG). It was surprising to me that urljoin() didn't do what I percieved as "the right thing" (and I expect other users would too).
I would personally expect urljoin to do "the thing that everybody else does". Is there a sensible way to reduce this mismatch?
For reference, Java's stdlib does what I would expect here:
URI base = URI.create("https://example.com/?a=b");
URI rel = base.resolve("?");
System.out.println(rel);
The relevant part in the RFC pseudo code is
if defined(R.query) then
T.query = R.query;
else
T.query = Base.query;
endif;
which is implemented in urljoin as:
if not query:
query = bquery
Is this correct? Should the code not say "if query is not None"? (I can't see in the RFC a definition of defined()).
Sorry, urlparse returns '' rather than None when there is no query. So we indeed need to check something like if '?' not in url: or what's in Paul's patch.
However, my main point was to question whether fixing this is actually in contradiction with the RFC.
Reading more into this, from section 5.2,1:
A component is undefined if its associated delimiter does not appear in the URI reference
So you could say that since there is a '?', the query component is *defined, but *empty. This would mean that assigning the target query to be '' has the desired effect as implemented by browsers and other languages' standard libraries.
Yes, this a bug with urljoin in Python.
I compared the behavior against Ruby, and Golang.
require 'uri'
base_url = 'https://www.example.com/?a=b'
relative_url = '?'
url = URI.join(base_url, relative_url).to_s
puts url
Output:
And go's https://pkg.go.dev/net/url@go1.19beta1#URL.ResolveReference, ResolveReference resolves a URI reference to an absolute URI from an absolute base URI u, per RFC 3986 Section 5.2.
package main
import (
"fmt"
"net/url"
)
func main() {
base, _ := url.Parse("https://example.com/?a=b")
u, _ := url.Parse("?")
fmt.Println(base.ResolveReference(u))
}
In python
Python 3.12.0a7+ (heads/main:4898415df7, Apr 20 2023, 16:46:44) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.parse
>>> urllib.parse.urljoin("http://www.example.com/?a/b","?")
'http://www.example.com/?a/b'
We can treat this as bug and fix it. However, there are very high chances that people are relying on the current no-op behavior, and this change in behavior can cause frameworks to break. This sounds a bit costly to me.
This is not only about "?". For example, "?#z" also should set an empty query. There may be also issues with empty authority ("//"), but this is less clear.
Currently urllib.parse
does not distinguish between undefined (no delimiter) and empty (there is a delimiter) components, both are represented by empty strings (see #99962). To make the behavior RFC3986 compatible we need a new parser. So I refactored the code in #123273, made internal helper functions more RFC3986 compatible. Public functions are now wrappers around them. Later I will add options to make public parsing/unparsing API also optionally RFC3986 compatible.
It seems that there is an error in RFC3986. According to its algorithm, an the relative reference should always override fragment. And since an empty string does not have fragment, it urljoin() with empty string should remove fragment from the base URI. But older RFCs explicitly said that the original base URI should be returned. RFC3986 does not contain such statement explicitly and there is no such case in examples.
As for "http:?" etc, RFC3986 allows to ignore the scheme if it is the same as the scheme in the base URI for compatibility. Current urllib.parse
parser is not strict according to RFC3986. We may add an option for strict mode, but this is a different issue.
Currently urllib.parse does not distinguish between undefined (no delimiter) and empty (there is a delimiter) components, both are represented by empty strings (see https://github.com/python/cpython/issues/99962). To make the behavior RFC3986 compatible we need a new parser. So I refactored the code in https://github.com/python/cpython/pull/123273, made internal helper functions more RFC3986 compatible.
Thanks for explaining this clearly and taking this approach, @serhiy-storchaka .
My one feedback is, we make this change early in the cycle of release so that we give ample time for many tools that rely on this underlying parser to detect the behavior change (if they relied on a previous one).
Yet one bug (and perhaps more serious one) -- removing trailing semicolon from the path.
>>> urljoin('http:/a/b/c/d;#f', '#z')
'http:///a/b/c/d#z'
>>> urldefrag('http:/a/b/c/d;#f')
DefragResult(url='http:///a/b/c/d', fragment='f')
RFC3986 does not define "params". The URI consists of 5 components: scheme, authority, path, query, fragment. '/b/c/d;'
is the path, and the urllib
functions mutilate it.
This is why I think that it is worth to backport the solution.
RFC3986 does not define "params".
True. And when I dug a bit into this topic yesterday. Using params is highly discouraged now, and I think, very old Sun Servers and FTPs used a params for storing session state
The backport is fine. But it carries the risk of breaking some applications. Calling it out and early backport is alright IMO.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields: ```python assignee = None closed_at = None created_at =
labels = ['type-bug', 'library', '3.9', '3.10', '3.11']
title = "urljoining an empty query string doesn't clear query string"
updated_at =
user = 'https://github.com/thetorpedodog'
```
bugs.python.org fields:
```python
activity =
actor = 'pfish'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'pfish'
dependencies = []
files = []
hgrepos = []
issue_num = 32779
keywords = ['patch']
message_count = 7.0
messages = ['311704', '311937', '312201', '312223', '394648', '394649', '394664']
nosy_count = 4.0
nosy_names = ['orsenthil', 'asvetlov', 'pfish', 'iritkatriel']
pr_nums = ['5645']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue32779'
versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']
```
Linked PRs