python / cpython

The Python programming language
https://www.python.org
Other
62.38k stars 29.96k forks source link

urljoining an empty query string doesn't clear query string #76960

Open c29b6996-d572-4c17-906e-e392d8cb0cd9 opened 6 years ago

c29b6996-d572-4c17-906e-e392d8cb0cd9 commented 6 years ago
BPO 32779
Nosy @orsenthil, @asvetlov, @thetorpedodog, @iritkatriel
PRs
  • python/cpython#5645
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['type-bug', 'library', '3.9', '3.10', '3.11'] title = "urljoining an empty query string doesn't clear query string" updated_at = user = 'https://github.com/thetorpedodog' ``` bugs.python.org fields: ```python activity = actor = 'pfish' assignee = 'none' closed = False closed_date = None closer = None components = ['Library (Lib)'] creation = creator = 'pfish' dependencies = [] files = [] hgrepos = [] issue_num = 32779 keywords = ['patch'] message_count = 7.0 messages = ['311704', '311937', '312201', '312223', '394648', '394649', '394664'] nosy_count = 4.0 nosy_names = ['orsenthil', 'asvetlov', 'pfish', 'iritkatriel'] pr_nums = ['5645'] priority = 'normal' resolution = None stage = 'patch review' status = 'open' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue32779' versions = ['Python 3.9', 'Python 3.10', 'Python 3.11'] ```

    Linked PRs

    c29b6996-d572-4c17-906e-e392d8cb0cd9 commented 6 years ago

    urljoining with '?' will not clear a query string:

    ACTUAL:
    >>> import urllib.parse
    >>> urllib.parse.urljoin('http://a/b/c?d=e', '?')
    'http://a/b/c?d=e'

    EXPECTED: 'http://a/b/c' (optionally, with a ? at the end)

    WhatWG's URL standard expects a relative URL consisting of only a ? to replace a query string:

    https://url.spec.whatwg.org/#relative-state

    Seen in versions 3.6 and 2.7, but probably also affects later versions.

    c29b6996-d572-4c17-906e-e392d8cb0cd9 commented 6 years ago

    I'm working on a patch for this and can have one up in the next week or so, once I get the CLA signed and other boxes ticked. I'm new to the Github process but hopefully it will be a good start for the discussion.

    asvetlov commented 6 years ago

    Python follows not WhatWG but RFC. https://tools.ietf.org/html/rfc3986#section-5.2.2 is proper definition for url joining algorithm.

    c29b6996-d572-4c17-906e-e392d8cb0cd9 commented 6 years ago

    In this case, the RFC is mismatched from the actual behaviour of browsers (as described and codified by WhatWG). It was surprising to me that urljoin() didn't do what I percieved as "the right thing" (and I expect other users would too).

    I would personally expect urljoin to do "the thing that everybody else does". Is there a sensible way to reduce this mismatch?

    For reference, Java's stdlib does what I would expect here:

    URI base = URI.create("https://example.com/?a=b");
    URI rel = base.resolve("?");
    System.out.println(rel);

    https://example.com/?

    iritkatriel commented 3 years ago

    The relevant part in the RFC pseudo code is

               if defined(R.query) then
                  T.query = R.query;
               else
                  T.query = Base.query;
               endif;

    which is implemented in urljoin as:

            if not query:
                query = bquery

    Is this correct? Should the code not say "if query is not None"? (I can't see in the RFC a definition of defined()).

    iritkatriel commented 3 years ago

    Sorry, urlparse returns '' rather than None when there is no query. So we indeed need to check something like if '?' not in url: or what's in Paul's patch.

    However, my main point was to question whether fixing this is actually in contradiction with the RFC.

    c29b6996-d572-4c17-906e-e392d8cb0cd9 commented 3 years ago

    Reading more into this, from section 5.2,1:

    A component is undefined if its associated delimiter does not appear in the URI reference

    So you could say that since there is a '?', the query component is *defined, but *empty. This would mean that assigning the target query to be '' has the desired effect as implemented by browsers and other languages' standard libraries.

    orsenthil commented 1 year ago

    Yes, this a bug with urljoin in Python.

    I compared the behavior against Ruby, and Golang.

    require 'uri'
    
    base_url = 'https://www.example.com/?a=b'
    relative_url = '?'
    
    url = URI.join(base_url, relative_url).to_s
    
    puts url

    Output:

    https://www.example.com/?

    And go's https://pkg.go.dev/net/url@go1.19beta1#URL.ResolveReference, ResolveReference resolves a URI reference to an absolute URI from an absolute base URI u, per RFC 3986 Section 5.2.

    package main
    
    import (
        "fmt"
        "net/url"
    )
    
    func main() {
        base, _ := url.Parse("https://example.com/?a=b")
        u, _ := url.Parse("?")
        fmt.Println(base.ResolveReference(u))
    }

    https://example.com/?

    In python

    Python 3.12.0a7+ (heads/main:4898415df7, Apr 20 2023, 16:46:44) [GCC 11.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import urllib.parse
    >>> urllib.parse.urljoin("http://www.example.com/?a/b","?")
    'http://www.example.com/?a/b'

    We can treat this as bug and fix it. However, there are very high chances that people are relying on the current no-op behavior, and this change in behavior can cause frameworks to break. This sounds a bit costly to me.

    serhiy-storchaka commented 3 weeks ago

    This is not only about "?". For example, "?#z" also should set an empty query. There may be also issues with empty authority ("//"), but this is less clear.

    Currently urllib.parse does not distinguish between undefined (no delimiter) and empty (there is a delimiter) components, both are represented by empty strings (see #99962). To make the behavior RFC3986 compatible we need a new parser. So I refactored the code in #123273, made internal helper functions more RFC3986 compatible. Public functions are now wrappers around them. Later I will add options to make public parsing/unparsing API also optionally RFC3986 compatible.

    It seems that there is an error in RFC3986. According to its algorithm, an the relative reference should always override fragment. And since an empty string does not have fragment, it urljoin() with empty string should remove fragment from the base URI. But older RFCs explicitly said that the original base URI should be returned. RFC3986 does not contain such statement explicitly and there is no such case in examples.

    As for "http:?" etc, RFC3986 allows to ignore the scheme if it is the same as the scheme in the base URI for compatibility. Current urllib.parse parser is not strict according to RFC3986. We may add an option for strict mode, but this is a different issue.

    orsenthil commented 2 weeks ago

    Currently urllib.parse does not distinguish between undefined (no delimiter) and empty (there is a delimiter) components, both are represented by empty strings (see https://github.com/python/cpython/issues/99962). To make the behavior RFC3986 compatible we need a new parser. So I refactored the code in https://github.com/python/cpython/pull/123273, made internal helper functions more RFC3986 compatible.

    Thanks for explaining this clearly and taking this approach, @serhiy-storchaka .

    My one feedback is, we make this change early in the cycle of release so that we give ample time for many tools that rely on this underlying parser to detect the behavior change (if they relied on a previous one).

    serhiy-storchaka commented 2 weeks ago

    Yet one bug (and perhaps more serious one) -- removing trailing semicolon from the path.

    >>> urljoin('http:/a/b/c/d;#f', '#z')
    'http:///a/b/c/d#z'
    >>> urldefrag('http:/a/b/c/d;#f')
    DefragResult(url='http:///a/b/c/d', fragment='f')

    RFC3986 does not define "params". The URI consists of 5 components: scheme, authority, path, query, fragment. '/b/c/d;' is the path, and the urllib functions mutilate it.

    This is why I think that it is worth to backport the solution.

    orsenthil commented 2 weeks ago

    RFC3986 does not define "params".

    True. And when I dug a bit into this topic yesterday. Using params is highly discouraged now, and I think, very old Sun Servers and FTPs used a params for storing session state

    The backport is fine. But it carries the risk of breaking some applications. Calling it out and early backport is alright IMO.