python / cpython

The Python programming language
https://www.python.org

urlparse.parse_qs should take argument for query separator #64315

Open 50fe2c9d-2e9c-4082-805f-214289ced5dd opened 10 years ago

50fe2c9d-2e9c-4082-805f-214289ced5dd commented 10 years ago
BPO 20116
Nosy @orsenthil, @bitdancer, @poleto, @kctherookie, @jacobtylerwalls
Files
  • parse_querystring.py: example of output
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:
    ```python
    assignee = None
    closed_at = None
    created_at =
    labels = ['type-feature']
    title = 'urlparse.parse_qs should take argument for query separator'
    updated_at =
    user = 'https://bugs.python.org/rubenorduz'
    ```
    bugs.python.org fields:
    ```python
    activity =
    actor = 'jacobtylerwalls'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = []
    creation =
    creator = 'ruben.orduz'
    dependencies = []
    files = ['48146']
    hgrepos = []
    issue_num = 20116
    keywords = []
    message_count = 15.0
    messages = ['207237', '207241', '207242', '207243', '207244', '207261', '207262', '263491', '263798', '263842', '263843', '335768', '335782', '335801', '397208']
    nosy_count = 7.0
    nosy_names = ['orsenthil', 'r.david.murray', 'ruben.orduz', 'luiz.poleto', 'kc', 'Kobi Gana', 'jacobtylerwalls']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue20116'
    versions = ['Python 3.5']
    ```

    50fe2c9d-2e9c-4082-805f-214289ced5dd commented 10 years ago

    Currently urlparse.parse_qs (http://hg.python.org/cpython/file/2.7/Lib/urlparse.py#l150) assumes and uses ';' as a query string separator with no way to override that. Several web service APIs use ';' as a list separator (e.g. [URL]?fruits=lemon;lime&family=citrus). Although ';' seems like a sensible choice for a default, there should be a way to override it.
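
To make the report concrete, here is a minimal sketch of the splitting step, assuming the parser splits first on '&' and then on ';' as the linked 2.7 source does. The name `legacy_split_pairs` is hypothetical and is not part of the standard library.

```python
def legacy_split_pairs(qs):
    """Hypothetical helper: split on '&', then split each piece on ';'."""
    return [s2 for s1 in qs.split("&") for s2 in s1.split(";")]


# The example query from the report: ';' is meant as a list separator here.
pairs = legacy_split_pairs("fruits=lemon;lime&family=citrus")
print(pairs)  # ['fruits=lemon', 'lime', 'family=citrus']
```

Under this assumption the fragment 'lime' is cut off from 'fruits' and carries no '=', so a lenient parser would likely drop it instead of keeping 'lemon;lime' together.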

    bitdancer commented 10 years ago

    As an enhancement, this could only go into 3.5.

    50fe2c9d-2e9c-4082-805f-214289ced5dd commented 10 years ago

    So, are you suggesting I should change to a different type if desired for 2.7.x or leave for release to 3.5 and then submit a patch to backport it to 2.7.x? I apologize, not sure how the workflow works in these cases. Thanks.

    bitdancer commented 10 years ago

    I'm saying that this is a change that can be made only in 3.5. If you want to submit a patch here for 2.7 for other people to use, that's fine, but it won't get applied.

    50fe2c9d-2e9c-4082-805f-214289ced5dd commented 10 years ago

    Ah, gotcha. I think I will leave as is then. Thanks for clarifying.

    orsenthil commented 10 years ago

    If you could point to the RFC that lists the characters that can be used as valid query string separators, we can include that list. (Of course, in 3.5.)

    50fe2c9d-2e9c-4082-805f-214289ced5dd commented 10 years ago

    Senthil,

    The RFC can be found here: http://tools.ietf.org/html/rfc3986#section-2.2

    55a6ac4b-dfe7-421c-a85f-9db0c32780ac commented 8 years ago

    If this bug is to be moved forward, we should consider this:

    The RFC 3986 defines that a query can have any of these characters: /?:@-._~!$&'()*+,;= ALPHA DIGIT %HH (encoded octet)

    But it does not define how the data should be interpreted, leaving that to the naming authority and the URI scheme (although http/https doesn't specify it either; see RFC 7230).

    OTOH, parse_qs (both on 2.x and 3.x) is very specific that the query string is of type application/x-www-form-urlencoded; which defines that the name is separated from the value by '=' and name/value pairs are separated from each other by '&', although the use of ';' to separate the pairs is only suggested to be supported by HTTP server implementors.

    Adding support for the characters specified by RFC 3986 could pose a challenge, since there is no fixed schema and the naming authority can use them freely, so perhaps we could add a parameter to enable/disable ';' as a pair separator?

    orsenthil commented 8 years ago

    Luiz,

    The original question was about introducing a parameter to override the query string separator ';'.

    If we go with enable or disable, then we should also provide another option for the query string separator.

    The OP provided one example of a query string that had '&' as a separator along with ';'. I wonder how that should be parsed.

    The pointer to the RFC makes me think that it is all right to provide an option to 'override' the default separator instead of providing an enable/disable.

    I would like to hear opposite thoughts on this.

    55a6ac4b-dfe7-421c-a85f-9db0c32780ac commented 8 years ago

    Based on the example provided by the OP, it appears that he would expect the output to be: {'family': ['citrus'], 'fruits': ['lemon;lime']}

    Since the W3C recommendation for the application/x-www-form-urlencoded type specifies using '&' to separate the parameters in the query string (';' is not mentioned there), I recommended a parameter for disabling the use of ';' as a separator (but '&' will still be the separator to be used).

    The only thing I see against using the RFC is that although it specifies which characters are valid in a query string, it does not define how they should be used; that is done by W3C's application/x-www-form-urlencoded and it is very specific about using '&' as a separator.
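
The enable/disable idea above can be sketched as follows. This is only an illustration under stated assumptions: `parse_pairs` and its `semicolon_separates` flag are hypothetical names, not an existing or proposed stdlib signature, and the sketch ignores options such as strict parsing.

```python
from urllib.parse import unquote_plus


def parse_pairs(qs, semicolon_separates=True):
    """Hypothetical sketch: '&' always separates pairs; ';' only if enabled."""
    chunks = qs.split("&")
    if semicolon_separates:
        chunks = [part for chunk in chunks for part in chunk.split(";")]
    result = {}
    for chunk in chunks:
        name, sep, value = chunk.partition("=")
        if not sep:
            continue  # skip fragments that contain no '='
        result.setdefault(unquote_plus(name), []).append(unquote_plus(value))
    return result


# With ';' disabled as a separator, the list value stays intact.
print(parse_pairs("fruits=lemon;lime&family=citrus", semicolon_separates=False))
```

With the flag off, the OP's example yields the output expected above; with it on, 'lime' is split away from 'fruits'.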

    50fe2c9d-2e9c-4082-805f-214289ced5dd commented 8 years ago

    Hi all,

    OP here. My intent was to optionally pass a separator parameter, _not_ enable/disable toggle.

    f9e1f785-0c57-4bb0-86b9-4bcc5b65b872 commented 5 years ago

    Hi all,

    Please consider the following case. The URL - http://hostname.domain/mypage.asp?fields=id&query=%22((release%3D{id%3D1004});(sprint%3D{id%3D1040});(team%3D{id%3D1004});(severity%3D{id%3D%27list_node.severity.urgent%27});!phase%3D{id+IN+%27phase.defect.closed%27,%27phase.defect.duplicate%27,%27phase.defect.rejected%27})%22

    The Query as string - fields=id&query=%22((release%3D{id%3D1004});(sprint%3D{id%3D1040});(team%3D{id%3D1004});(severity%3D{id%3D%27list_node.severity.urgent%27});!phase%3D{id+IN+%27phase.defect.closed%27,%27phase.defect.duplicate%27,%27phase.defect.rejected%27})%22

    The expected pairs -

    1. fields=id
    2. query=%22((release%3D{id%3D1004});(sprint%3D{id%3D1040});(team%3D{id%3D1004});(severity%3D{id%3D%27list_node.severity.urgent%27});!phase%3D{id+IN+%27phase.defect.closed%27,%27phase.defect.duplicate%27,%27phase.defect.rejected%27})%22

    The actual output -

    1. ('fields', 'id')
    2. ('query', '"((release={id=1004})')
    3. ('(sprint={id=1040})', '')
    4. ('(team={id=1004})', '')
    5. ("(severity={id='list_node.severity.urgent'})", '')
    6. ('!phase={id IN \'phase.defect.closed\',\'phase.defect.duplicate\',\'phase.defect.rejected\'})"', '')
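
For comparison, here is a hedged sketch of the same query parsed with '&' as the only pair separator. It assumes an interpreter whose `urllib.parse.parse_qsl` accepts the `separator` keyword, which was added later than the version discussed in this comment.

```python
from urllib.parse import parse_qsl

# The query string from the case above, split across literals for readability.
query = (
    "fields=id&query=%22((release%3D{id%3D1004});(sprint%3D{id%3D1040});"
    "(team%3D{id%3D1004});(severity%3D{id%3D%27list_node.severity.urgent%27});"
    "!phase%3D{id+IN+%27phase.defect.closed%27,%27phase.defect.duplicate%27,"
    "%27phase.defect.rejected%27})%22"
)

# Split pairs on '&' only, so the ';' characters stay inside the 'query' value.
pairs = parse_qsl(query, separator="&")
for name, value in pairs:
    print(name, "->", value)
```

Under that assumption the result contains the two expected pairs, 'fields' and 'query', with the semicolons kept inside the second value.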

    43bf79e6-cabd-4588-9fd2-6d0d4ed157af commented 5 years ago

    W3C allows both constructs, ampersand and semicolon. https://www.w3.org/TR/html401/appendix/notes.html#h-B.2.2

    Especially servlet containers and servers running CGI programs often use semicolons as a separator.

    I would say to parse either ampersands OR semicolons and give priority to ampersands.

    For example the query strings:

    ?fields=id&query=%22((release%3D{id%3D1004});(sprint%3D{id%3D1040});(team%3D{id%3D1004});(severity%3D{id%3D%27list_node.severity.urgent%27});!phase%3D{id+IN+%27phase.defect.closed%27,%27phase.defect.duplicate%27,%27phase.defect.rejected%27})%22

    ?fruits=lemon;lime&family=citrus

    should be parsed with & separators only.

    The modified example without & character: ?fruits=lemon;family=citrus

    can be parsed with semicolon as a separator because it contains both '=' and ';' but no '&' characters.
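
The priority rule described in this comment could look roughly like the sketch below. `pick_separator` is a hypothetical helper written for illustration only; it is not what the standard library does.

```python
def pick_separator(qs):
    """Hypothetical heuristic: prefer '&'; use ';' only when no '&' is present."""
    if "&" in qs:
        return "&"
    if ";" in qs and "=" in qs:
        return ";"
    return "&"


for qs in ("fruits=lemon;lime&family=citrus", "fruits=lemon;family=citrus"):
    sep = pick_separator(qs)
    print(repr(sep), qs.split(sep))
```

A heuristic like this guesses the separator from the data, so a single value that happens to contain ';' and no '&' would still be split; an explicit parameter avoids that ambiguity.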

    f9e1f785-0c57-4bb0-86b9-4bcc5b65b872 commented 5 years ago

    We are on the same page, and we should also consider marking this as a defect.

    Thanks

    b982ca5e-273f-4a9d-b6df-60abec2375c9 commented 3 years ago

    Greetings. I believe this is mooted by bpo-42967 as well as changes even prior to that.

    https://bugs.python.org/issue42967
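
As a brief illustration of why the request is considered moot, the sketch below assumes a Python version that includes the bpo-42967 change, where `urllib.parse.parse_qs` takes a `separator` keyword that defaults to '&'.

```python
from urllib.parse import parse_qs

# Default separator '&': the ';' stays inside the 'fruits' value.
default_result = parse_qs("fruits=lemon;lime&family=citrus")
print(default_result)

# Explicit separator ';' for services that delimit pairs with semicolons.
semicolon_result = parse_qs("fruits=lemon;family=citrus", separator=";")
print(semicolon_result)
```

Only one separator is used per call, so callers choose '&' or ';' explicitly instead of having both applied at once.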