scrapy / w3lib

Python library of web-related functions
BSD 3-Clause "New" or "Revised" License
385 stars 104 forks source link

add_or_replace_parameter lacks control of what is encoded #106

Open rennerocha opened 6 years ago

rennerocha commented 6 years ago

add_or_replace_parameter uses urlencode function to convert a string to percent-encoded ASCII string: https://github.com/scrapy/w3lib/blob/master/w3lib/url.py#L229

If the value of the parameter is already percent-encoded (in my scenario, this value came from a JSON response of an API), it is not possible to specify safe characters that should not be encoded. This argument is available in urlencode signature: https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlencode

My suggestion is to include a safe argument in add_or_replace_parameter to improve the control over the encoding of your values: Actual: w3lib.url.add_or_replace_parameter (url, name, new_value) New: w3lib.url.add_or_replace_parameter(url, name, new_value, safe='')

In [1]: from urllib.parse import urlencode
   ...: from w3lib.url import add_or_replace_parameter
   ...: # This came from an API response already encoded
   ...: full_hierarchy = 'Appliances_Air+Purifiers+%26+Dehumidifiers_Air+Purifiers'
   ...: url = 'http://example.com'

In [2]: urlencode({'hierarchy': full_hierarchy})
Out[2]: 'hierarchy=Appliances_Air%2BPurifiers%2B%2526%2BDehumidifiers_Air%2BPurifiers'

In [3]: urlencode({'hierarchy': full_hierarchy}, safe='+%')
Out[3]: 'hierarchy=Appliances_Air+Purifiers+%26+Dehumidifiers_Air+Purifiers'

In [4]: add_or_replace_parameter(url, 'hierarchy', full_hierarchy)
Out[4]: 'http://example.com?hierarchy=Appliances_Air%2BPurifiers%2B%2526%2BDehumidifiers_Air%2BPurifiers'
kmike commented 6 years ago

Hey @rennerocha!

Do you have any other use cases for overriding safe characters?

It looks like in your case a workaround could be to unescape parameters before passing them to add_or_replace_parameter, as this function works with raw, unescaped parameter values - these are values server gets after decoding.

Allowing to override safe characters can be seen as a performance optimization, a hack to avoid unescaping and re-escaping, unless you see other use cases for it. While it may solve a problem, I find the resulting API unintuitive. E.g. does safe parameter affect only parameters, or does it affect encoding of the original URL as well? What should be its value when you want to pass already-escaped parameters, or just use something as-is?