mitechie opened this issue 12 years ago
Technically that URL is invalid because URLs are not supposed to contain spaces. Browsers and other HTTP clients tend to "helpfully" rewrite the invalid URL, although I'm not sure there are any rules about the right way to do it (it's not as simple as urllib.quote, since you don't want to encode the ampersands and other characters that would normally be percent-encoded).
Yea, I was trying to find some way to break it apart and manually escape it, but when it worked in urllib I wondered if maybe this is something that could be picked up and ported or something.
Under the covers, urllib.urlopen() calls urllib.quote() with a safe character set of "%/:=&?~#+!$,;'@()*[]|". Is it worth adding that "helpfulness" to Tornado's httpclient? If that type of magic is better handled outside Tornado, perhaps this issue can be closed.
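For reference, that rewrite can be reproduced in a few lines of Python 3 (urllib.urlopen is Python 2; this is a sketch of the same behavior using urllib.parse.quote, with an example URL of my own):

```python
from urllib.parse import quote

# The safe set Python 2's urllib.urlopen() passes to urllib.quote(),
# as quoted above. Note '%' is included, so existing percent-escapes
# are not double-encoded.
SAFE = "%/:=&?~#+!$,;'@()*[]|"

url = "https://example.com/path?q=a b&x=1"
print(quote(url, safe=SAFE))
# -> https://example.com/path?q=a%20b&x=1
# The space becomes %20 while '&', '=', '?' etc. are left alone.
```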
Is there any documentation of what browsers do (or should do) here? (maybe in html5?) If there's a standard to follow then I'm OK with adding that to Tornado, but I'd rather not add a bit of copy/pasted magic that may or may not be the same as what's used elsewhere.
(coming from a Java background, so 'scuse me if I say anything too silly here)
I think at least in Java most libraries try to follow the standards, but also open some exceptions to be more lenient on what users type, as well as sometimes platform/browser specific hacks.
Maybe related to Jon Postel's "be liberal in what you accept, and conservative in what you send"... in this case it appears Tornado's HTTP client is in a tough position of being at both ends: receiving URLs from users that may be broken, while having to send them to the server...
Maybe it could be solved by providing documentation/examples to users on how to escape their URLs prior to calling Tornado?
This updated example gives me similar error output, just with Python 3 and minor changes to print/import statements.
import urllib.request

from tornado import httpclient

url = "https://blogs.msdn.com/b/jmeier/archive/2012/05/13/the-rapid-research-method.aspx?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed: jmeier (J.D. Meier's Blog)&Redirected=true"

fh = urllib.request.urlopen(url)
# This will load up the content just peachy...
content = fh.read()

# This will get me a 400 bad request response.
http = httpclient.HTTPClient()
try:
    response = http.fetch(url)  # type: httpclient.HTTPResponse
    print("Content should be in here.")
except Exception as e:
    print("but it goes BOOM!", e)
But this one, with urllib.parse.quote, works:
import urllib.request
import urllib.parse

from tornado import httpclient

url = "https://blogs.msdn.com/b/jmeier/archive/2012/05/13/the-rapid-research-method.aspx?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed: jmeier (J.D. Meier's Blog)&Redirected=true"

safe_with_percent = "!#$&'()*+,/:;=?@[]~"
url_clean = urllib.parse.quote(url, safe_with_percent)

fh = urllib.request.urlopen(url_clean)
# This will load up the content just peachy...
content = fh.read()

# This no longer gets a 400 bad request response.
http = httpclient.HTTPClient()
try:
    response = http.fetch(url_clean)  # type: httpclient.HTTPResponse
    print("Content should be in here.")
except Exception as e:
    print("but it goes BOOM!", e)
The resulting value of url_clean for me is 'https://blogs.msdn.com/b/jmeier/archive/2012/05/13/the-rapid-research-method.aspx?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed:%20jmeier%20(J.D.%20Meier\'s%20Blog)&Redirected=true', which works with urllib, with requests, and with my browser (redirect).
The escape value in the working code is partially copied (there's another fallback value) from the requests library, specifically from requests.utils.requote_uri, which calls urllib.parse.quote.
Alternatively, it might be possible to incorporate an escape value like that one from the requests library into Tornado. It looks to me like the HTTPClients already have a dependency on urllib; they're just not calling the quote function before using the URL, I think.
Just adding a call to urllib.parse.quote would break things if the input URL were already correctly encoded. It's very important that being liberal in accepting not-quite-correct input doesn't lead to handling correct input incorrectly. I think this probably means copying the whole requote_uri method instead of just its use of quote().
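To make that concrete, here is a minimal sketch of the "requote" idea, modeled loosely on requests' requote_uri / unquote_unreserved pair (not copied from it): first collapse percent-escapes of unreserved characters, then re-quote with '%' in the safe set, so already-valid escapes survive and the function is idempotent.

```python
import re
from urllib.parse import quote

# RFC 3986 unreserved characters: safe to store unescaped.
UNRESERVED = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                 "abcdefghijklmnopqrstuvwxyz"
                 "0123456789-._~")

def unquote_unreserved(uri):
    """Replace %XX escapes of unreserved characters with the character
    itself; leave all other (and malformed) escapes untouched."""
    parts = uri.split('%')
    out = [parts[0]]
    for part in parts[1:]:
        if re.match('^[0-9A-Fa-f]{2}', part):
            ch = chr(int(part[:2], 16))
            if ch in UNRESERVED:
                out.append(ch + part[2:])
            else:
                out.append('%' + part)
        else:
            out.append('%' + part)
    return ''.join(out)

def requote(uri):
    # '%' is in the safe set, so valid existing escapes pass through
    # unchanged; an already-encoded URL round-trips identically.
    return quote(unquote_unreserved(uri), safe="!#$%&'()*+,/:;=?@[]~")
```

With this, requote("https://x/a b") yields "https://x/a%20b", and applying requote to that result again leaves it unchanged, which is exactly the property a plain quote() call lacks.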
My only hesitation is about copy/pasting strings of magic without authoritative references. In the years since this issue was opened, the WHATWG has published a standard for URLs-in-the-real-world (https://url.spec.whatwg.org/) which is what we should be following here.
Good point on the danger of simply calling urllib.parse.quote twice.
I think this probably means copying the whole requote_uri method instead of just its use of quote().
That would be a small portion of code copied, I think that'd be OK...
My only hesitation is about copy/pasting strings of magic without authoritative references. In the years since this issue was opened, the WHATWG has published a standard for URLs-in-the-real-world (https://url.spec.whatwg.org/) which is what we should be following here.
And here's where I got stuck. The WHATWG spec seems quite broad, and it's also a bit hard to find exactly which characters to use in that ported requote_uri function.
There's a library, whatwg-url, that correctly parses the URL from this issue and appears to be based on the WHATWG standard, though I hadn't heard of it until searching online.
>>> import whatwg_url
>>> u = "https://blogs.msdn.com/b/jmeier/archive/2012/05/13/the-rapid-research-method.aspx?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed: jmeier (J.D. Meier's Blog)&Redirected=true"
>>> parseresult = whatwg_url.urlparse(u)
>>> parseresult
ParseResult(scheme='https', netloc='blogs.msdn.com', path='/b/jmeier/archive/2012/05/13/the-rapid-research-method.aspx', params='', query='utm_source=feedburner&utm_medium=feed&utm_campaign=Feed:%20jmeier%20(J.D.%20Meier%27s%20Blog)&Redirected=true', fragment='')
>>> parseresult.geturl()
'https://blogs.msdn.com/b/jmeier/archive/2012/05/13/the-rapid-research-method.aspx?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed:%20jmeier%20(J.D.%20Meier%27s%20Blog)&Redirected=true'
I guess we probably don't want to add a library just for that... but it's also not easy (for me at least) to find which characters are considered safe (with or without percent) for the requote_uri function :confused:
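For what it's worth, my reading of the WHATWG URL spec is that the query component has its own percent-encode set: C0 controls, space, '"', '#', '<', '>' (plus "'" for special schemes like https), and non-ASCII characters; everything else in the query is left alone. A small sketch based on that reading (the shortened URL is hypothetical; this is not Tornado or whatwg-url code):

```python
from urllib.parse import urlsplit, urlunsplit

# Query percent-encode set per my reading of the WHATWG URL spec,
# for a special scheme like https (an assumption, not library code).
QUERY_ESCAPE = set(' "#<>\'')

def encode_query(query):
    out = []
    for ch in query:
        if ch in QUERY_ESCAPE or ord(ch) < 0x20 or ord(ch) > 0x7e:
            # Percent-encode each UTF-8 byte of the character.
            out.append(''.join('%{:02X}'.format(b)
                               for b in ch.encode('utf-8')))
        else:
            out.append(ch)
    return ''.join(out)

# Hypothetical shortened URL for illustration.
url = ("https://example.com/page.aspx"
       "?utm_campaign=Feed: jmeier (J.D. Meier's Blog)&Redirected=true")
parts = urlsplit(url)
fixed = urlunsplit(parts._replace(query=encode_query(parts.query)))
# fixed ends with:
#   utm_campaign=Feed:%20jmeier%20(J.D.%20Meier%27s%20Blog)&Redirected=true
# which matches what whatwg_url produced in the session above.
```

That at least pins down a spec-derived answer for the query part; the path and fragment components have their own (different) percent-encode sets in the same spec.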
I've hit a URL that the httpclient is failing on but that works with urllib. Below is a snippet of code with the URL, showing that it produces a 400 Bad Request response on the httpclient side.