tornadoweb / tornado

Tornado is a Python web framework and asynchronous networking library, originally developed at FriendFeed.
http://www.tornadoweb.org/
Apache License 2.0

Tornado httpclient fails requesting a url that urllib works with. #559

Open mitechie opened 12 years ago

mitechie commented 12 years ago

I've hit a url that the httpclient is failing for but that works with urllib. Below is a snippet of code with the url, showing that it produces a 400 Bad Request response on the httpclient side.

import urllib
from tornado import httpclient

url = "https://blogs.msdn.com/b/jmeier/archive/2012/05/13/the-rapid-research-method.aspx?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed: jmeier (J.D. Meier's Blog)&Redirected=true"

fh = urllib.urlopen(url)
# This will load up the content just peachy...
content = fh.read()

# This will get me a 400 bad request response.
http = httpclient.HTTPClient()
try:
    response = http.fetch(url)
    print "Content should be in here."
except Exception, e:
    print "but it goes BOOM!"
bdarnell commented 12 years ago

Technically that url is invalid because urls are not supposed to contain spaces. Browsers and other HTTP clients tend to "helpfully" rewrite the invalid url, although I'm not sure if there are any rules as to the right way to do it (it's not as simple as urllib.quote, since you don't want to encode the ampersands and other characters that would normally be percent-encoded).

mitechie commented 12 years ago

Yea, I was trying to find some way to break it apart and manually escape it, but when it worked in urllib I wondered if maybe this is something that could be picked up and ported or something.

ghost commented 10 years ago

Under the covers, urllib.urlopen() calls urllib.quote() with a safe character set of "%/:=&?~#+!$,;'@()*[]|". Is it worth adding that "helpfulness" to Tornado's httpclient? If that type of magic is better handled outside Tornado, perhaps this issue can be closed.
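
As a quick sketch (using Python 3's urllib.parse.quote, which accepts the same safe argument), quoting the URL from this issue with that safe set percent-encodes only the spaces:

import urllib.parse

url = ("https://blogs.msdn.com/b/jmeier/archive/2012/05/13/"
       "the-rapid-research-method.aspx?utm_source=feedburner"
       "&utm_medium=feed&utm_campaign=Feed: jmeier"
       " (J.D. Meier's Blog)&Redirected=true")

# The safe set that Python 2's urllib.urlopen() used internally.
safe = "%/:=&?~#+!$,;'@()*[]|"
print(urllib.parse.quote(url, safe=safe))
# ...&utm_campaign=Feed:%20jmeier%20(J.D.%20Meier's%20Blog)&Redirected=true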

bdarnell commented 10 years ago

Is there any documentation of what browsers do (or should do) here? (maybe in html5?) If there's a standard to follow then I'm OK with adding that to Tornado, but I'd rather not add a bit of copy/pasted magic that may or may not be the same as what's used elsewhere.

kinow commented 5 years ago

(coming from a Java background, so 'scuse me if I say anything too silly here)

I think at least in Java most libraries try to follow the standards, but also make some exceptions to be more lenient about what users type, as well as accommodate the occasional platform/browser-specific hack.

Maybe related to Jon Postel's "be liberal in what you accept, and conservative in what you send"... in this case it appears Tornado's HTTP client is in the tough position of being at both ends: receiving URLs from users that may be broken, while having to send them on to the server...

I think this could be solved by providing documentation/examples to users on how to escape their URLs before calling Tornado?

This updated example gives me similar error output, just with Python 3 and minor changes to print/import statements.

import urllib.request

from tornado import httpclient

url = "https://blogs.msdn.com/b/jmeier/archive/2012/05/13/the-rapid-research-method.aspx?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed: jmeier (J.D. Meier's Blog)&Redirected=true"

fh = urllib.request.urlopen(url)
# This will load up the content just peachy...
content = fh.read()

# This will get me a 400 bad request response.
http = httpclient.HTTPClient()
try:
    response = http.fetch(url)  # type: httpclient.HTTPResponse
    print("Content should be in here.")
except Exception as e:
    print("but it goes BOOM!")

But this one, with urllib.parse.quote, works:

import urllib.request
import urllib.parse

from tornado import httpclient

url = "https://blogs.msdn.com/b/jmeier/archive/2012/05/13/the-rapid-research-method.aspx?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed: jmeier (J.D. Meier's Blog)&Redirected=true"

safe_with_percent = "!#$&'()*+,/:;=?@[]~"
url_clean = urllib.parse.quote(url, safe_with_percent)

fh = urllib.request.urlopen(url_clean)
# This will load up the content just peachy...
content = fh.read()

# With the quoted URL, this now succeeds.
http = httpclient.HTTPClient()
try:
    response = http.fetch(url_clean)  # type: httpclient.HTTPResponse
    print("Content should be in here.")
except Exception as e:
    print("but it goes BOOM!")

The resulting value of url_clean for me is: 'https://blogs.msdn.com/b/jmeier/archive/2012/05/13/the-rapid-research-method.aspx?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed:%20jmeier%20(J.D.%20Meier\'s%20Blog)&Redirected=true', which works with urllib, with requests, and in my browser (after a redirect).

The safe-character value in the working code is partially copied from the requests library (there's a second, fallback value there): it comes from requests.utils.requote_uri, which calls urllib.parse.quote.

Alternatively, it might be possible to incorporate an escape value like the one from the requests library into Tornado. It looks to me like the HTTP clients already have a dependency on urllib; they're just not calling the quote function before using the URL, I think.
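
For what it's worth, the kind of pre-processing being discussed can already be done today by borrowing requests' helper before calling fetch. The requote_and_fetch wrapper below is hypothetical, just a sketch:

import requests.utils
from tornado import httpclient

def requote_and_fetch(url):
    # Hypothetical wrapper: normalize the URL the way requests does,
    # then hand the cleaned URL to Tornado's synchronous client.
    clean = requests.utils.requote_uri(url)
    return httpclient.HTTPClient().fetch(clean)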

bdarnell commented 5 years ago

Just adding a call to urllib.parse.quote would break things if the input URL were already correctly encoded. It's very important that being liberal in accepting not-quite-correct input doesn't lead to handling correct input incorrectly. I think this probably means copying the whole requote method instead of just its use of quote().
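
A minimal sketch of the hazard (the example.com URL is made up): once a URL is already correctly encoded, a second naive quote() pass corrupts it, while keeping "%" in the safe set, as in safe_with_percent above, leaves existing escapes alone:

import urllib.parse

# A URL that is already correctly percent-encoded.
encoded = "https://example.com/a.aspx?q=J.D.%20Meier"

# Naive re-quoting double-encodes the existing escape: %20 -> %2520.
print(urllib.parse.quote(encoded, safe="/:?=&"))
# https://example.com/a.aspx?q=J.D.%2520Meier

# With "%" in the safe set, existing escapes survive, but then a
# literal "%" in the input can never be encoded correctly.
print(urllib.parse.quote(encoded, safe="%/:?=&"))
# https://example.com/a.aspx?q=J.D.%20Meier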

My only hesitation is about copy/pasting strings of magic without authoritative references. In the years since this issue was opened, the WHATWG has published a standard for URLs-in-the-real-world (https://url.spec.whatwg.org/) which is what we should be following here.

kinow commented 5 years ago

Good point on the danger of simply calling urllib.parse.quote twice.

I think this probably means copying the whole requote method instead of just its use of quote().

That would be a small portion of code copied; I think that'd be OK... (a rough sketch of that ported logic is below).
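
A rough, hedged sketch of that ported logic, simplified from the unquote-then-quote idea in requests.utils.requote_uri (not a drop-in implementation):

import re
from urllib.parse import quote

# Characters the URI RFC treats as unreserved; their %XX escapes are
# safe to decode before re-quoting.
UNRESERVED = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
              "abcdefghijklmnopqrstuvwxyz"
              "0123456789-._~")

def unquote_unreserved(uri):
    # Decode only the %XX escapes of unreserved characters,
    # leaving all other escapes (e.g. %20) intact.
    parts = uri.split("%")
    for i in range(1, len(parts)):
        h = parts[i][:2]
        if re.fullmatch("[0-9A-Fa-f]{2}", h):
            c = chr(int(h, 16))
            parts[i] = (c + parts[i][2:]) if c in UNRESERVED else ("%" + parts[i])
        else:
            raise ValueError("invalid percent-escape: %%%s" % h)
    return "".join(parts)

def requote(uri):
    try:
        # "%" is in the safe set, so escapes left by unquote_unreserved survive.
        return quote(unquote_unreserved(uri), safe="!#$%&'()*+,/:;=?@[]~")
    except ValueError:
        # The URI contains a stray "%": quote everything, "%" included.
        return quote(uri, safe="!#$&'()*+,/:;=?@[]~")

Running requote on the URL from this issue yields the same %20-escaped result as the safe_with_percent example above.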

My only hesitation is about copy/pasting strings of magic without authoritative references. In the years since this issue was opened, the WHATWG has published a standard for URLs-in-the-real-world (https://url.spec.whatwg.org/) which is what we should be following here.

And here's where I got stuck. The WHATWG spec seems quite broad, and it's also a bit hard to find exactly which characters to use in that ported requote function.

There's a library, whatwg-url, that correctly parses the URL from this issue, and it appears to be based on the WHATWG standard, though I hadn't heard about it until searching online.

>>> import whatwg_url
>>> u = "https://blogs.msdn.com/b/jmeier/archive/2012/05/13/the-rapid-research-method.aspx?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed: jmeier (J.D. Meier's Blog)&Redirected=true"
>>> parseresult = whatwg_url.urlparse(u)
>>> parseresult
ParseResult(scheme='https', netloc='blogs.msdn.com', path='/b/jmeier/archive/2012/05/13/the-rapid-research-method.aspx', params='', query='utm_source=feedburner&utm_medium=feed&utm_campaign=Feed:%20jmeier%20(J.D.%20Meier%27s%20Blog)&Redirected=true', fragment='')
>>> parseresult.geturl()
'https://blogs.msdn.com/b/jmeier/archive/2012/05/13/the-rapid-research-method.aspx?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed:%20jmeier%20(J.D.%20Meier%27s%20Blog)&Redirected=true'

I guess we probably don't want to add a library just for that... but it's also not easy (for me at least) to find which characters are considered safe (with/without percent) for the requote function :confused:
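
For what it's worth, my (hedged) reading of the WHATWG spec is that queries of special schemes like https use a "special-query percent-encode set": C0 controls and non-ASCII code points, plus space, '"', '#', '<', '>', and "'". A toy sketch of encoding just the query with that set reproduces the whatwg-url output above:

def encode_special_query(query):
    # Toy sketch: percent-encode per the WHATWG special-query set
    # (C0 controls, code points above U+007E, space, '"', '#', '<', '>', "'").
    out = []
    for ch in query:
        if ord(ch) < 0x20 or ord(ch) > 0x7e or ch in ' "#<>\'':
            out.append("".join("%%%02X" % b for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)

q = "utm_campaign=Feed: jmeier (J.D. Meier's Blog)"
print(encode_special_query(q))
# utm_campaign=Feed:%20jmeier%20(J.D.%20Meier%27s%20Blog)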