urllib.urlopen.geturl() and redirects #36952

Closed doko42 closed 22 years ago

doko42 commented 22 years ago
BPO 588714
Nosy @mwhudson, @doko42

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:
```python
assignee = None
closed_at =
created_at =
labels = ['invalid', 'library']
title = 'urllib.urlopen.geturl() and redirects'
updated_at =
user = 'https://github.com/doko42'
```

bugs.python.org fields:
```python
activity =
actor = 'jhylton'
assignee = 'jhylton'
closed = True
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'doko'
dependencies = []
files = []
hgrepos = []
issue_num = 588714
keywords = []
message_count = 4.0
messages = ['11757', '11758', '11759', '11760']
nosy_count = 3.0
nosy_names = ['mwh', 'jhylton', 'doko']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue588714'
versions = ['Python 2.2']
```

doko42 commented 22 years ago

[From http://bugs.debian.org/146408]

From: Matthew Vernon <matthew@pick.ucam.org>
Subject: python2.2: urllib.urlopen.geturl() fails to deal with redirects properly

The documentation for urllib.urlopen.geturl() claims:

"The geturl() method returns the real URL of the page. In some cases, the HTTP server redirects a client to another URL. The urlopen() function handles this transparently, but in some cases the caller needs to know which URL the client was redirected to. The geturl() method can be used to get at this redirected URL."

But it appears not to:

>>> urllib.urlopen("http://www.google.com/search?q=test&btnI=I'm+Feeling+Lucky").geturl()
"http://www.google.com/search?q=test&btnI=I'm+Feeling+Lucky"

Doing the same by steam:

HEAD http://www.google.com/search?q=test&btnI=I'm+Feeling+Lucky HTTP/1.1
Host: www.google.com

HTTP/1.0 302 Moved Temporarily
Content-Length: 151
Server: GWS/2.0
Date: Thu, 09 May 2002 16:51:37 GMT
Location: http://www.toefl.org/
Content-Type: text/html

mwhudson commented 22 years ago

Logged In: YES user_id=6656

Something even weirder happens when I try urllib2:

>>> urllib2.urlopen("http://www.google.com/search?q=test&btnI=I'm+Feeling+Lucky").geturl()

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/home/mwh/src/python/dist/src/Lib/urllib2.py", line 136, in urlopen
    return _opener.open(url, data)
  File "/home/mwh/src/python/dist/src/Lib/urllib2.py", line 324, in open
    '_open', req)
  File "/home/mwh/src/python/dist/src/Lib/urllib2.py", line 303, in _call_chain
    result = func(*args)
  File "/home/mwh/src/python/dist/src/Lib/urllib2.py", line 792, in http_open
    return self.do_open(httplib.HTTP, req)
  File "/home/mwh/src/python/dist/src/Lib/urllib2.py", line 786, in do_open
    return self.parent.error('http', req, fp, code, msg, hdrs)
  File "/home/mwh/src/python/dist/src/Lib/urllib2.py", line 350, in error
    return self._call_chain(*args)
  File "/home/mwh/src/python/dist/src/Lib/urllib2.py", line 303, in _call_chain
    result = func(*args)
  File "/home/mwh/src/python/dist/src/Lib/urllib2.py", line 402, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden

(sf is going to mangle that traceback, I can tell).

03bde425-37ce-4291-88bd-d6cecc46a30e commented 22 years ago

Logged In: YES user_id=31392

The body of the error message is interesting. Google is explicitly refusing to serve requests issued by urllib and urllib2. It appears to be keying on the User-Agent field.

<HTML><HEAD><TITLE>403 Forbidden</TITLE></HEAD> <BODY><H1>403 Forbidden</H1> Your client does not have permission to get URL <code>/search?q=test&btnI=I'm+Feeling+Lucky</code> from this server. (Client IP address: 208.251.201.35)<BR><BR> Please see Google's Terms of Service posted at http://www.google.com/terms_of_service.html <BR><BR><P>If you believe that you have received this response in error, please send email to <A href="mailto:forbidden@google.com">forbidden@google.com</A>. Before sending this email, however, please make sure to take a look at our Terms of Service (http://www.google.com/terms_of_service.html).In your email, please send us the <b>entire</b> code displayed below. Please also send us any information you may know about how you are performing your Google searches-- for example, "I'm using the Opera browser on Linux to do searches from home. My Internet access is through a dial-up account I have with the FooCorp ISP." or "I'm using the Konqueror browser on Linux to search from my job at myFoo.com. My machine's IP address is 10.20.30.40, but all of myFoo's web traffic goes through some kind of proxy server whose IP address is 10.11.12.13." (If you don't know any information like this, that's OK. But this kind of information can help us track down problems, so please tell us what you can.)</P><P>We will use all this information to diagnose the problem, and we'll hopefully have you back up and Googlin' again quickly!</P> <P>Please note that although we read all the email we receive, we are not always able to send a personal response to each and every email. So don't despair if you don't hear back from us!</P> <P>Also note that if you do not send us the <b>entire</b> code below, <i>we will not be able to help you</i>.</P><P>Best wishes,<BR>The Google Team</BR></P><BLOCKQUOTE>/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/<BR> AD1IFXbQ-8kjZGNiMTEAAGtHRVQgL3NlYXJjaD9xPXRlc3QmY<BR> nRuST1JJ20rRmVlbGluZytMdWNreSBIVFRQLzEuMA0KSG9zdD<BR> ogd3d3Lmdvb2dsZS5jb20NClVzZXItYWdlbnQ6IFB5dGhvbi1<BR> 1cmxsaWIvMi4wYTENCrSY3UI=<BR> +/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+<BR></BLOCKQUOTE>

</BODY></HTML>
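The claim that the server keys on the User-Agent field can be checked against the code block embedded in the 403 page. The following is a minimal sketch, assuming the block is URL-safe base64 (an assumption suggested by the `-` character in its first line, not something the page states). The names `ENCODED_LINES` and `decode_diagnostic` are hypothetical and exist only for this illustration.

```python
import base64

# The code block from the 403 page, with the <BR> separators removed.
ENCODED_LINES = [
    "AD1IFXbQ-8kjZGNiMTEAAGtHRVQgL3NlYXJjaD9xPXRlc3QmY",
    "nRuST1JJ20rRmVlbGluZytMdWNreSBIVFRQLzEuMA0KSG9zdD",
    "ogd3d3Lmdvb2dsZS5jb20NClVzZXItYWdlbnQ6IFB5dGhvbi1",
    "1cmxsaWIvMi4wYTENCrSY3UI=",
]


def decode_diagnostic(lines):
    """Join the lines and decode them as URL-safe base64 (an assumption)."""
    return base64.urlsafe_b64decode("".join(lines))


decoded = decode_diagnostic(ENCODED_LINES)
# A few non-text bytes surround the request, so print the repr
# rather than trying to decode the whole payload as text.
print(repr(decoded))
```

Under that assumption, the decoded bytes contain the original request line and headers, including a `User-agent: Python-urllib/2.0a1` header, surrounded by a few bytes whose meaning is not documented here.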

03bde425-37ce-4291-88bd-d6cecc46a30e commented 22 years ago

Logged In: YES user_id=31392

The original bug report was that geturl() returns the incorrect result. In this case, it has returned the correct URL because Google did not redirect the request. There is no Python bug here, so I trust the Debian folks will close their bug report, too. The original poster should probably take up the issue with Google, or set a custom User-Agent header.
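For readers on current Python, the suggested workaround can be sketched as follows. This is an illustration only: it uses Python 3's `urllib.request` (which replaced the `urllib` and `urllib2` modules discussed in this thread) and a throwaway local server in place of Google. The handler `DemoHandler`, the function `run_demo`, the paths `/search` and `/landing`, and the agent string `example-client/0.1` are all hypothetical.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a server that refuses urllib's default agent."""

    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        if agent.startswith("Python-urllib"):
            # Mimic the behaviour reported in this issue: refuse the client.
            self.send_error(403, "Forbidden")
        elif self.path == "/search":
            # Any other client is redirected, as in the HEAD exchange above.
            self.send_response(302)
            self.send_header("Location", "/landing")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"landing page"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    # An empty ProxyHandler keeps environment proxy settings out of the demo.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        # With the default "Python-urllib/x.y" agent the request is refused.
        try:
            opener.open(base + "/search").close()
            default_status = 200
        except urllib.error.HTTPError as exc:
            default_status = exc.code

        # With a custom agent the redirect is followed and geturl() reports it.
        request = urllib.request.Request(
            base + "/search", headers={"User-Agent": "example-client/0.1"}
        )
        with opener.open(request) as response:
            final_url = response.geturl()
    finally:
        server.shutdown()
        server.server_close()
    return base, default_status, final_url


if __name__ == "__main__":
    base, default_status, final_url = run_demo()
    print(default_status)
    print(final_url)
```

In this sketch the first request fails with a 403 and never reaches the redirect, which matches the diagnosis above: geturl() had nothing to report because no redirect was ever served to the urllib client.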