jjlee / mechanize

Stateful programmatic web browsing in Python, after Andy Lester's Perl module WWW::Mechanize.
http://wwwsearch.sourceforge.net/mechanize/

Browser.retrieve, original filename and incomplete httplib.HTTPMessage RFC822 header parsing #36

Open jrjsmrtn opened 13 years ago

jrjsmrtn commented 13 years ago

I had some issues with Browser.retrieve and original filenames, at least in Python 2.6:

  1. Browser.retrieve(someurl) returns a (tmp_filename, httplib.HTTPMessage), with a temporary filename from tempfile.mkstemp;
  2. Browser.retrieve(someurl, filename) returns a (filename, httplib.HTTPMessage);
  3. but there's no way to get the original filename, even if it's present in the httplib.HTTPMessage headers as 'Content-disposition: attachment; filename="abcd.xyz"'.

That's not really mechanize's fault: httplib.HTTPMessage lacks the 'get_filename' method and the more generic 'get_param' method needed to extract such header parameters, both of which are present in the email.message.Message class. httplib.HTTPMessage does have a 'getparam' method, but unfortunately it only works for parsing the 'content-type' header.
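For reference, here is a minimal sketch of what the email.message.Message methods return for such a header. The raw header text is made up for illustration (it reuses the 'abcd.xyz' example above), and the headers are parsed with the standard email package rather than httplib:

```python
from email import message_from_string

# Illustrative raw headers (an assumption, not captured from a real server).
raw_headers = (
    'Content-Type: application/octet-stream\r\n'
    'Content-Disposition: attachment; filename="abcd.xyz"\r\n'
    '\r\n'
)

# message_from_string returns an email.message.Message instance.
msg = message_from_string(raw_headers)

# get_filename() reads the 'filename' parameter of Content-Disposition.
original_filename = msg.get_filename()

# get_param() is the generic form: it accepts any header name.
disposition_param = msg.get_param("filename", header="content-disposition")

print(original_filename)
print(disposition_param)
```

Both calls strip the surrounding quotes from the parameter value, which is the behaviour missing from 'getparam'.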

I submitted an issue on the Python tracker (http://bugs.python.org/issue11316) and proposed a 'monkeypatch_http_message' decorator as a workaround, so we can do:

import mechanize 
from some.module import monkeypatch_http_message 

browser = mechanize.Browser() 
(tmp_filename, headers) = browser.retrieve(someurl) 

# monkeypatch the httplib.HTTPMessage instance 
monkeypatch_http_message(headers) 

# yeah... my original filename, finally 
filename = headers.get_filename() 
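An alternative that avoids monkeypatching might look like the sketch below. This is not the 'monkeypatch_http_message' decorator from the Python tracker issue: 'filename_from_headers' is a hypothetical helper name, and it only assumes that the headers object has a dict-style 'get' method. It copies the header value into a fresh email.message.Message and lets that class do the parsing:

```python
from email.message import Message


def filename_from_headers(headers):
    """Return the Content-Disposition filename, or None if absent.

    `headers` is assumed to be any object with a dict-style get() method;
    this helper is hypothetical and not part of mechanize or httplib.
    """
    value = headers.get("Content-Disposition")
    if value is None:
        return None
    # Reuse the parameter parsing that email.message.Message already provides.
    msg = Message()
    msg["Content-Disposition"] = value
    return msg.get_filename()


# A plain dict stands in for the headers object in this example.
example_headers = {"Content-Disposition": 'attachment; filename="abcd.xyz"'}
print(filename_from_headers(example_headers))
print(filename_from_headers({}))
```

With browser.retrieve, the second element of the returned tuple would be passed as 'headers'. The name is a server-supplied value, so it is worth sanitising before using it as a local path.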

Once again, that's the situation in Python 2.6. According to http://bugs.python.org/issue4773, httplib.HTTPMessage in Python 3.x is using email.message.Message underneath.
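On Python 3 this can be checked quickly. The snippet below is a small sketch that builds the headers object by hand, with the same made-up 'abcd.xyz' value, rather than fetching a real URL:

```python
import email.message
import http.client

# In Python 3, http.client.HTTPMessage subclasses email.message.Message.
is_email_message = issubclass(http.client.HTTPMessage, email.message.Message)

# Built by hand for illustration; a real instance would come from a response.
headers = http.client.HTTPMessage()
headers["Content-Disposition"] = 'attachment; filename="abcd.xyz"'

# get_filename() is inherited from email.message.Message.
filename = headers.get_filename()

print(is_email_message)
print(filename)
```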

(PS: this is an edited repost of issue 35, which I closed by mistake...)