jjlee / mechanize

Stateful programmatic web browsing in Python, after Andy Lester's Perl module WWW::Mechanize.
http://wwwsearch.sourceforge.net/mechanize/

Browser.retrieve, original filename and incomplete httplib.HTTPMessage RFC822 header parsing #35

Closed jrjsmrtn closed 13 years ago

jrjsmrtn commented 13 years ago

I had some issues with Browser.retrieve and original filenames:

  1. Browser.retrieve(someurl) returns a (tmp_filename, httplib.HTTPMessage), with a temporary filename from tempfile.mkstemp;
  2. Browser.retrieve(someurl, filename) returns a (filename, httplib.HTTPMessage);
  3. but there's no way to get the original filename, even if it's present in the 'Content-disposition: attachment; filename="abcd.xyz"' httplib.HTTPMessage header.

That's not really mechanize's fault: httplib.HTTPMessage lacks the 'get_filename' method, and the more generic 'get_param' method, needed to extract those header parameters. Both are present in the email.message.Message class.

httplib.HTTPMessage does have a 'getparam' method, but unfortunately it only works on the 'content-type' header.
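To illustrate what email.message.Message offers here, this is a minimal sketch that copies the Content-Disposition value into a fresh Message and lets it do the parameter parsing. The helper name 'original_filename' is hypothetical (it is not part of mechanize or httplib), and it assumes the headers object exposes a mapping-style get(), which a plain dict stands in for below.

```python
from email.message import Message


def original_filename(headers):
    """Hypothetical helper: return the filename advertised in the
    Content-Disposition header, or None if there is none.

    Assumes `headers` has a mapping-style get() method.
    """
    value = headers.get("Content-Disposition")
    if value is None:
        return None
    # Let email.message.Message parse the header parameters for us.
    msg = Message()
    msg["Content-Disposition"] = value
    return msg.get_filename()


# A plain dict stands in for the headers returned by Browser.retrieve.
example_headers = {"Content-Disposition": 'attachment; filename="abcd.xyz"'}
filename = original_filename(example_headers)

# The more generic get_param can read any parameter of any header.
msg = Message()
msg["Content-Disposition"] = example_headers["Content-Disposition"]
same_filename = msg.get_param("filename", header="content-disposition")
```

This only reads the header value, so it leaves the original headers object untouched, unlike a monkeypatch.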

I submitted an issue on the Python tracker (http://bugs.python.org/issue11316) and proposed a 'monkeypatch_http_message' decorator as a workaround, so we can do:

import mechanize
from some.module import monkeypatch_http_message
browser = mechanize.Browser()
(tmp_filename, headers) = browser.retrieve(someurl) 

# monkeypatch the httplib.HTTPMessage instance
monkeypatch_http_message(headers)

# yeah... my original filename, finally
filename = headers.get_filename()

jrjsmrtn commented 13 years ago

A clarification: this is the situation in Python 2.6. In Python 3, httplib (renamed http.client) seems to use email.message.Message underneath.
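A quick way to check this on Python 3 is sketched below. It parses a hand-written header block (the header values are made up for illustration) into an http.client.HTTPMessage and calls get_filename on it directly, with no monkeypatching.

```python
import email
import email.message
import http.client

# Made-up response headers, for illustration only.
raw_headers = (
    "Content-Type: application/octet-stream\r\n"
    'Content-Disposition: attachment; filename="abcd.xyz"\r\n'
    "\r\n"
)

# Parse the block into the message class http.client uses for headers.
headers = email.message_from_string(raw_headers, _class=http.client.HTTPMessage)

# On Python 3, HTTPMessage derives from email.message.Message ...
is_email_message = isinstance(headers, email.message.Message)

# ... so get_filename and get_param are available as-is.
filename = headers.get_filename()
disposition = headers.get_param("filename", header="content-disposition")
```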

jrjsmrtn commented 13 years ago

Oops... closed by mistake :-|