jjlee / mechanize

Stateful programmatic web browsing in Python, after Andy Lester's Perl module WWW::Mechanize.
http://wwwsearch.sourceforge.net/mechanize/

Plan ongoing support for this library or mark it deprecated here & on PyPI #117

Closed jamesbroadhead closed 7 years ago

jamesbroadhead commented 7 years ago

Sadly, although this library has many users, is linked to from many tutorials and has many active or semi-active forks, this primary branch is unmaintained & there hasn't been a release on PyPI for 5 years.

Significantly, there is no python3 support in this branch (although there are forks with implementations). Meanwhile, there are other, similar libraries available (such as MechanicalSoup) which are python3-compatible and are being actively maintained. This project has quite high google-ability, so we should act so that new developers either find a live project, or are warned about the current situation before they start using mechanize.

I have contacted @jjlee & he is willing to pass the torch if someone is willing to step forward and be the primary maintainer.

We should work together to decide what the future of mechanize will be. It seems to me that the options we have are:

  1. find a maintainer, merge the forks & work on a python3-compatible release.
  2. publicly mark this library as deprecated & start migrating projects to use other libraries

I am sending a link to this Issue to everyone who has contributed to all forks of mechanize here on github. Please discuss & vote here (or suggest other alternatives)

tachang commented 7 years ago

Thanks for taking the lead on this! I don't really have much to add but love to see open source projects live on.

kovidgoyal commented 7 years ago

I am happy to maintain mechanize. The biggest single change that is needed is the ability to use custom HTML parsers and to change the default parser to html5lib. That will fix a whole load of bugs when using mechanize to log into modern websites. Another good feature to have is the ability to clone mechanize browser instances so that they can be used from multiple threads while sharing the same cookie jar (I already have this implemented using an unlovely hack, but it would be good to implement it properly). And finally, mechanize really needs better documentation.
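The cookie-sharing idea can be sketched with the standard library alone. This is not mechanize's API and not the hack mentioned above; it is only a minimal illustration, assuming Python 3's `urllib.request` and `http.cookiejar`. `make_opener` is a hypothetical helper name.

```python
import http.cookiejar
import urllib.request


def make_opener(jar):
    """Build an independent opener that reads and writes cookies in `jar`.

    Hypothetical helper for illustration; not part of mechanize.
    """
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


# A single jar holds the session cookies.
shared_jar = http.cookiejar.CookieJar()

# Each opener could be handed to a different thread; only the jar is shared.
opener_a = make_opener(shared_jar)
opener_b = make_opener(shared_jar)
```

A cookie stored after a login through `opener_a` would then be sent on later requests made through `opener_b`, because both consult the same jar.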

I don't care about python 3 support myself, but I am willing to work with someone who does.

I have limited time to devote to it, and user support in particular is going to be difficult for me, but hopefully if we can get a community going, that will help.

As for my qualifications -- I maintain several open source python 2 and python 3 projects (see my github profile), one of which, calibre, uses mechanize extensively and is itself used by millions of people.

kovidgoyal commented 7 years ago

Oh, and I already have a fork of mechanize: https://github.com/kovidgoyal/mechanize

kovidgoyal commented 7 years ago

Another big feature would be adding support for keep-alive in HTTP/1.1 (and perhaps HTTP/2 eventually?), which would greatly improve performance for many common scraping tasks. However, doing this would be a pretty invasive change since, IIRC, mechanize inherits the request/response design from urllib2 and that assumption is baked into its guts pretty deeply. It would probably require dropping backwards compat.
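For readers unfamiliar with keep-alive, here is a rough sketch of what connection reuse looks like at the `http.client` level in Python 3. It is not mechanize code and says nothing about how mechanize would implement it; `fetch_many` is a hypothetical helper, and the host, port and paths are whatever the caller supplies.

```python
import http.client


def fetch_many(host, port, paths):
    """GET several paths over one TCP connection (HTTP/1.1 keep-alive).

    Hypothetical helper for illustration. Returns a list of
    (status, body) tuples in the order the paths were given.
    """
    conn = http.client.HTTPConnection(host, port, timeout=5)
    results = []
    try:
        for path in paths:
            conn.request("GET", path)
            resp = conn.getresponse()
            # The body must be read in full before the connection
            # can carry the next request.
            results.append((resp.status, resp.read()))
    finally:
        conn.close()
    return results
```

If the server keeps the connection open, all requests travel over the same socket, which avoids a fresh TCP handshake per page. An opener that creates a new connection for every request cannot benefit from this.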

hickford commented 7 years ago

Two Python 3-compatible alternatives to mechanize are https://github.com/jmcarp/robobrowser and https://github.com/hickford/MechanicalSoup, both based on requests and BeautifulSoup

jamesbroadhead commented 7 years ago

I've emailed @jjlee & @kovidgoyal to move this forward. I've also messaged github to enquire if the "mechanize" account qualifies as dormant for re-use.

jjlee commented 7 years ago

Thanks Kovid and sorry everybody for being an absent project owner! For a long time I intended to still do maintenance releases, but you know, other stuff took priority. Later, after so many years without any releases, I wasn't really aware it was still being used much... but I'm glad it's still useful to somebody!

I've added Kovid as an Owner of the PyPI project, and he's created https://github.com/python-mechanize -- the project will be moving there and at some point the code will get removed from this original fork, jjlee/mechanize, just to avoid confusion.

Obviously Kovid is well known to the Python community and I'm sure will do a great job, though I know he'll be looking for contributions from other people, having plenty on his plate already.

kovidgoyal commented 7 years ago

Thanks John. The code now lives at: https://github.com/python-mechanize/mechanize

As a first step, I am going to work on getting all tests to pass and set up CI testing on Travis.

jamesbroadhead commented 7 years ago

Super glad that we managed to figure this out -- just nudged all the open pull-requests. Marking this closed.

kovidgoyal commented 7 years ago

Just an update on what's been done so far:

* mechanize now requires python >= 2.7.0
* When processing cookies that have a blank (unset) path, assume the path is /. Mimics modern browser behavior.
* Support PyPy (added to continuous integration testing)
* Make the global urlopen/urlretrieve methods threadsafe
* Add support for user supplied CA certificates
* Support HTML 5 (all html is now parsed using html5lib)
* Backward incompatibility: The factory keyword argument to Browser is no longer allowed
* Backward incompatibility: Browser.forms() and Browser.links() return unicode strings instead of byte strings
* Backward incompatibility: When searching for a form control if more than one control matches, an AmbiguityError is always raised
* Backward incompatibility: There is no longer a mechanize.ParseError class. Parsing now uses the HTML 5 algorithm, which almost never fails.
* Backward incompatibility: For links that do not have any text the text attribute is now always an empty string instead of None or an empty string.
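To make the AmbiguityError change concrete for anyone updating old scripts, the rule can be sketched in plain Python. Everything here is a stand-in: the `AmbiguityError` class is defined locally rather than imported from mechanize, `find_unique_control` is a hypothetical helper, and controls are simplified to `(name, value)` tuples.

```python
class AmbiguityError(Exception):
    """Local stand-in for the error described above; not mechanize's class."""


def find_unique_control(controls, name):
    """Return the single control called `name`.

    `controls` is a list of (name, value) tuples, a simplified
    stand-in for the controls of a form.
    """
    matches = [c for c in controls if c[0] == name]
    if not matches:
        raise LookupError(f"no control named {name!r}")
    if len(matches) > 1:
        # More than one match is always an error; the first match
        # is never picked silently.
        raise AmbiguityError(f"{len(matches)} controls are named {name!r}")
    return matches[0]


controls = [("q", "python"), ("lang", "en"), ("lang", "fr")]
unique = find_unique_control(controls, "q")
```

Code that relied on a lookup quietly returning the first of several matching controls would need to narrow its search, for example by adding a second criterion, so that exactly one control matches.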

And in addition the size of the codebase has been decreased by some 13,000 lines.

There is basically only one thing left on my TODO list -- adding support for per-domain connection pools/keep-alive/http2. That should be fairly easy to do by leveraging an existing http2 library such as hyper.

kovidgoyal commented 7 years ago

I have released mechanize 0.3.0