niklasb / dryscrape

[not actively maintained] A lightweight Python library that uses Webkit to enable easy scraping of dynamic, Javascript-heavy web pages
http://dryscrape.readthedocs.io/
MIT License

NOTE: This package is not actively maintained. It uses Qt WebKit, which is end-of-life and is unlikely to receive backported security fixes. Consider using a similar package like Spynner instead.

Overview

Author: Niklas Baumstark

dryscrape is a lightweight web scraping library for Python. It uses a headless WebKit instance to evaluate JavaScript on the visited pages. This enables painless scraping of plain web pages as well as JavaScript-heavy “Web 2.0” applications like Facebook.
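As a rough illustration of what such a scraping session can look like, here is a minimal sketch that loads a page, lets WebKit run its JavaScript, and collects the link targets. The function names `collect_links` and `scrape_links` are hypothetical helpers made up for this example. The dryscrape calls (`Session`, `set_attribute`, `visit`, `xpath`, `start_xvfb`) are the ones from dryscrape's documented usage example; check the ReadTheDocs page for the authoritative API.

```python
import sys


def collect_links(session):
    """Return the href of every anchor on the currently loaded page.

    `session` is assumed to behave like a dryscrape session: its xpath()
    method returns nodes whose attributes can be read with node["name"].
    """
    return [node["href"] for node in session.xpath("//a[@href]")]


def scrape_links(url):
    """Visit `url` in a headless WebKit session and return its link targets."""
    # Imported here so the module still loads where dryscrape is missing.
    import dryscrape

    if "linux" in sys.platform:
        # Start a virtual X server in case no X is running (requires xvfb).
        dryscrape.start_xvfb()

    session = dryscrape.Session()
    # Images are not needed to extract links, so skip loading them.
    session.set_attribute("auto_load_images", False)
    session.visit(url)
    return collect_links(session)
```

Calling `scrape_links("http://example.com/")` should return a list of href strings, assuming dryscrape and its Qt dependencies are installed. Keeping `collect_links` separate from the session setup makes the extraction step easy to try out with a stand-in object.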

It is built on the shoulders of capybara-webkit's webkit-server. A big thanks goes to thoughtbot, inc. for building this excellent piece of software!

Changelog

Supported Platforms

The library has been confirmed to work on the following platforms:

Other unixoid systems should work just fine.

Windows is not officially supported, although dryscrape should work with cygwin.

A word about Qt 5.6

The 5.6 version of Qt removes the Qt WebKit module in favor of the new module Qt WebEngine. So far webkit-server has not been ported to WebEngine (and likely won't be in the near future), so Qt <= 5.5 is a requirement.
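The version constraint above can be checked programmatically before attempting an installation. The following is only a sketch: `qt_version_supported` is a hypothetical helper, not part of dryscrape, and it encodes nothing beyond the "Qt <= 5.5" rule stated here.

```python
def qt_version_supported(version):
    """Return True if a Qt version string such as "5.5.1" is at most 5.5.

    Only the major and minor components are compared, as integers, so that
    "5.10" is correctly treated as newer than "5.6".
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) <= (5, 5)
```

The version string could, for example, be taken from the output of `qmake -query QT_VERSION`. Comparing integer tuples instead of raw strings avoids ordering mistakes with two-digit minor versions.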

Installation, Usage, API Docs

Documentation can be found at dryscrape's ReadTheDocs page.

Quick installation instruction for Ubuntu:

# apt-get install qt5-default libqt5webkit5-dev build-essential python-lxml python-pip xvfb
# pip install dryscrape

Contact, Bugs, Contributions

If you have any problems with this software, don't hesitate to open an issue or a pull request on GitHub, or to send a mail to niklas baumstark at Gmail.