kootenpv / sky

:sunrise: next generation web crawling using machine intelligence
BSD 3-Clause "New" or "Revised" License

sky is a web scraping framework written with recent Python versions in mind (3.5+). It is built on asyncio, as well as on many popular modules and extensions.
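As a rough illustration of what building on asyncio buys a crawler, here is a minimal sketch using only the standard library. It is not sky's API: `fetch`, `crawl` and `run_crawl` are hypothetical names, and `fetch` only simulates a network request so the example runs offline.

```python
import asyncio


async def fetch(url, delay=0.01):
    # Hypothetical stand-in for a real HTTP request: it only simulates latency.
    await asyncio.sleep(delay)
    return "<html><title>{}</title></html>".format(url)


async def crawl(urls, max_concurrency=3):
    # The semaphore caps how many fetches are in flight at the same time.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        async with semaphore:
            return url, await fetch(url)

    # gather() runs all fetches concurrently and preserves input order.
    pairs = await asyncio.gather(*(bounded_fetch(u) for u in urls))
    return dict(pairs)


def run_crawl(urls):
    # asyncio.run() needs Python 3.7+; on 3.5/3.6 use
    # asyncio.get_event_loop().run_until_complete(crawl(urls)) instead.
    return asyncio.run(crawl(urls))
```

Calling `run_crawl(["http://example.com/a", "http://example.com/b"])` returns a dict mapping each URL to its (simulated) HTML, with the waits overlapping rather than running one after another.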

Most importantly, it aims for next generation web crawling, where machine intelligence is used to speed up development and maintenance and to improve the reliability of crawling.

It mainly does this by assuming the user is interested in content from whole domains, not just a collection of single pages (the templating approach).
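One way to picture the domain-level idea: text that recurs on most pages of a domain is probably template (menus, footers), and what remains is page content. The sketch below assumes each page has already been split into text blocks. `split_template_and_content` and its `threshold` parameter are hypothetical and are not taken from sky's code.

```python
from collections import Counter


def split_template_and_content(pages, threshold=0.8):
    """Separate domain-wide template blocks from per-page content.

    pages: dict mapping URL -> list of text blocks found on that page.
    threshold: fraction of pages a block must appear on to count as template.
    Returns (template_blocks, content_by_url).
    """
    counts = Counter()
    for blocks in pages.values():
        # Count each block at most once per page.
        counts.update(set(blocks))

    min_pages = threshold * len(pages)
    template = {block for block, n in counts.items() if n >= min_pages}

    # Keep only the blocks that are not part of the shared template.
    content = {
        url: [block for block in blocks if block not in template]
        for url, blocks in pages.items()
    }
    return template, content
```

With three article pages that all contain the same site name and footer line, those two blocks end up in `template` and each URL keeps only its own headline and body. A single page in isolation would not provide this signal, which is the motivation for working per domain.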

See it live in action with a news website YOU propose:

Demo

Note that the following is only meant as a demo of some kind of app that could be built upon the scraping framework.

Make no mistake: the goal is to provide a smart-scraper, not some ugly UI.

Run:

The demo uses a standard configuration that can easily be improved on when setting up a project.



Similar data (title, body, publish_date, images etc) will be very easy to use in your own applications.

Features/Goals

These are the features/goals of sky. Items with a checkmark have been accomplished:

Installation

Use pip to install sky:

pip3 install -U sky

This will install only the required packages. Storing data on the local system does not require any other packages.

To store data elsewhere, the following optional backends are currently available: elasticsearch, cloudant and ZODB.

Using the package

To set up a project/crawling service, see this readme for a "Getting started" guide.

Contribute

It is very much appreciated if you'd like to contribute in one or more of the following areas:

Templating approach

By considering crawl content to originate from a domain, rather than from individual pages, the following will be possible: