opendatatrentino / opendata-harvester

Harvester for OpenData
BSD 2-Clause "Simplified" License

OpenData Harvester

OpenData Harvester, developed as part of the Open Data Trentino project, is a suite of tools for importing batches of datasets from data providers into data catalogs.

Installation

Install the tarball from GitHub:

pip install https://github.com/opendatatrentino/opendata-harvester/tarball/master

Or use the "vanity" url:

pip install https://git.io/harvester.tar.gz

If you plan to import data into CKAN, you'll also need the CKAN API client. To install the stable version from PyPI:

pip install ckan-api-client

Or the latest from git:

pip install http://git.io/ckan-api-client.tar.gz

System dependencies

Several system libraries are required to build some of the dependencies. On Debian:

apt-get install python-dev libxslt1-dev libxml2-dev

Concepts

The package installs a command-line script named harvester, which is used to perform all the needed operations.

The command can be extended with additional plugins, which are registered through entry points.
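As a rough illustration of how entry-point plugins are discovered in general, here is a minimal sketch using the standard library's importlib.metadata. It is not the harvester's own loading code: the group name is a hypothetical placeholder (the real group names are defined in the package's setup.py), and the helper names find_entry_points and load_plugins are made up for this example.

```python
# Sketch only: discover plugins registered under an entry point group.
# The group name below is a hypothetical placeholder, not taken from the
# harvester source; check the package's setup.py for the real names.
from importlib.metadata import entry_points

HYPOTHETICAL_GROUP = "harvester.example.crawlers"


def find_entry_points(group):
    """Return the entry points registered under `group` (possibly none)."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry_points() returns a selectable collection.
        return list(eps.select(group=group))
    # Python 3.8 / 3.9: entry_points() returns a dict keyed by group name.
    return list(eps.get(group, []))


def load_plugins(group):
    """Import every plugin in `group` and return a {name: object} mapping."""
    return {ep.name: ep.load() for ep in find_entry_points(group)}


if __name__ == "__main__":
    # Prints nothing unless some installed package registers this group.
    for name, plugin in sorted(load_plugins(HYPOTHETICAL_GROUP).items()):
        print(name, plugin)
```

A third-party package would make its plugin visible by declaring an entry point in the matching group in its own packaging metadata; once installed, the plugin is found without any change to the harvester itself.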

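There are four plugin types that can be defined: storages (where harvested data is kept), crawlers (which download data from a provider into a storage), converters (which prepare stored data for a target catalog), and importers (which load the converted data into a catalog).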

Core plugins

Storages:

Crawlers:

Converters:

Importers:

Example usage

Download data to MongoDB:

harvester -vvv --debug crawl \
    --crawler pat_statistica \
    --storage mongodb://database.local/harvester_data/statistica
harvester -vvv --debug crawl \
    --crawler pat_statistica_subpro \
    --storage mongodb://database.local/harvester_data/statistica_subpro
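The storage URLs above look like they follow a scheme://host/database/collection layout. That reading is an assumption based only on these examples, not on the harvester's parsing code. Under that assumption, a small sketch of splitting such a URL into its parts (the function name split_storage_url is hypothetical):

```python
# Sketch only: split a storage URL of the form seen in the examples.
# ASSUMPTION: the path is "<database>/<collection>"; the harvester's
# actual URL handling may differ.
from urllib.parse import urlparse


def split_storage_url(url):
    """Return (scheme, host, database, collection) for a storage URL."""
    parsed = urlparse(url)
    parts = parsed.path.strip("/").split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError("expected scheme://host/database/collection: %r" % url)
    database, collection = parts
    return parsed.scheme, parsed.netloc, database, collection


if __name__ == "__main__":
    print(split_storage_url(
        "mongodb://database.local/harvester_data/statistica"))
```

Read this way, each crawler in the example writes to its own collection inside a shared harvester_data database.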

Prepare data for insertion into CKAN:

harvester -vvv --debug convert \
    --converter pat_statistica_to_ckan \
    --input mongodb://database.local/harvester_data/statistica \
    --output mongodb://database.local/harvester_data/statistica_clean
harvester -vvv --debug convert \
    --converter pat_statistica_subpro_to_ckan \
    --input mongodb://database.local/harvester_data/statistica_subpro \
    --output mongodb://database.local/harvester_data/statistica_subpro_clean

Load the data into CKAN:

harvester -vvv --debug import \
    --storage mongodb://database.local/harvester_data/statistica_clean \
    --importer ckan+http://127.0.0.1:5000 \
    --importer-option api_key=00112233-4455-6677-8899-aabbccddeeff \
    --importer-option source_name=statistica
harvester -vvv --debug import \
    --storage mongodb://database.local/harvester_data/statistica_subpro_clean \
    --importer ckan+http://127.0.0.1:5000 \
    --importer-option api_key=00112233-4455-6677-8899-aabbccddeeff \
    --importer-option source_name=statistica_subpro
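The three stages above (crawl, convert, import) repeat the same pattern for each source, so they can be generated from a few parameters. The following is a minimal sketch that only builds and prints the command lines; it uses no flags beyond those shown in the examples. The function name build_pipeline and the "_clean" suffix convention are assumptions modelled on the example commands.

```python
# Sketch only: build the crawl / convert / import command lines for one source.
# Only flags that appear in the examples above are used. The "<source>_clean"
# naming mirrors the examples and is a convention, not a requirement.
import shlex


def build_pipeline(source_name, crawler, converter,
                   mongo_base, ckan_url, api_key):
    """Return a list of three argv lists, one per pipeline stage."""
    raw = "%s/%s" % (mongo_base, source_name)
    clean = raw + "_clean"
    common = ["harvester", "-vvv", "--debug"]
    return [
        common + ["crawl", "--crawler", crawler, "--storage", raw],
        common + ["convert", "--converter", converter,
                  "--input", raw, "--output", clean],
        common + ["import", "--storage", clean,
                  "--importer", "ckan+" + ckan_url,
                  "--importer-option", "api_key=" + api_key,
                  "--importer-option", "source_name=" + source_name],
    ]


if __name__ == "__main__":
    commands = build_pipeline(
        source_name="statistica",
        crawler="pat_statistica",
        converter="pat_statistica_to_ckan",
        mongo_base="mongodb://database.local/harvester_data",
        ckan_url="http://127.0.0.1:5000",
        api_key="00112233-4455-6677-8899-aabbccddeeff",
    )
    for argv in commands:
        # Print each command, shell-quoted, instead of executing it.
        print(" ".join(shlex.quote(arg) for arg in argv))
```

To actually run the stages in order, each argv list could be passed to subprocess.check_call, which stops the sequence at the first failing stage.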

Running with debugger

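Run the harvester script under the Python debugger, for example (if there is no standalone pdb command on your PATH, python -m pdb is equivalent):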

pdb $( which harvester ) -vvv --debug ....