
Exporters is an extensible export pipeline library that supports filtering, transformation, and several sources and destinations.
BSD 3-Clause "New" or "Revised" License

.. _Github repository: https://github.com/scrapinghub/exporters/

===============================
Exporters project documentation
===============================



Exporters provides a flexible way to export data from multiple sources to
multiple destinations, with optional filtering and transformation of the data
along the way.

This `Github repository`_ is used as a central repository.

Full documentation is available at http://exporters.readthedocs.io/en/latest/

Getting Started
===============

Install exporters
-----------------

First, we recommend creating a virtualenv::

    virtualenv exporters
    source exporters/bin/activate

..

Then install exporters with pip::

    pip install exporters

..

Creating a configuration
------------------------

Next, create a configuration object and store it in a file called ``config.json``.
This configuration reads records from an S3 bucket and writes them to the local
filesystem, exporting only the records whose ``country`` field matches ``United States``:

.. code-block:: javascript

   {
        "reader": {
            "name": "exporters.readers.s3_reader.S3Reader",
            "options": {
                "bucket": "YOUR_BUCKET",
                "aws_access_key_id": "YOUR_ACCESS_KEY",
                "aws_secret_access_key": "YOUR_SECRET_KEY",
                "prefix": "exporters-tutorial/sample-dataset"
            }
        },
        "filter": {
            "name": "exporters.filters.key_value_regex_filter.KeyValueRegexFilter",
            "options": {
                "keys": [
                    {"name": "country", "value": "United States"}
                ]
            }
        },
        "writer": {
            "name": "exporters.writers.fs_writer.FSWriter",
            "options": {
                "filebase": "/tmp/output/"
            }
        }
   }
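
If you would rather not keep credentials in a hand-edited file, the same
configuration can be generated from Python. The following is only a sketch:
the helper names ``build_config`` and ``write_config`` and the environment
variable names are assumptions made for this example and are not part of
exporters. The module paths and option keys are copied from the configuration
above.

.. code-block:: python

    import json
    import os


    def build_config(bucket, prefix, country, output_dir):
        """Return the tutorial configuration as a plain dictionary."""
        return {
            "reader": {
                "name": "exporters.readers.s3_reader.S3Reader",
                "options": {
                    "bucket": bucket,
                    # Assumed environment variable names; the placeholders
                    # from the tutorial are used when they are not set.
                    "aws_access_key_id": os.environ.get(
                        "AWS_ACCESS_KEY_ID", "YOUR_ACCESS_KEY"),
                    "aws_secret_access_key": os.environ.get(
                        "AWS_SECRET_ACCESS_KEY", "YOUR_SECRET_KEY"),
                    "prefix": prefix,
                },
            },
            "filter": {
                "name": "exporters.filters.key_value_regex_filter.KeyValueRegexFilter",
                "options": {
                    "keys": [{"name": "country", "value": country}],
                },
            },
            "writer": {
                "name": "exporters.writers.fs_writer.FSWriter",
                "options": {"filebase": output_dir},
            },
        }


    def write_config(config, path="config.json"):
        """Serialize the configuration dictionary to a JSON file."""
        with open(path, "w") as handle:
            json.dump(config, handle, indent=4)


    if __name__ == "__main__":
        write_config(build_config(
            bucket="YOUR_BUCKET",
            prefix="exporters-tutorial/sample-dataset",
            country="United States",
            output_dir="/tmp/output/",
        ))

Running this module writes a ``config.json`` equivalent to the one shown above,
with the credentials taken from the environment when they are available.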

Export with script
------------------

We can use the provided script to run this export:

.. code-block:: shell

    python bin/export.py --config config.json

Use it as a library
-------------------

The export can be run using exporters as a library:

.. code-block:: python

    from exporters import BasicExporter

    exporter = BasicExporter.from_file_configuration('config.json')
    exporter.export()

Resuming an export job
----------------------

Suppose a previous export job failed and left its state in a pickle file. To resume
that job, run the export script with the ``--resume`` option:

.. code-block:: shell

    python bin/export.py --resume pickle://pickle-file.pickle
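
Both script invocations in this document (``--config`` and ``--resume``) can
also be driven from Python, for example from a scheduler. This is a minimal
sketch, not part of exporters: ``build_export_command`` and ``run_export`` are
hypothetical helpers, the script path assumes you run from the repository
root, and the error handling assumes the script signals failure with a
non-zero exit status.

.. code-block:: python

    import subprocess
    import sys


    def build_export_command(config=None, resume=None, script="bin/export.py"):
        """Return the argument list for one run of the export script.

        Pass exactly one of ``config`` (path to a JSON configuration) or
        ``resume`` (a URI such as ``pickle://pickle-file.pickle``).
        """
        if (config is None) == (resume is None):
            raise ValueError("pass exactly one of config or resume")
        if config is not None:
            return [sys.executable, script, "--config", config]
        return [sys.executable, script, "--resume", resume]


    def run_export(**kwargs):
        """Run the export script and raise CalledProcessError on failure."""
        # check=True turns a non-zero exit status into an exception.
        return subprocess.run(build_export_command(**kwargs), check=True)

A caller could first try ``run_export(config="config.json")`` and, if that
raises, later call ``run_export(resume="pickle://pickle-file.pickle")``.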