
pipeline

API Pipeline DB middleware

This repository forms the backbone of the application: the ETL pipeline that extracts, transforms, and loads the scraped data.

This pipeline:

Currently runs a task queue through different stages of jobs before finally outputting to an SQLite database.

It is organised into three layers:

Presentation Layer - inboundAPI and outboundAPI send jobs to celeryBroker
Business Logic Layer - processing > intakejobs cleans inbound articles, reportjob outputs generic reports
Persistence Layer - databaseConn > DAO
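The cleaning step in the business logic layer can be sketched roughly as follows. This is a minimal sketch, not the actual intakejobs code: the function name, required fields, and cleaning rules are assumptions based on the article schema later in this README.

```python
from typing import Optional

# Assumed required fields, mirroring the NOT NULL columns of the
# 'articles' table described below.
REQUIRED_FIELDS = ("title", "author", "project", "content", "url")

def clean_article(raw: dict) -> Optional[dict]:
    """Strip whitespace from string fields; reject articles missing required fields.

    Hypothetical stand-in for the intakejobs cleaning stage.
    """
    article = {k: v.strip() if isinstance(v, str) else v for k, v in raw.items()}
    if any(not article.get(field) for field in REQUIRED_FIELDS):
        return None  # incomplete article: drop it from the queue
    return article
```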

Release Notes


Run Dev Environment

The following script runs the whole pipeline, including the required endpoints:

bash runjob.sh


Celery Broker

Celery is the task scheduler that directs the flow of data through the various tasks and jobs.

py -m celery --app celeryBroker worker --loglevel=INFO -B -s ./data/beat.schedule
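For reference, the kind of periodic schedule the -B (beat) flag runs can be expressed as a plain dict of the shape Celery expects in app.conf.beat_schedule. This is a hypothetical sketch: the entry name, task path, and interval are assumptions, and the real schedule state is persisted to ./data/beat.schedule via the -s flag.

```python
from datetime import timedelta

# Hypothetical beat schedule entry; Celery accepts timedelta (or a number
# of seconds) as the "schedule" value.
beat_schedule = {
    "intake-every-minute": {
        "task": "celeryBroker.intake_job",   # task name is an assumption
        "schedule": timedelta(minutes=1),
    },
}
```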

Flask

The inbound and outbound APIs are started from their respective .py files; however, it is recommended to use the bash script to start both.


Test Scripts

This script will pass dummy data into the API:

py testinboundapi.py
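A dummy payload of the kind such a test script might POST to the inbound API could look like this. A sketch only: the field names mirror the 'articles' schema at the end of this README, and the values are placeholders, not the actual test data.

```python
import json

# Placeholder article; fields follow the 'articles' table schema below.
dummy_article = {
    "title": "Example article",
    "author": "Jane Doe",
    "project": "demo",
    "content": "Body text of the scraped article.",
    "url": "https://example.com/article",
    "word_count": 6,
}

# Serialised form, as it would travel over the inbound API.
payload = json.dumps(dummy_article)
```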

This script will activate the pipeline queue, processing JSON data from ./data/stash:

py testscheduler.py
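The stash-processing step can be sketched as reading every JSON file from the stash directory and handing the parsed articles to the pipeline. The directory name comes from this README; the function name and return shape are assumptions, not the actual scheduler code.

```python
import json
from pathlib import Path

def load_stash(stash_dir: str) -> list:
    """Parse every *.json file in the stash directory, in name order.

    Hypothetical stand-in for the step that feeds ./data/stash into the queue.
    """
    return [
        json.loads(path.read_text())
        for path in sorted(Path(stash_dir).glob("*.json"))
    ]
```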

This script will generate a request for data against the 'None' default group and output a CSV file in the local folder:

py testreport.py
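The report step can be sketched as a query against the articles table for one project group, written out as CSV. A sketch under assumptions: the table and column names follow the schema in this README, but the function name, selected columns, and output layout are placeholders, not the actual reportjob code.

```python
import csv
import sqlite3

def export_report(db_path: str, project: str, out_path: str) -> int:
    """Write articles belonging to `project` to a CSV file; return the row count.

    Hypothetical stand-in for the generic report job.
    """
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT title, author, url FROM articles WHERE project = ?", (project,)
    ).fetchall()
    conn.close()
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["title", "author", "url"])  # header row
        writer.writerows(rows)
    return len(rows)
```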


SQL Debugging:

SQLite CLI tool

sqlite3 <dbname>

DB Browser for SQLite

sqlitebrowser
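For quick checks without leaving Python, the standard-library sqlite3 module can list the database's tables the same way the CLI's .tables command does. The helper name is an assumption; substitute the pipeline's actual database path.

```python
import sqlite3

def list_tables(db_path: str) -> list:
    """Return the names of all tables in an SQLite database."""
    with sqlite3.connect(db_path) as conn:
        return [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
            )
        ]
```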

Technologies and Libraries:

TODO: Wrap up dependencies within Docker


Schema for the SQLite database table 'articles':

    "id"    INTEGER NOT NULL UNIQUE,
    "title" TEXT NOT NULL,
    "author"    TEXT NOT NULL,
    "project"   TEXT NOT NULL,
    "date_published"    TEXT,
    "lead_image_url"    TEXT,
    "content"   TEXT NOT NULL,
    "next_page_url" TEXT,
    "url"   TEXT NOT NULL,
    "domain"    TEXT,
    "excerpt"   TEXT,
    "word_count"    INTEGER,
    "direction" TEXT,
    "total_pages"   INTEGER,
    "rendered_pages"    TEXT,
    "keywords"  TEXT,
    PRIMARY KEY("id" AUTOINCREMENT)
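The schema above can be exercised directly with Python's sqlite3 module. A minimal sketch using an in-memory database and placeholder values; the DDL is trimmed to the NOT NULL columns for brevity.

```python
import sqlite3

# Trimmed version of the 'articles' DDL above (NOT NULL columns only).
DDL = """
CREATE TABLE "articles" (
    "id"      INTEGER NOT NULL UNIQUE,
    "title"   TEXT NOT NULL,
    "author"  TEXT NOT NULL,
    "project" TEXT NOT NULL,
    "content" TEXT NOT NULL,
    "url"     TEXT NOT NULL,
    PRIMARY KEY("id" AUTOINCREMENT)
)
"""

conn = sqlite3.connect(":memory:")  # substitute the pipeline's db path
conn.execute(DDL)
conn.execute(
    "INSERT INTO articles (title, author, project, content, url)"
    " VALUES (?, ?, ?, ?, ?)",
    ("Example", "Jane Doe", "demo", "Body text", "https://example.com"),
)
count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
```

The id column is filled automatically by AUTOINCREMENT, so inserts only need to supply the remaining NOT NULL columns.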