API Pipeline DB middleware
This repository forms the backbone of the application: the pipeline that extracts, transforms, and loads the scraped data.
The pipeline currently runs a task queue through several stages of jobs before finally writing out to an SQLite database. It is organised into three layers (a sketch of the handoff between them follows this list):

Presentation Layer - inboundAPI and outboundAPI send jobs to celeryBroker
Business Logic Layer - processing > intakejobs clean inbound articles, reportjob outputs generic reports
Persistence Layer - databaseConn > DAO
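As a rough illustration of how a job might move from the presentation layer into the business logic layer, the inbound API could hand each scraped article straight to a Celery task. This is a hypothetical sketch only: the Flask usage, endpoint path, and task name are assumptions, not the repository's actual code.

# Hypothetical sketch: an inbound endpoint queues a scraped article for processing.
from flask import Flask, request, jsonify   # assumption: the APIs are Flask apps
from celeryBroker import intake_job         # assumption: intake_job is a Celery task

app = Flask(__name__)

@app.route("/articles", methods=["POST"])
def receive_article():
    payload = request.get_json()
    # Queue the cleaning/intake work instead of doing it in the request cycle.
    intake_job.delay(payload)
    return jsonify({"status": "queued"}), 202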
The following script will run the whole pipeline, including the required endpoints:
bash runjob.sh
Celery is the task scheduler that drives the flow of data through the various tasks and jobs:
py -m celery --app celeryBroker worker --loglevel=INFO -B -s ./data/beat.schedule
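For context, the celeryBroker app referenced by --app might be configured along the following lines. The broker URL, task bodies, and the beat schedule entry are assumptions used only to show how the -B (embedded beat) and -s (schedule file) flags fit in.

# Hypothetical sketch of celeryBroker.py; broker URL and task names are assumptions.
from celery import Celery
from celery.schedules import crontab

app = Celery("celeryBroker", broker="redis://localhost:6379/0")  # assumed broker

@app.task
def intake_job(article):
    # Clean and persist one inbound article (details live in the processing layer).
    ...

@app.task
def report_job(group=None):
    # Emit a generic report for the given group.
    ...

# -B starts an embedded beat scheduler; -s points its state file at ./data/beat.schedule.
app.conf.beat_schedule = {
    "nightly-report": {
        "task": "celeryBroker.report_job",  # assumed task path
        "schedule": crontab(hour=0, minute=0),
    },
}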
The inbound and outbound APIs are each started from their respective .py files; however, it is recommended to run the bash script above to start both.
This script will pass dummy data into the API:
py testinboundapi.py
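A script like this plausibly just POSTs a dummy article to the inbound API. The sketch below is an assumption: the URL, port, and field names are guesses based on the articles schema further down, not the actual test script.

# Hypothetical sketch of sending dummy data; endpoint and fields are assumptions.
import requests

dummy_article = {
    "title": "Test article",
    "author": "Test author",
    "project": "None",
    "content": "Lorem ipsum",
    "url": "https://example.com/test",
}
response = requests.post("http://localhost:5000/articles", json=dummy_article)
print(response.status_code, response.text)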
This script will activate the pipeline queue, processing JSON data from ./data/stash:
py testscheduler.py
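Activating the queue over the stash could be as simple as walking ./data/stash and enqueueing each JSON file; the task name and file layout in this sketch are assumptions.

# Hypothetical sketch: push every stashed JSON article onto the Celery queue.
import json
from pathlib import Path

from celeryBroker import intake_job  # assumed task name

for path in Path("./data/stash").glob("*.json"):
    with path.open() as fh:
        article = json.load(fh)
    intake_job.delay(article)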
This script will generate a request for data against the 'None' default group and output a CSV file in the local folder:
py testreport.py
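Under the hood, a report against the 'None' default group plausibly reduces to a query over the articles table plus a CSV dump, roughly as below. The database filename, output filename, and the mapping of "group" onto the project column are assumptions.

# Hypothetical sketch of the report step; filenames and the filter column are assumptions.
import csv
import sqlite3

conn = sqlite3.connect("articles.db")
rows = conn.execute(
    "SELECT id, title, author, url, word_count FROM articles WHERE project = ?",
    ("None",),
).fetchall()

with open("report.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["id", "title", "author", "url", "word_count"])
    writer.writerows(rows)

conn.close()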
SQL Debugging:
SQLite CLI tool
sqlite3 <dbname>
DB Browser for SQLite
sqlitebrowser
Technologies and Libraries: Python, Celery, SQLite
TODO: Wrap up dependencies within Docker
Schema for the SQLite database table 'articles':
"id" INTEGER NOT NULL UNIQUE,
"title" TEXT NOT NULL,
"author" TEXT NOT NULL,
"project" TEXT NOT NULL,
"date_published" TEXT,
"lead_image_url" TEXT,
"content" TEXT NOT NULL,
"next_page_url" TEXT,
"url" TEXT NOT NULL,
"domain" TEXT,
"excerpt" TEXT,
"word_count" INTEGER,
"direction" TEXT,
"total_pages" INTEGER,
"rendered_pages" TEXT,
"keywords" TEXT,
PRIMARY KEY("id" AUTOINCREMENT)
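A DAO writing into this table might look roughly like the following. The class name, method name, and database filename are assumptions; only the column list comes from the schema above.

# Hypothetical DAO sketch; names are assumptions, columns match the schema above.
import sqlite3

class ArticleDAO:
    def __init__(self, db_path="articles.db"):  # assumed database filename
        self.conn = sqlite3.connect(db_path)

    def insert(self, article: dict) -> int:
        # "id" is omitted so the AUTOINCREMENT primary key assigns it.
        columns = (
            "title", "author", "project", "date_published", "lead_image_url",
            "content", "next_page_url", "url", "domain", "excerpt",
            "word_count", "direction", "total_pages", "rendered_pages", "keywords",
        )
        placeholders = ", ".join("?" for _ in columns)
        cursor = self.conn.execute(
            f"INSERT INTO articles ({', '.join(columns)}) VALUES ({placeholders})",
            tuple(article.get(col) for col in columns),
        )
        self.conn.commit()
        return cursor.lastrowid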