Problem definition

Given a CSV file, load its contents into a remote PostgreSQL instance.

The CSV should have the schema (title, article, publication date, source, query). Attach a unique ID, article_ID, to each row when inserting the data.

The data should be sanitized beforehand. Any row that is missing title, article or publication date should be dropped. If source or query is blank, it should be stored as NULL. Any article that has already been inserted should be dropped; use title, date and source as the key.

Proposed approach

Using Python, load in data using Pandas (pd.read_csv()).

Iterate through each row in the dataframe, clean the data, and insert it after checking that it is not a duplicate.
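The cleaning rules from the problem definition could be sketched roughly as below. This is a minimal sketch, not the final implementation: the column names (in particular publication_date) and the helper name clean_articles are assumptions for illustration, and it uses vectorized pandas calls instead of a row-by-row loop. It only removes duplicates within the CSV itself; the check against rows already in the database would still happen at insert time.

```python
import io

import pandas as pd

# Assumed column names; the CSV header may spell these differently.
REQUIRED = ["title", "article", "publication_date"]
KEY = ["title", "publication_date", "source"]


def clean_articles(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the sanitization rules to a raw dataframe."""
    # Treat empty or whitespace-only strings as missing values.
    df = df.replace(r"^\s*$", float("nan"), regex=True)
    # Drop rows that lack a title, article body or publication date.
    df = df.dropna(subset=REQUIRED)
    # Within the file, keep only the first row for each (title, date, source).
    df = df.drop_duplicates(subset=KEY, keep="first")
    # Blank source/query stay as missing values and can become NULL on insert.
    return df.reset_index(drop=True)


if __name__ == "__main__":
    sample = io.StringIO(
        "title,article,publication_date,source,query\n"
        "A,body a,2020-01-01,BBC,q1\n"
        "A,body a again,2020-01-01,BBC,q2\n"
        ",body b,2020-01-02,CNN,q3\n"
        "C,body c,2020-01-03,,\n"
    )
    print(clean_articles(pd.read_csv(sample)))
```

In the sample, the second row is dropped as a duplicate of the first, the third is dropped for its missing title, and the fourth is kept with source and query left missing.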

Write a SQL statement to insert the tuple into the database.

Two functions:

insert_tuple(tuple): ... code to insert tuple into db goes here ...

read_data(): ... code to return a dataframe with all the values in the db goes here ...
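One possible shape for these two functions is sketched below, under several assumptions that are not part of the original proposal: the table is called articles, article_id is an identity column that PostgreSQL fills in, the connection is a DB-API connection such as the one returned by psycopg2.connect(), and the parameter is named row rather than tuple to avoid shadowing the Python builtin. The duplicate check uses IS NOT DISTINCT FROM so that two rows with a NULL source still compare as equal.

```python
import pandas as pd

# Assumed table layout; article_id is generated by the database.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS articles (
    article_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    title TEXT NOT NULL,
    article TEXT NOT NULL,
    publication_date DATE NOT NULL,
    source TEXT,
    query TEXT
)
"""

# Insert only if no row with the same (title, date, source) key exists.
INSERT_SQL = """
INSERT INTO articles (title, article, publication_date, source, query)
SELECT %(title)s, %(article)s, %(publication_date)s, %(source)s, %(query)s
WHERE NOT EXISTS (
    SELECT 1 FROM articles
    WHERE title = %(title)s
      AND publication_date = %(publication_date)s
      AND source IS NOT DISTINCT FROM %(source)s
)
"""

SELECT_SQL = "SELECT * FROM articles"
COLUMNS = ["title", "article", "publication_date", "source", "query"]


def insert_tuple(conn, row):
    """Insert one cleaned row; return True if a new row was written."""
    # Map pandas missing values (NaN/NA) to None so they are sent as NULL.
    params = {c: (None if pd.isna(row[c]) else row[c]) for c in COLUMNS}
    with conn.cursor() as cur:
        cur.execute(INSERT_SQL, params)
        inserted = cur.rowcount == 1
    conn.commit()
    return inserted


def read_data(conn):
    """Return every row of the articles table as a dataframe."""
    with conn.cursor() as cur:
        cur.execute(SELECT_SQL)
        columns = [desc[0] for desc in cur.description]
        rows = cur.fetchall()
    return pd.DataFrame(rows, columns=columns)
```

With psycopg2, conn would come from psycopg2.connect(...) using the remote instance's credentials, and insert_tuple could be called once per row of the cleaned dataframe, for example over df.to_dict("records"). Committing after every row keeps the sketch simple; batching several rows per commit would likely be faster for large files.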

samirchowd / NewsChain

Database.py #3
