ploomber / sql

https://ploomber-sql.readthedocs.io
Apache License 2.0

Begin ETL process #63

Closed: lfunderburk closed this 1 year ago

lfunderburk commented 1 year ago

packaging-content

This script currently extracts data from multiple sources, cleans it, and saves it to a CSV file:

https://github.com/lfunderburk/fuel-electric-hybrid-vehicle-ml/blob/main/src/data/data_extraction.py

Goal 1

Modify this script so that it only extracts, cleans, and saves data from the CO2 vehicle emissions data source.

Link: https://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64
License (open and well defined): https://open.canada.ca/en/open-government-licence-canada

Goal 2

Start identifying how the script works:

  1. What dependencies are needed?
  2. What folder structure does it assume?
  3. What files does it obtain?
  4. Suggest methods to improve its workflow.

Goal 3

Modify the script and incorporate jupysql to save the clean data into a database (SQLite, DuckDB).

lfunderburk commented 1 year ago

This is blocked by https://github.com/ploomber/sql/issues/64

Please hold off on contributing until it is resolved. In the meantime, you can start working towards Goals 1 and 2.

lfunderburk commented 1 year ago

Added directory structure and script to main

You can execute it from sql/pipeline

via the command

python src/datadownload.py

jpjon commented 1 year ago

For Goal 1, do you mean adjusting the data download script to only output something like 1995_today_vehicle_fuel_consumption.csv?

In that case, should we ignore the electric and hybrid data sets, download all the other files called Fuel Consumption Ratings, and concatenate/clean them?

lfunderburk commented 1 year ago

1995_today_vehicle_fuel_consumption.csv

has all the merged and cleaned files (Fuel Consumption Ratings).

The processed folder has the cleaned data for the three kinds of vehicles.

lfunderburk commented 1 year ago

The task could be to decide:

  1. Do we want to merge hybrid, electric, and fuel-only into a single table?
  2. Do we want three tables (one for each)?

What are your thoughts?

jpjon commented 1 year ago

I agree with point 2. I guess my question is that, for Goal 1, it seems we are already doing that with the datadownload.py script.

The script only references this data source: https://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64

Can you help me understand what needs to be modified in the script?

lfunderburk commented 1 year ago

Instead of saving three CSV files, ideally we get it to load tables into a database.

We could still have a file for data download that saves to CSV, but then we'd need a Jupyter notebook for creating and loading the tables.

Ideally, we have one script.

lfunderburk commented 1 year ago

The idea is that we will be able to schedule this on an EC2 instance with a Docker container, so the cleaner the setup is, the better.

jpjon commented 1 year ago

I see, got it. I'll look into how the script can load the data into three separate DuckDB tables.

lfunderburk commented 1 year ago

Thank you. I've assigned you on Asana.