ploomber / sql

https://ploomber-sql.readthedocs.io
Apache License 2.0

Identify data sources (API or raw GitHub data sources) that are updated periodically #62

Closed lfunderburk closed 1 year ago

lfunderburk commented 1 year ago

Try to identify a dataset that meets:

  1. [Strong requirement] Tabular data with a combination of numerical and categorical information
  2. [Optional but would be really nice] Look for larger datasets (at least 200,000 entries)
  3. [Strongly recommended, but if you can't find one, no worries; lower frequency works too] Look for datasets that are updated regularly (biweekly, or ideally weekly)
  4. [Strong requirement] Datasets that are open - look into their licenses, the dataset should be free and can be used for educational purposes
lfunderburk commented 1 year ago

Context:

The goal of this is to design an ETL process with the dataset that we can use for the packaging, dashboard and deployment components of this course.

packaging-content

TegveerG commented 1 year ago

I have been working extensively with a blockchain dataset from DEFIYIELD, which periodically updates its REKT Database whenever a scam occurs on any blockchain platform. The scams are investigated by DEFIYIELD engineers for accuracy before the respective record is added. They have an API that nominally asks users to get API keys, but I have used it without bearer/API tokens without any fuss. So far, the database has 3,427 rows and 35 columns. The raw dataset is returned as JSON. I also have an ipynb that cleans the data and keeps 11 columns for analysis. There are about 920 NaN values in the date column, which I replaced with "Unknown".

There are two numerical columns corresponding to funds lost and funds returned. I also extracted the month, day of week, and day of year to create more numerical columns (although it can be argued that month and day of week are categorical). Other columns are categorical and correspond to the platform name, exchange type, network type, and scam type.

Link to repo: https://github.com/TegveerG/ETL-Data-Blockchain
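
For reference, here's a minimal sketch of the extract-and-clean step described above. The endpoint URL and column names are placeholders (assumptions), not the actual DEFIYIELD API schema; the real script and notebook are in the repo linked above.

```python
import pandas as pd
import requests

# Placeholder endpoint (assumption): the real DEFIYIELD REKT Database URL
# and its query parameters live in the repo linked above.
REKT_API_URL = "https://example.com/rekt-database"

def extract_rekt() -> pd.DataFrame:
    """Fetch the raw JSON response and flatten it into a DataFrame."""
    response = requests.get(REKT_API_URL, timeout=30)
    response.raise_for_status()
    return pd.json_normalize(response.json())

def clean_rekt(raw: pd.DataFrame) -> pd.DataFrame:
    """Keep a subset of columns and fill missing dates, as described above."""
    # Hypothetical column names -- the real ones are in the cleaning notebook.
    keep = ["project_name", "scam_type", "network", "date",
            "funds_lost", "funds_returned"]
    df = raw[[c for c in keep if c in raw.columns]].copy()
    # ~920 records have no date; replace NaN with the string "Unknown".
    if "date" in df.columns:
        df["date"] = df["date"].fillna("Unknown")
    return df
```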

lfunderburk commented 1 year ago

Candidate two: Vehicle fuel consumption dataset

Link: https://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64
License (open and well defined): https://open.canada.ca/en/open-government-licence-canada
Content: information about CO2 emissions and fuel consumption for three kinds of vehicles (fuel-based, hybrid, and electric)
Updated: once a month
Number of records as of today: ~27,500

Note: there are lots of open datasets on that page, usually with thousands of records.

Script for data download: below is a data-download and data-cleaning script I built to parse and handle the records. We could repurpose it for the ETL pipeline.

https://github.com/lfunderburk/fuel-electric-hybrid-vehicle-ml/blob/main/src/data/data_extraction.py
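
As a rough illustration of what that script does, a minimal sketch is below. The CSV URL is a placeholder (assumption); the actual per-vehicle-type download links are resolved in the linked data_extraction.py.

```python
from pathlib import Path

import pandas as pd

# Placeholder URL (assumption): the real CSV links come from the
# open.canada.ca dataset page and are handled in data_extraction.py.
FUEL_CSV_URL = "https://example.com/fuel-consumption-ratings.csv"

def download_fuel_data(url: str = FUEL_CSV_URL,
                       out_dir: str = "data/raw") -> pd.DataFrame:
    """Download one fuel-consumption CSV and cache it locally."""
    df = pd.read_csv(url)
    # Normalize column names so downstream SQL/dashboard code is predictable.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    df.to_csv(Path(out_dir) / "fuel_consumption.csv", index=False)
    return df
```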

jpjon commented 1 year ago

I like the vehicle fuel consumption dataset because it requires very little domain knowledge. It also seems that requests don't need an API key, at least for the foreseeable future.

TegveerG commented 1 year ago

Yeah, agreed @jpjon, I like the emissions dataset as well; it would be easier to get up to speed with. Although I could only see that they update it "as needed", maybe @lfunderburk, since you have worked with it, would know about the monthly periodicity. Another data source I thought of was NOAA Daily Summaries, which we could collect for NYC, for example. It's not the best option because it only has binary categorical variables. It also requires an API key, and I have the script for it, but the emissions data seems like the best of the three discussed.
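
For completeness, a minimal sketch of pulling NOAA Daily Summaries (GHCND) through NOAA's Climate Data Online v2 API, which requires a free token; the station ID (assumed here to be NYC Central Park) and the date window are illustrative assumptions.

```python
import requests

# NOAA Climate Data Online v2 API; a free token can be requested from NOAA.
CDO_URL = "https://www.ncdc.noaa.gov/cdo-web/api/v2/data"

def fetch_daily_summaries(token: str,
                          station: str = "GHCND:USW00094728") -> list:
    """Fetch one month of GHCND daily summaries for one station."""
    params = {
        "datasetid": "GHCND",       # Daily Summaries dataset
        "stationid": station,       # assumed: NYC Central Park
        "startdate": "2023-01-01",  # example window
        "enddate": "2023-01-31",
        "limit": 1000,
        "units": "metric",
    }
    response = requests.get(CDO_URL, params=params,
                            headers={"token": token}, timeout=30)
    response.raise_for_status()
    return response.json().get("results", [])
```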

lfunderburk commented 1 year ago

OK, closing this issue. We'll be working with the CO2 emissions dataset.