sfbrigade / datasci-earthquake

MIT License

Data Cleaning Pipeline #12

Open oscarsyu opened 6 days ago

oscarsyu commented 6 days ago

Context

We hope to ingest our datasets automatically from their sources (when possible and appropriate). This task covers data quality validation: identifying existing issues and handling any that may come up in the future. The owner of this task is responsible for creating a process that validates the data and corrects errors, so that a future automated call to a data source yields usable, high-quality data to display to users. Datasets are here

Additional Info here

Definition of Done

Engineering Details
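
As a starting point for discussion, here is a minimal sketch of what a validate-and-correct step could look like. Everything in it is an assumption rather than a description of our actual datasets: the field names (`address`, `latitude`, `longitude`), the rough San Francisco bounding box, and the function names are all hypothetical. The idea is to fix what can be fixed safely (whitespace, casing, numeric strings) and to set aside rows that cannot be fixed, along with the reason, instead of dropping them silently.

```python
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical bounds: a rough bounding box around San Francisco.
LAT_RANGE = (37.70, 37.84)
LON_RANGE = (-122.52, -122.35)


def to_float(value: Any) -> Optional[float]:
    """Return value as a float, or None if it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def clean_record(raw: Dict[str, Any]) -> Tuple[Optional[Dict[str, Any]], List[str]]:
    """Correct what can be corrected and report what cannot.

    Returns (cleaned_record, errors); cleaned_record is None if any error was found.
    """
    errors: List[str] = []

    # Correction: collapse repeated whitespace and normalize casing.
    address = " ".join(str(raw.get("address") or "").split()).upper()
    if not address:
        errors.append("missing address")

    # Correction: accept numeric strings; validation: enforce the bounding box.
    lat = to_float(raw.get("latitude"))
    lon = to_float(raw.get("longitude"))
    if lat is None or not LAT_RANGE[0] <= lat <= LAT_RANGE[1]:
        errors.append("latitude missing or out of range")
    if lon is None or not LON_RANGE[0] <= lon <= LON_RANGE[1]:
        errors.append("longitude missing or out of range")

    if errors:
        return None, errors
    return {"address": address, "latitude": lat, "longitude": lon}, []


def clean_dataset(rows: List[Dict[str, Any]]):
    """Split rows into cleaned records and (original_row, errors) rejects."""
    cleaned, rejected = [], []
    for row in rows:
        record, errors = clean_record(row)
        if record is None:
            rejected.append((row, errors))
        else:
            cleaned.append(record)
    return cleaned, rejected
```

Keeping the rejected rows together with their error messages means a future automated ingest can log or alert on them, and only the cleaned list is passed on for display.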

mackcooper1408 commented 16 hours ago

This may need to be split into a research issue and an implementation issue: the first lays out exactly what is dirty, and the second lays out the code to fix it.

I also think we may need to build the ETL pipeline first... or maybe make the research half of this part of building out the ETL pipeline? Otherwise there's just nowhere to put this code...
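
For the research half, a small standalone profiling script might be enough to get started before the ETL pipeline exists. The sketch below is only an illustration: `profile_rows` and the notion of "required fields" are hypothetical, and it assumes each dataset has already been loaded as a list of dicts. It counts missing or blank values per field and exact duplicate rows, which gives a first picture of what is dirty.

```python
from collections import Counter
from typing import Any, Dict, List


def profile_rows(rows: List[Dict[str, Any]], required_fields: List[str]) -> Dict[str, Any]:
    """Summarize missing values per required field and exact duplicate rows."""
    missing: Counter = Counter()
    seen = set()
    duplicates = 0

    for row in rows:
        # A value counts as missing if it is absent, None, or blank.
        for field in required_fields:
            value = row.get(field)
            if value is None or str(value).strip() == "":
                missing[field] += 1

        # Build a hashable, order-independent key to detect exact duplicates.
        key = tuple(sorted((k, str(v)) for k, v in row.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)

    return {"rows": len(rows), "missing": dict(missing), "duplicates": duplicates}
```

The output of a script like this could go straight into the research issue as the list of problems the implementation issue then has to fix.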