Rearranged some code so that when the date changes, only the most recent chunk of data is updated, not the entire dataset. Rather than keeping the up-to-date daily dataset as a target called `daily`, the data is written to disk using `arrow::write_dataset()`. The data is partitioned by year so that when the pipeline pulls down new data from the API, only the file holding the most recent year of data is overwritten. This idea could be extended to partitioning by year and month to do even less overwriting.
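A minimal sketch of the partitioned write (not the actual pipeline code; the column name, helper name, and output path are assumptions):

```r
# Write the daily dataset as a year-partitioned Parquet dataset so a refresh
# only rewrites the partition for the current year.
library(arrow)
library(dplyr)

write_daily_dataset <- function(daily, path = "data/daily") {
  daily |>
    mutate(year = lubridate::year(datetime)) |>  # assumed date column
    group_by(year) |>                            # group vars become partitions
    write_dataset(path = path, format = "parquet")
  path
}
```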
I've also added documentation to some of the functions and renamed targets:

- `legacy_` now refers to historical data scraped from the AZMet website that is not available through the API.
- `past_` refers to data through October 2022. Joining the legacy data to some more recent data was my way of ensuring the old and new data are harmonized.
- `db_daily` targets are just pointers to `/data/daily/`, where the partitioned dataset gets written (see the sketch after this list).
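One way the file-pointer target could look in `_targets.R`; this is a hedged sketch, and the target names, helper functions, and column names are assumptions rather than the pipeline's actual code:

```r
library(targets)

list(
  # Rewrites only the current-year partition, then returns the dataset
  # directory so targets can track it with format = "file".
  tar_target(
    db_daily,
    write_daily_dataset(daily_data, path = "data/daily"),
    format = "file"
  ),
  # Downstream targets open the partitioned dataset lazily with arrow.
  tar_target(
    daily_summary,
    arrow::open_dataset(db_daily) |>
      dplyr::count(meta_station_id) |>  # assumed column name
      dplyr::collect()
  )
)
```

Using `format = "file"` means downstream targets only rebuild when the files under the dataset directory actually change, which fits the goal of touching just the most recent partition.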