usds / justice40-tool

A tool to identify disadvantaged communities due to environmental, socioeconomic and health burdens
https://screeningtool.geoplatform.gov/
Creative Commons Zero v1.0 Universal

ADR: Add ability to cache the data pipeline's external data sources #2168

Closed travis-newby closed 1 year ago

travis-newby commented 1 year ago

Speed up the data pipeline and conserve network resources by caching external data sources

Context and Problem Statement

As designed, the data pipeline downloads and unpacks all external data sources on every pipeline run. The original developers built it this way because they believed that the pipeline's data sources, and subsequently the score, would change often. That is not the case. In fact, the opposite is true: there is a desire for the score to remain stable until new external data sources are vetted and incorporated into the product.

Despite this, the code contains no mechanism to cache and reuse external data sources.

Decision Drivers

Considered Options

Decision Outcome

We picked "Caching data sources in a well-known location inside the project when a flag is sent to application.py", because this option allows us to speed up our pipeline, reduce the required bandwidth, and maintain consistency between runs. Because caching is only turned on by a flag, we get these benefits without fundamentally changing the way the application works, so we won't have to update any GitHub Actions or change the way the build works. The option also leaves us free to take advantage of the cache more broadly in the future by adding the data sources to source control (should we choose to do so).
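
To make the chosen option concrete, here is a minimal sketch of opt-in caching. It is not the pipeline's actual code: `fetch_source`, `download_bytes`, the `use_cache` parameter, and the hashed file naming are all hypothetical names and choices made for illustration. The real flag name and cache location in application.py may differ.

```python
import hashlib
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def download_bytes(url: str) -> bytes:
    """Fetch the raw bytes at `url` (hypothetical stand-in for the real downloader)."""
    with urlopen(url) as response:
        return response.read()


def fetch_source(
    url: str,
    cache_dir: Path,
    use_cache: bool = False,
    downloader: Callable[[str], bytes] = download_bytes,
) -> bytes:
    """Return an external data source, reusing a cached copy only when opted in."""
    # Derive a stable file name from the URL so each source maps to one cache entry.
    # (Assumption: hashing the URL; the project could just as well use readable names.)
    cache_path = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()

    # The cache is consulted only when the flag is set, so default behavior is unchanged.
    if use_cache and cache_path.exists():
        return cache_path.read_bytes()

    data = downloader(url)

    # Populate the cache for later runs, again only when the flag is set.
    if use_cache:
        cache_dir.mkdir(parents=True, exist_ok=True)
        cache_path.write_bytes(data)
    return data
```

With `use_cache=False` (the default in this sketch) every call downloads again, which mirrors the current behavior. With `use_cache=True`, the first run fills the cache directory and later runs read from it, which is also what would make committing that directory to source control possible later.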