Speed up data pipeline runs and conserve network resources by caching external data sources

Context and Problem Statement

As designed, the data pipeline downloads and unpacks all external data sources on every pipeline run. The original developers expected the pipeline's data sources, and consequently the score, to change often. In practice the opposite is true: there is a desire for the score to remain stable until new external data sources are vetted and incorporated into the product.

However, the code contains no mechanism to cache and reuse external data sources.

Decision Drivers

The external data sources total approximately 5.6 GB and are downloaded again on every pipeline run.
A nontrivial amount of each pipeline run is spent downloading those data sources, which hurts developer productivity.
Some data sources, such as the Census Bureau, throttle the number of requests allowed over a time period; running the pipeline multiple times per day can get developers banned from making calls for 24 hours.
Because data sources may change between runs, it is impossible to tell whether a change in the score was caused by a code change or by a change in a data source.

Considered Options

Maintaining the status quo
Aggressively caching data sources by default with each run
Caching data sources in a well-known location inside the project when a flag is sent to application.py

Decision Outcome

We picked "Caching data sources in a well-known location inside the project when a flag is sent to application.py", because this option allows us to speed up our pipeline, reduce the required bandwidth, and maintain consistency between runs. By requiring a flag to turn on caching, this option has the aforementioned benefits without fundamentally changing the way the application works. We therefore won't have to update any GitHub Actions or change the way the build works; it does, however, give us the freedom to take advantage of the cache more broadly by adding the data sources to source control in the future (should we choose to do so).
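The chosen option can be illustrated with a minimal sketch. It is not the project's implementation: the function name `get_data_source`, the `use_cache` parameter (standing in for the flag passed to application.py), and the `cached_sources` directory are all hypothetical names chosen for illustration. The download itself is passed in as a callable so the example runs without network access.

```python
import tempfile
from pathlib import Path
from typing import Callable


def get_data_source(
    name: str,
    fetch: Callable[[], bytes],
    cache_dir: Path,
    use_cache: bool = False,
) -> bytes:
    """Return the bytes of a data source, reusing a cached copy only when asked.

    All names here are hypothetical. `use_cache` plays the role of the flag;
    when it is False the behavior matches the status quo (always download).
    """
    cached_file = cache_dir / name
    if use_cache and cached_file.exists():
        # Cache hit: skip the download entirely.
        return cached_file.read_bytes()

    data = fetch()
    if use_cache:
        # Store the download in the well-known location for later runs.
        cache_dir.mkdir(parents=True, exist_ok=True)
        cached_file.write_bytes(data)
    return data


# Demonstration with a fake downloader that counts how often it is called.
calls = []


def fake_fetch() -> bytes:
    calls.append(1)
    return b"census-data"


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cached_sources"

    # Without the flag, every call downloads again.
    get_data_source("census.csv", fake_fetch, cache)
    get_data_source("census.csv", fake_fetch, cache)
    downloads_without_flag = len(calls)

    # With the flag, the first call downloads and stores; the second reuses.
    get_data_source("census.csv", fake_fetch, cache, use_cache=True)
    get_data_source("census.csv", fake_fetch, cache, use_cache=True)
    downloads_total = len(calls)
```

Because caching defaults to off in this sketch, callers that never pass the flag, such as an existing CI workflow, would see no change in behavior, which mirrors the reasoning in the decision above.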

ADR: Add ability to cache the data pipeline's external data sources #2168
