usds / justice40-tool

A tool to identify disadvantaged communities due to environmental, socioeconomic and health burdens
https://screeningtool.geoplatform.gov/
Creative Commons Zero v1.0 Universal

Add ability to cache ETL data sources #2169

Closed travis-newby closed 1 year ago

travis-newby commented 1 year ago

This change adds data set caching to the ETL pipeline, saving the downloading and unpacking of over 5.6GB of data per pipeline run! (See: related ADR.)

Usage

Callers of certain commands in application.py now have the option to use cached data for ETLs. Simply pass --use-cache or -u, and the system will use cached data if it exists. Those commands are:

  1. census-data-download
  2. etl-run
  3. score-full-run
  4. data-full-run

For example, poetry run python3 data_pipeline/application.py etl-run -u.

Callers can also: pre-fetch the data sources by calling application.py with extract-data-sources; clear cached data sources by calling application.py with clear-data-source-cache; and print the data sources by calling application.py with print-data-sources.
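
For instance, using the commands named above:

```sh
# Pre-fetch all data sources into the cache
poetry run python3 data_pipeline/application.py extract-data-sources

# Run the ETLs against the cached data
poetry run python3 data_pipeline/application.py etl-run --use-cache

# Inspect or reset the cache
poetry run python3 data_pipeline/application.py print-data-sources
poetry run python3 data_pipeline/application.py clear-data-source-cache
```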

Cached files are stored in a new folder called data_pipeline/data/sources. This involved a fair bit of standardization work in the ETLs, because not all files were stored consistently or in a single location prior to this change (and some files were used directly from memory after being downloaded).

The caching algorithm is pretty simple: if there's something in the cache folder for an ETL, the system tries to use it. If not, the ETL will likely fail. To refresh an ETL's data sources, all you have to do is call extract-data-sources for that ETL without the caching option.
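
In rough pseudocode, that rule looks something like this (a minimal sketch; the function name and signature are illustrative, not the project's actual code):

```python
from pathlib import Path
from typing import Callable


def resolve_source(
    destination: Path, download: Callable[[Path], None], use_cache: bool
) -> Path:
    """Minimal sketch of the caching rule described above (illustrative)."""
    if use_cache:
        if destination.exists():
            return destination  # something is cached for this ETL: use it
        # Nothing cached: the run will likely fail, so refresh the cache
        # first via extract-data-sources without the caching option.
        raise FileNotFoundError(f"No cached data source at {destination}")
    download(destination)  # caching off: always fetch fresh
    return destination
```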

Why not make caching the default?

I don't want this change to impact the GitHub Actions or anyone's existing workflow. I do think this may be an incremental step toward caching more broadly, but those iterations should be done as separate PRs. This one is already big enough.

We may decide, in the future, to check the data_pipeline/data/sources folder into source control, or to allow the folder to be downloaded in zipped form from S3. That would give us consistency between pipeline runs, and it would allow us to vet and incorporate new data sources while understanding their impact on the score. In that case, caching would likely be turned on by default. (If you can't tell, this would be my ultimate recommendation, with this PR simply introducing the caching mechanism so it can be put to better use later.)

But what about the results of the pipeline?

I think we're good! I took the time to diff the results of main against the results of this branch, comparing the pipeline's output files.

I don't have the ability to diff binary files, so a few of those were skipped. But the important stuff, like the CSV files and the tiles themselves, is identical.

And the tests? Bet you forgot about those...

I didn't forget! Previously, because there was no concept of caching, tests used a variety of directories to store their temporary contents. Now, as in the main application, we've standardized on a location for mocked test downloads (you can find it under data_pipeline/data/tmp/tests). Due to the way the tests work (magic constants!), this meant mocking the get_sources_path call in the ETL base class and modifying a few tests to use that path. I believe I got all this right, and the tests run and pass with everything ending up in the right place, but I would love a second set of eyes.
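
For illustration, the mocking looks roughly like this (hypothetical test code: only get_sources_path and the tests directory come from this change; I'm assuming it's a plain instance method and using pytest's standard monkeypatch and tmp_path fixtures):

```python
# Hypothetical sketch: redirect an ETL's source path in a test so that
# downloads land in a temporary directory instead of the real cache.
from data_pipeline.etl.base import ExtractTransformLoad


def test_extract_uses_mocked_sources_path(monkeypatch, tmp_path):
    mocked = tmp_path / "tests"  # stand-in for data_pipeline/data/tmp/tests
    monkeypatch.setattr(
        ExtractTransformLoad, "get_sources_path", lambda self: mocked
    )
    # ...run the ETL under test; downloads now land under `mocked`...
```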

Any other comments on the way the code is structured?

Yep! To make it easier to review, here are a few things to keep in mind.

  1. I introduced a new concept to the ETLs called "data sources." Each ExtractTransformLoad subclass supplies a list of data sources it requires to run by implementing the get_data_sources method. The superclass implementation of extract automatically downloads those data sources, and child classes can call super().extract(...) to inherit that behavior. (There's a sketch of this pattern after this list.)
  2. Data sources are implemented in data_pipeline/etl/datasource.py and come in three flavors: a FileDataSource used to download single files, a ZIPDataSource used to download and extract zip files, and a CensusDataSource that calls the Census API. Each data source knows how to retrieve its own content.
  3. I also created a class called Downloader to isolate the downloading behavior for ZIP and regular files.
  4. I tried to standardize the work performed in the extract methods. It was not at all consistent before: some ETLs performed almost no work in extract, while others performed most of their work there. I'm not sure the balance is perfect yet, but it's better.
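
For reference, here's roughly what the new pattern looks like. This is an illustrative sketch, not actual project code: the class and method names (ExtractTransformLoad, get_data_sources, extract, get_sources_path, FileDataSource, ZIPDataSource) come from the description above, but the module paths, the DataSource base type, the constructor arguments, and the URLs are assumptions.

```python
# Illustrative sketch only: module paths, constructor arguments, and URLs
# are assumptions; class/method names follow this PR's description.
from data_pipeline.etl.base import ExtractTransformLoad
from data_pipeline.etl.datasource import DataSource, FileDataSource, ZIPDataSource


class ExampleETL(ExtractTransformLoad):
    def get_data_sources(self) -> list[DataSource]:
        # Declare every remote file this ETL needs; the superclass's
        # extract downloads them (or reuses the cache when caching is on).
        return [
            FileDataSource(
                source="https://example.gov/input.csv",  # hypothetical URL
                destination=self.get_sources_path() / "input.csv",
            ),
            ZIPDataSource(
                source="https://example.gov/shapes.zip",  # hypothetical URL
                destination=self.get_sources_path() / "shapes",
            ),
        ]

    def extract(self) -> None:
        super().extract()  # inherit the standardized download/cache behavior
        # ...ETL-specific extraction work goes here...
```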

Thanks for making it this far. 🚀

github-actions[bot] commented 1 year ago

Score Deployed! Find it here:

github-actions[bot] commented 1 year ago

Map Deployed!
Map with Staging Backend: https://screeningtool.geoplatform.gov/en?flags=stage_hash=2169/cb3ee53a2f8ea3d0a9c7ce7ffd2462ccc54e450e
Find tiles here: https://justice40-data.s3.amazonaws.com/data-pipeline-staging/2169/cb3ee53a2f8ea3d0a9c7ce7ffd2462ccc54e450e/data/score/tiles

github-actions[bot] commented 1 year ago

Score Deployed! Find it here:

github-actions[bot] commented 1 year ago

Map Deployed!
Map with Staging Backend: https://screeningtool.geoplatform.gov/en?flags=stage_hash=2169/d4e3b3090048745b8685da9e3173344b654bcebc
Find tiles here: https://justice40-data.s3.amazonaws.com/data-pipeline-staging/2169/d4e3b3090048745b8685da9e3173344b654bcebc/data/score/tiles

github-actions[bot] commented 1 year ago

Score Deployed! Find it here:

github-actions[bot] commented 1 year ago

Map Deployed!
Map with Staging Backend: https://screeningtool.geoplatform.gov/en?flags=stage_hash=2169/3f13966d0d67753e9e03dd98d041eaf052720c7a
Find tiles here: https://justice40-data.s3.amazonaws.com/data-pipeline-staging/2169/3f13966d0d67753e9e03dd98d041eaf052720c7a/data/score/tiles

vim-usds commented 1 year ago

I'd like to re-run this with the score-comparator bot :)

travis-newby commented 1 year ago

> I'd like to re-run this with the score-comparator bot :)

Me, too! Just kicked it off!

github-actions[bot] commented 1 year ago

Score Deployed! Find it here:

github-actions[bot] commented 1 year ago

Score Comparison Summary

Hi! I'm the Score Comparator. I compared the score in production (version 1.0) to the locally calculated score. Here are the results.

Columns

I compared the columns. Here's what I found.

Scores

I compared the scores, too. Here's what I found.

travis-newby commented 1 year ago

Looks like this branch has all the same deltas from the 1.0 score as main, which are small variances in the column Percent of the Census tract that is within Tribal areas (percentile).

github-actions[bot] commented 1 year ago

Map Deployed!
Map with Staging Backend: https://screeningtool.geoplatform.gov/en?flags=stage_hash=2169/072861e9b3f38a8ba78a5045a7be32f48728cd70
Find tiles here: https://justice40-data.s3.amazonaws.com/data-pipeline-staging/2169/072861e9b3f38a8ba78a5045a7be32f48728cd70/data/score/tiles

travis-newby commented 1 year ago

Running this with the new comparator one last time, then I'll merge.

github-actions[bot] commented 1 year ago

Score Deployed! Find it here:

github-actions[bot] commented 1 year ago

Map Deployed!
Map with Staging Backend: https://screeningtool.geoplatform.gov/en?flags=stage_hash=2169/c775a4257b80592a48eba33a99c2ddfbbb3a27dc
Find tiles here: https://justice40-data.s3.amazonaws.com/data-pipeline-staging/2169/c775a4257b80592a48eba33a99c2ddfbbb3a27dc/data/score/tiles