Closed · travis-newby closed this 1 year ago
Score Deployed! Find it here:
Map Deployed! Map with Staging Backend: https://screeningtool.geoplatform.gov/en?flags=stage_hash=2169/cb3ee53a2f8ea3d0a9c7ce7ffd2462ccc54e450e Find tiles here: https://justice40-data.s3.amazonaws.com/data-pipeline-staging/2169/cb3ee53a2f8ea3d0a9c7ce7ffd2462ccc54e450e/data/score/tiles
Score Deployed! Find it here:
Map Deployed! Map with Staging Backend: https://screeningtool.geoplatform.gov/en?flags=stage_hash=2169/d4e3b3090048745b8685da9e3173344b654bcebc Find tiles here: https://justice40-data.s3.amazonaws.com/data-pipeline-staging/2169/d4e3b3090048745b8685da9e3173344b654bcebc/data/score/tiles
Score Deployed! Find it here:
Map Deployed! Map with Staging Backend: https://screeningtool.geoplatform.gov/en?flags=stage_hash=2169/3f13966d0d67753e9e03dd98d041eaf052720c7a Find tiles here: https://justice40-data.s3.amazonaws.com/data-pipeline-staging/2169/3f13966d0d67753e9e03dd98d041eaf052720c7a/data/score/tiles
I'd like to re-run this with the score-comparator bot :)
Me, too! Just kicked it off!
Score Deployed! Find it here:
Hi! I'm the Score Comparator. I compared the score in production (version 1.0) to the locally calculated score. Here are the results.
I compared the columns. Here's what I found.
I compared the scores, too. Here's what I found.
Looks like this branch has all the same deltas from the 1.0 score as main, which are small variances in the column Percent of the Census tract that is within Tribal areas (percentile).
Map Deployed! Map with Staging Backend: https://screeningtool.geoplatform.gov/en?flags=stage_hash=2169/072861e9b3f38a8ba78a5045a7be32f48728cd70 Find tiles here: https://justice40-data.s3.amazonaws.com/data-pipeline-staging/2169/072861e9b3f38a8ba78a5045a7be32f48728cd70/data/score/tiles
Running this with the new comparator one last time, then I'll merge.
Score Deployed! Find it here:
Map Deployed! Map with Staging Backend: https://screeningtool.geoplatform.gov/en?flags=stage_hash=2169/c775a4257b80592a48eba33a99c2ddfbbb3a27dc Find tiles here: https://justice40-data.s3.amazonaws.com/data-pipeline-staging/2169/c775a4257b80592a48eba33a99c2ddfbbb3a27dc/data/score/tiles
This change adds dataset caching to the ETL pipeline, saving the downloading and unpacking of over 5.6 GB of data per pipeline run! (See: related ADR.)
## Usage

Callers of certain features in `application.py` now have the option to use cached data for ETLs. Simply pass `--use-cache` or `-u`, and the system will use cached data if it exists. Those features include:

- `census-data-download`
- `etl-run`
- `score-full-run`
- `data-full-run`

For example: `poetry run python3 data_pipeline/application.py etl-run -u`
Callers can also:

- pre-fetch the data sources by calling `application.py` with `extract-data-sources`;
- clear cached data sources by calling `application.py` with `clear-data-source-cache`; and
- print the data sources by calling `application.py` with `print-data-sources`.

Cached files are stored in a new folder, `data_pipeline/data/sources`. This involved a fair bit of standardization work in the ETLs, because not all files were stored consistently or in a single location prior to this change (and some files were used directly from memory after being downloaded).
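As a rough illustration, the lookup against that folder might work like the sketch below. This is a hypothetical helper, not the PR's actual code; the function name, its signature, and the exact hit/miss logic are assumptions based on the description in this PR.

```python
from pathlib import Path


def extract_with_cache(etl_name, use_cache, sources_root, download_fn):
    """Hypothetical sketch of the cache lookup described in this PR.

    If caching is on and this ETL's folder under data_pipeline/data/sources
    already has content, reuse it; otherwise download into that folder.
    """
    sources_path = Path(sources_root) / etl_name  # e.g. data_pipeline/data/sources/<etl>
    if use_cache and sources_path.is_dir() and any(sources_path.iterdir()):
        return sources_path  # cache hit: skip the download entirely
    sources_path.mkdir(parents=True, exist_ok=True)
    download_fn(sources_path)  # cache miss: fetch fresh data into the folder
    return sources_path
```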
The caching algorithm is pretty simple: if there's something in the cache folder for the ETL, the system tries to use it. If not, the ETL will likely fail. To refresh an ETL's data source, all you have to do is call `extract-data-sources` on that ETL without the caching option.

## Why not make caching the default?
I don't want this change to impact the GitHub Actions or anyone's existing workflow. I do think this may be an incremental step toward caching more broadly, but those iterations should be done as a separate PR. This one is already big enough.
We may decide, in the future, to check the `data_pipeline/data/sources` folder into source control, or to allow the folder to be downloaded in zipped form from S3. That would give us consistency between every pipeline run, and allow us to vet and incorporate new data sources while understanding their impact on the score. In that case, caching would likely be turned on by default. (If you can't tell, this would be my ultimate recommendation, with this PR simply introducing the caching mechanism into the code to be better used later.)
## But what about the results of the pipeline?
I think we're good! I took time to diff the results of main vs. the results of this branch. Specifically, I looked at:
I don't have the ability to diff binary files, so a few of those were skipped. But the important stuff, like the CSV files and the tiles themselves, is identical.
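A tree diff of that sort can be scripted in a few lines. The sketch below is hypothetical (not the tooling actually used for this PR) and simply reports files whose bytes differ between two output trees, skipping files missing from either side:

```python
import filecmp
from pathlib import Path


def diff_output_trees(a: Path, b: Path) -> list:
    """Return the relative paths of files that differ between two
    pipeline output trees. Files present in only one tree are skipped."""
    differing = []
    for path in sorted(p for p in a.rglob("*") if p.is_file()):
        other = b / path.relative_to(a)
        # shallow=False forces a byte-by-byte comparison, not just stat info
        if other.is_file() and not filecmp.cmp(path, other, shallow=False):
            differing.append(str(path.relative_to(a)))
    return differing
```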
## And the tests? Bet you forgot about those...
I didn't forget! Previously, because there was no concept of caching, tests used a variety of directories to store their temporary contents. Now, like in the main application, we've standardized on a location for mocked test downloads (you can find it under `data_pipeline/data/tmp/tests`). Due to the way the tests work (magic constants!), this meant mocking the `get_sources_path` call in the ETL base class, and modifying a few tests to use that path. I believe I got all this right, and the tests run and pass with everything ending up in the right place, but I would love a second set of eyes.

## Any other comments on the way the code is structured?
Yep! To make it easier to review, here are a few things to keep in mind.

- Each ETL now declares its data sources in a `get_data_sources` method. The superclass implementation of `extract` automatically downloads those data sources, and child classes can call `super().extract(...)` to inherit that behavior.
- Data sources are defined in `data_pipeline.etl.datasource.py` and come in three flavors: a `FileDataSource` used to download single files, a `ZIPDataSource` used to download and extract zip files, and a `CensusDataSource` that calls the Census API. Each of those data sources knows how to retrieve its own content.
- There's a new `Downloader` to isolate the downloading behavior for ZIP and regular files.
- This change also standardizes the work done in the ETLs' `extract` methods. It was not at all consistent, with some ETLs performing almost no work in extract and some performing most of their work in extract. I'm not sure the balance is perfect yet, but it's better.

Thanks for making it this far. 🚀
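To give reviewers a mental model, the three data-source flavors described above might look roughly like this. The class names `FileDataSource` and `ZIPDataSource` come from the PR text, but every method body, signature, and the base-class shape below are assumptions, and `CensusDataSource` is omitted because it calls an external API:

```python
import shutil
import urllib.request
import zipfile
from abc import ABC, abstractmethod
from pathlib import Path


class DataSource(ABC):
    """Each data source knows how to retrieve its own content (sketch)."""

    @abstractmethod
    def fetch(self, destination: Path) -> None:
        ...


class FileDataSource(DataSource):
    """Downloads a single file into the destination folder."""

    def __init__(self, url: str, filename: str):
        self.url = url
        self.filename = filename

    def fetch(self, destination: Path) -> None:
        destination.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(self.url) as resp:
            with open(destination / self.filename, "wb") as out:
                shutil.copyfileobj(resp, out)


class ZIPDataSource(DataSource):
    """Downloads a zip archive and extracts it into the destination folder."""

    def __init__(self, url: str):
        self.url = url

    def fetch(self, destination: Path) -> None:
        destination.mkdir(parents=True, exist_ok=True)
        archive = destination / "download.zip"
        with urllib.request.urlopen(self.url) as resp:
            with open(archive, "wb") as out:
                shutil.copyfileobj(resp, out)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(destination)
```

In this shape, the base class's `extract` (not shown) would just loop over `get_data_sources()` and call `fetch` on each one, which matches the "knows how to retrieve its own content" description.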