Closed sverchdotgov closed 2 years ago
So far:
One of the things I'd like to do is figure out how much of the stack needs to be spun up before you have something useful (i.e. how integrated everything is), so I'm going to start from deployment and move up through the data pipeline and then to the frontend. If that's backwards from what someone would normally do, then maybe it'll yield something useful.
Only thing I'll note so far is that the instructions are Mac- and Windows-specific, with no mention of Linux, but maybe that's OK since a lot of what's documented is basic machine setup (like installing git).
In looking at https://github.com/usds/justice40-tool/blob/main/infrastructure/README.md, it seems like that documentation is more a set of notes that would be useful for maintainers of the infrastructure, rather than something as curated to an outside contributor as the other documentation was.
I don't think there's much here, maybe a simple intro. The site is deployed on AWS using the Serverless framework, and there are some Serverless functions in there, but I don't know what they do yet.
So far, I see:
But maybe this is not the thing we expect people to care most about getting involved with. In that case, maybe a simple link to the other parts of the stack (like "if you are interested in X, go here").
Moving onto the data pipeline here: https://github.com/usds/justice40-tool/blob/main/data/data-pipeline/README.md
Notes so far:
- There's both a `poetry run` command and `application.py`.
- I also can't figure out where the implementation is for `poetry run download_census`, which is in the docs.

I think the next step here is to try to run everything, both the docker and the local setup, and try to save everything in a runner script. I think there's some opportunity to consolidate here and replace documentation with a wrapper.
@vim-usds @esfoobar-usds The "runner" that I mentioned in slack was a response to the "list of commands" in this documentation: https://github.com/usds/justice40-tool/blob/main/data/data-pipeline/README.md#running-using-docker
Maybe not worth doing, but that's where the thought came from. I'm continuing to run through the steps. Docker compose worked fine and now I'm downloading the census data.
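To make the "runner" idea concrete, here's a rough sketch of what a wrapper over that list of commands could look like. The image name, mount path, and step names are copied from the docker commands quoted in this thread; treat all of them as illustrative, not as the project's canonical interface.

```python
import os
import subprocess

# Illustrative values, copied from the docker commands quoted in this
# thread -- not necessarily the project's full or canonical step list.
IMAGE = "j40_data_pipeline"
STEPS = ["etl-run", "score-run"]


def build_command(step: str) -> list[str]:
    """Build the docker invocation for one pipeline step."""
    data_dir = os.path.join(
        os.getcwd(), "data", "data-pipeline", "data_pipeline", "data"
    )
    return [
        "docker", "run", "--rm",
        "-v", f"{data_dir}:/data_pipeline/data",
        IMAGE, "python3", "-m", "data_pipeline.application", step,
    ]


def run_all() -> None:
    """Run every documented step in order, stopping on the first failure."""
    for step in STEPS:
        print(f"--- running {step} ---")
        subprocess.run(build_command(step), check=True)
```

Calling `run_all()` would replace the copy/paste list in the README, which is basically all the "runner" idea amounts to.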
I think I'm blocked on the non-docker setup, specifically on `brew install gdal`. Digging through the logs, this appears to be the root issue:
Error while reading the URL: http://test.opendap.org/dap/data/nc/fnoc1.nc.dds?.
The OPeNDAP server returned the following message:
Service Unavailable.
I will move on from that for now and stick to the docker setup. @esfoobar-usds @vim-usds I'll show you when we pair later.
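For reference, a quick way to check whether that endpoint is back up. The URL is copied from the error message above; whether the GDAL build actually depends on reaching it is my assumption.

```python
import urllib.request
import urllib.error

# The OPeNDAP endpoint that shows up in the `brew install gdal` failure
# above. URL copied from the error log; I'm assuming the build reaches
# out to it and fails when it's down.
URL = "http://test.opendap.org/dap/data/nc/fnoc1.nc.dds"


def endpoint_up(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

If `endpoint_up(URL)` returns False, retrying the install won't help until the server recovers.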
Current issue, on Step 3:
% docker run --rm -it -v ${PWD}/data/data-pipeline/data_pipeline/data:/data_pipeline/data j40_data_pipeline python3 -m data_pipeline.application score-run
2021-12-06 09:15:16,876 [data_pipeline.utils] INFO Initializing all score data
2021-12-06 09:15:16,877 [data_pipeline.etl.score.etl_score] INFO Loading data sets from disk.
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data-pipeline/data_pipeline/application.py", line 283, in <module>
cli()
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1137, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1062, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1668, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 763, in invoke
return __callback(*args, **kwargs)
File "/data-pipeline/data_pipeline/application.py", line 126, in score_run
score_generate()
File "/data-pipeline/data_pipeline/etl/runner.py", line 82, in score_generate
score_gen.extract()
File "/data-pipeline/data_pipeline/etl/score/etl_score.py", line 40, in extract
self.ejscreen_df = pd.read_csv(
File "/usr/local/lib/python3.8/dist-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 586, in read_csv
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 482, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 811, in __init__
self._engine = self._make_engine(self.engine)
File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/c_parser_wrapper.py", line 51, in __init__
self._open_handles(src, kwds)
File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/base_parser.py", line 222, in _open_handles
self.handles = get_handle(
File "/usr/local/lib/python3.8/dist-packages/pandas/io/common.py", line 702, in get_handle
handle = open(
FileNotFoundError: [Errno 2] No such file or directory: '/data-pipeline/data_pipeline/data/dataset/ejscreen_2019/usa.csv'
I did run step 2 to do the etl, and reran it for ejscreen:
% docker run --rm -it -v ${PWD}/data/data-pipeline/data_pipeline/data:/data_pipeline/data j40_data_pipeline python3 -m data_pipeline.application etl-run -d ejscreen
2021-12-06 09:14:56,424 [data_pipeline.etl.sources.ejscreen.etl] INFO Downloading EJScreen Data
2021-12-06 09:14:56,424 [data_pipeline.utils] INFO Downloading https://edap-arcgiscloud-data-commons.s3.amazonaws.com/EJSCREEN2020/EJSCREEN_Tract_2020_USPR.csv.zip
2021-12-06 09:14:58,492 [data_pipeline.utils] INFO Extracting /data-pipeline/data_pipeline/data/tmp/downloaded.zip
2021-12-06 09:14:59,165 [data_pipeline.etl.sources.ejscreen.etl] INFO Transforming EJScreen Data
2021-12-06 09:15:01,858 [data_pipeline.etl.sources.ejscreen.etl] INFO Saving EJScreen CSV
2021-12-06 09:15:02,937 [data_pipeline.utils] INFO Removing EJSCREEN_Tract_2020_USPR.csv
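One thing I noticed while staring at this (just a guess on my part): the `-v` flag mounts the host directory at `/data_pipeline/data`, but the traceback reads from `/data-pipeline/data_pipeline/data` (note the hyphen). If that's right, the ETL output is written outside the mount and disappears when the container exits. A tiny check, with both paths copied from the output above:

```python
import pathlib

# Paths copied from the docker command and the traceback above.
CONTAINER_MOUNT = "/data_pipeline/data"               # where -v mounts the host dir
CONTAINER_READ = "/data-pipeline/data_pipeline/data"  # where the app reads from


def mount_covers_read(mount: str, read: str) -> bool:
    """True if the path the app reads from lives under the mounted path."""
    return pathlib.PurePosixPath(read).is_relative_to(mount)


print(mount_covers_read(CONTAINER_MOUNT, CONTAINER_READ))  # → False
```

False here would mean the etl-run output never lands in the mounted host directory, which would explain the `FileNotFoundError` on the next run.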
So this is probably not as well described as it should be, but the Docker setup works out of the box as follows:
docker-compose up
This will kick off all the necessary client and frontend tasks and, after a couple of hours, will render a local map at localhost:8000. That is it.
Running the Docker commands is optional and only meant to be done if you want to regenerate specific parts of the data or client after you've done some changes (or pulled some from Github).
The whole infrastructure folder should be deleted at this point; we're not using that approach anymore. The original plan was to use AWS Lambda, but it was replaced by GitHub Actions.
I'll be in the zoom link until 2pm if you have any questions.
Thanks @esfoobar-usds! This all makes more sense.
The docs also lead to the `infrastructure` folder, and it sounds like they should instead lead to the GitHub Actions workflows directory.
In our discussion, these came up:
So I think we need to decide which pathways we want to invest in for newcomers of the project.
(dropping a note here that I've just made a PR to remove `infrastructure`: https://github.com/usds/justice40-tool/pull/996)
So I think a decision needs to be made that affects the definition of done for this issue.
How important is it for people to understand how to run individual stages locally? If we do care about this, I think there's some organization we can add to the docs to funnel people through a "quickstart" based on who they are. If we don't care about that flow, @esfoobar-usds has suggested just moving https://github.com/usds/justice40-tool/blob/main/data/data-pipeline/README.md#score-generation-and-comparison-workflow to a wiki and documenting the `docker compose` setup. Currently the docs funnel people to that page, which needs work either way.
More generally, which of the pathways below do we care most about supporting? The `docker compose` pathway is well supported (it worked for me perfectly on the first try), but I want to acknowledge that the way everyone on the team actually does development follows a different path.
CC @switzersc-usds, @esfoobar-usds mainly, @vim-usds, and @saran-ahluwalia on these questions and feedback on whether what I wrote below is actually correct.
Note: I call things the "Primary Development Loop" if it's what members of the team are currently actively using to do development.
- `docker compose` to set up everything locally, including the datasets. This is done as an extra step because we support it, but it's not in the critical path of feature development. The individual docker steps are only needed if there's a bug.
- `docker compose`. I'm assuming this is also done as an extra step in the process, where this is tested after new features have been added.

Turns out the GDAL install issue was actually just caused by downtime of that endpoint; I think it's working now: https://github.com/usds/justice40-tool/issues/949#issuecomment-986983779. So I'm going to continue to install GDAL and finish the local setup. The docker setup is still broken for me though: https://github.com/usds/justice40-tool/issues/949#issuecomment-986982799
Result of what we talked about in standup today:
The range of technical experience of the people who want to engage with this project is large, which is a challenge when it comes to figuring out how to create documentation that serves the needs of all potential contributors.
We decided to take this approach for now:
- `docker compose`.

Putting this horrible regex here for reference:
echo s3://justice40-data/data-sources/census.zip | sed 's/s3:\/\/\([^\/]*\)\/\(.*\)/https:\/\/\1.s3.us-east-1.amazonaws.com\/\2/'
https://justice40-data.s3.us-east-1.amazonaws.com/data-sources/census.zip
I'm thinking that, for the non-software-engineer users, having these URLs that they can click on to download the data might be useful, especially if we're splitting the docs to serve different levels of technical experience.
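The same s3-to-https conversion as the sed one-liner above, in plain Python for anyone who'd rather not read the regex. The us-east-1 region is hardcoded, just as it is in the sed version.

```python
def s3_to_https(s3_url: str, region: str = "us-east-1") -> str:
    """Convert s3://bucket/key into a clickable virtual-hosted-style URL."""
    bucket, _, key = s3_url.removeprefix("s3://").partition("/")
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"


print(s3_to_https("s3://justice40-data/data-sources/census.zip"))
# → https://justice40-data.s3.us-east-1.amazonaws.com/data-sources/census.zip
```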
Describe the task
Someone new to the project (like me, @sverchdotgov) runs through the end-to-end setup process of the project and documents all the pain points that we can improve on.
Acceptance Criteria
Additional context