usds / justice40-tool

A tool to identify disadvantaged communities due to environmental, socioeconomic and health burdens
https://screeningtool.geoplatform.gov/
Creative Commons Zero v1.0 Universal

Task: Review the level of effort and barriers to entry for a new contributor to onboard into the project #949

Closed sverchdotgov closed 2 years ago

sverchdotgov commented 2 years ago

Describe the task

Someone new to the project (like me, @sverchdotgov) runs through the end-to-end setup process of the project and documents all the pain points that we can improve on.

Acceptance Criteria

Additional context

sverchdotgov commented 2 years ago

So far:

One of the things I'd like to do is figure out how much of the stack needs to be spun up before you have something useful (i.e. how integrated everything is), so I'm going to start from deployment and move up through the data pipeline and then to the frontend. If that's backwards from how someone would normally do it, then maybe it'll yield something useful.

The only thing I'll note so far is that the instructions are Mac- and Windows-specific, with no mention of Linux, but maybe that's OK since a lot of what's documented is basic machine setup (like installing git).

sverchdotgov commented 2 years ago

Looking at https://github.com/usds/justice40-tool/blob/main/infrastructure/README.md, it seems like that documentation is more a set of notes useful for maintainers of the infrastructure than something curated for an outside contributor the way the other documentation is.

I don't think there's much needed here, maybe just a simple intro. The site is deployed on AWS using the Serverless framework, and there are some Serverless functions in there, but I don't know what they do yet.

So far, I see:

But maybe this is not the thing we expect people to care most about getting involved with. In that case, a simple link to the other parts of the stack might be enough (like "if you are interested in X, go here").

sverchdotgov commented 2 years ago

Moving on to the data pipeline here: https://github.com/usds/justice40-tool/blob/main/data/data-pipeline/README.md

Notes so far:

sverchdotgov commented 2 years ago

I think the next step here is to try to run everything, both the Docker setup and the local setup, and capture all the steps in a runner script. I think there's some opportunity to consolidate here and replace documentation with a wrapper.
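
As a rough sketch, the wrapper could be as small as this. The script name is made up; etl-run and score-run are the subcommands that show up later in this thread, and census-data-download is my guess at the step-1 subcommand from the README, so treat the whole thing as illustrative rather than tested:

#!/usr/bin/env bash
# run_pipeline.sh: hypothetical wrapper around the documented docker steps
set -euo pipefail

IMAGE=j40_data_pipeline
DATA_DIR="${PWD}/data/data-pipeline/data_pipeline/data"

step() {
  # run one data_pipeline.application subcommand inside the container
  docker run --rm -it -v "${DATA_DIR}:/data_pipeline/data" "${IMAGE}" \
    python3 -m data_pipeline.application "$@"
}

step census-data-download   # step 1: fetch census data (subcommand name assumed)
step etl-run                # step 2: run the ETL jobs
step score-run              # step 3: generate the score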

sverchdotgov commented 2 years ago

@vim-usds @esfoobar-usds The "runner" that I mentioned in slack was a response to the "list of commands" in this documentation: https://github.com/usds/justice40-tool/blob/main/data/data-pipeline/README.md#running-using-docker

Maybe not worth doing, but that's where the thought came from. I'm continuing to run through the steps. Docker compose worked fine and now I'm downloading the census data.

sverchdotgov commented 2 years ago

I think I'm blocked on the non-docker setup, specifically on brew install gdal. Digging through the logs, this appears to be the root issue:

Error while reading the URL: http://test.opendap.org/dap/data/nc/fnoc1.nc.dds?.
The OPeNDAP server returned the following message:
Service Unavailable.

I will move on from that for now and stick to the docker setup. @esfoobar-usds @vim-usds I'll show you when we pair later.

sverchdotgov commented 2 years ago

Current issue, on Step 3:

% docker run --rm -it -v ${PWD}/data/data-pipeline/data_pipeline/data:/data_pipeline/data j40_data_pipeline python3 -m data_pipeline.application score-run
2021-12-06 09:15:16,876 [data_pipeline.utils] INFO     Initializing all score data
2021-12-06 09:15:16,877 [data_pipeline.etl.score.etl_score] INFO     Loading data sets from disk.
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data-pipeline/data_pipeline/application.py", line 283, in <module>
    cli()
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/data-pipeline/data_pipeline/application.py", line 126, in score_run
    score_generate()
  File "/data-pipeline/data_pipeline/etl/runner.py", line 82, in score_generate
    score_gen.extract()
  File "/data-pipeline/data_pipeline/etl/score/etl_score.py", line 40, in extract
    self.ejscreen_df = pd.read_csv(
  File "/usr/local/lib/python3.8/dist-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 482, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 811, in __init__
    self._engine = self._make_engine(self.engine)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/c_parser_wrapper.py", line 51, in __init__
    self._open_handles(src, kwds)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/base_parser.py", line 222, in _open_handles
    self.handles = get_handle(
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/common.py", line 702, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: '/data-pipeline/data_pipeline/data/dataset/ejscreen_2019/usa.csv'

sverchdotgov commented 2 years ago

I did run step 2 to do the ETL, and reran it for ejscreen:

% docker run --rm -it -v ${PWD}/data/data-pipeline/data_pipeline/data:/data_pipeline/data j40_data_pipeline python3 -m data_pipeline.application etl-run -d ejscreen
2021-12-06 09:14:56,424 [data_pipeline.etl.sources.ejscreen.etl] INFO     Downloading EJScreen Data
2021-12-06 09:14:56,424 [data_pipeline.utils] INFO     Downloading https://edap-arcgiscloud-data-commons.s3.amazonaws.com/EJSCREEN2020/EJSCREEN_Tract_2020_USPR.csv.zip
2021-12-06 09:14:58,492 [data_pipeline.utils] INFO     Extracting /data-pipeline/data_pipeline/data/tmp/downloaded.zip
2021-12-06 09:14:59,165 [data_pipeline.etl.sources.ejscreen.etl] INFO     Transforming EJScreen Data
2021-12-06 09:15:01,858 [data_pipeline.etl.sources.ejscreen.etl] INFO     Saving EJScreen CSV
2021-12-06 09:15:02,937 [data_pipeline.utils] INFO     Removing EJSCREEN_Tract_2020_USPR.csv
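
One thing I notice comparing these two commands: the volume is mounted at /data_pipeline/data inside the container, but the logs show the pipeline reading and writing under /data-pipeline/data_pipeline/data. If those paths really don't line up, the etl-run output is written to the throwaway container filesystem (the containers run with --rm) rather than to the mounted host directory, which would explain why score-run can't find usa.csv. Assuming the image's code does live at /data-pipeline, something like this untested variant would line the paths up:

% docker run --rm -it -v ${PWD}/data/data-pipeline/data_pipeline/data:/data-pipeline/data_pipeline/data j40_data_pipeline python3 -m data_pipeline.application etl-run -d ejscreen
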
esfoobar-usds commented 2 years ago

So this is probably not as well described as it should be, but the Docker setup works out of the box as follows:
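
Going by the "Running using Docker" section of the README linked earlier, that presumably means a single compose invocation from the repo root:

% docker-compose up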

This will kick off all the necessary client and frontend tasks and, after a couple of hours, it will be able to render a local map at localhost:8000. That's it.

Running the Docker commands is optional and only meant to be done if you want to regenerate specific parts of the data or client after you've made some changes (or pulled some from GitHub).

The whole infrastructure folder should be deleted at this point; we're not using that approach anymore. The original plan was to deploy via AWS Lambda, but that was replaced by GitHub Actions.

I'll be in the zoom link until 2pm if you have any questions.

sverchdotgov commented 2 years ago

Thanks @esfoobar-usds! This all makes more sense.

The docs also lead to the infrastructure folder, and it sounds like they should instead lead to the GitHub Actions workflows directory.

sverchdotgov commented 2 years ago

In our discussion, these came up:

So I think we need to decide which pathways we want to invest in for newcomers to the project.

switzersc-usds commented 2 years ago

(dropping a note here that I've just made a PR to remove infrastructure: https://github.com/usds/justice40-tool/pull/996)

sverchdotgov commented 2 years ago

So I think a decision needs to be made that affects the definition of done for this issue.

Scope Question

How important is it for people to understand how to run individual stages locally? If we do care about this, I think there's some organization we can add to the docs to funnel people through a "quickstart" based on who they are. If we don't care about that flow, @esfoobar-usds has suggested just moving https://github.com/usds/justice40-tool/blob/main/data/data-pipeline/README.md#score-generation-and-comparison-workflow to a wiki and documenting the docker compose setup. Currently the docs funnel people to that page, which needs work either way.

More generally, which of the pathways below do we care most about supporting, in general? The docker compose pathway is well supported (it worked for me perfectly on the first try), but I want to acknowledge that the way everyone on the team actually does development follows a different path.

CC mainly @switzersc-usds and @esfoobar-usds, plus @vim-usds and @saran-ahluwalia, on these questions, and for feedback on whether what I wrote below is actually correct.

Current Usage Options

Note: I call something the "Primary Development Loop" if it's what members of the team are currently actively using for development.

sverchdotgov commented 2 years ago

Turns out the GDAL install issue was actually just caused by downtime of that endpoint; I think it's working now: https://github.com/usds/justice40-tool/issues/949#issuecomment-986983779. So I'm going to continue installing GDAL and finish the local setup. The docker setup is still broken for me, though: https://github.com/usds/justice40-tool/issues/949#issuecomment-986982799
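
As a sanity check on the downtime theory, something like this (a hypothetical spot-check, not from the docs) should now return a 200 instead of the earlier "Service Unavailable":

% curl -sI http://test.opendap.org/dap/data/nc/fnoc1.nc.dds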

sverchdotgov commented 2 years ago

Result of what we talked about in standup today:

The range of technical experience among the people who want to engage with this project is wide, which makes it a challenge to create documentation that serves the needs of all potential contributors.

We decided to take this approach for now:

sverchdotgov commented 2 years ago

Putting this horrible regex here for reference:

echo s3://justice40-data/data-sources/census.zip | sed 's/s3:\/\/\([^\/]*\)\/\(.*\)/https:\/\/\1.s3.us-east-1.amazonaws.com\/\2/'
https://justice40-data.s3.us-east-1.amazonaws.com/data-sources/census.zip
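
For anyone who'd rather not parse that sed, an equivalent using plain bash parameter expansion (a hypothetical helper producing the same output; the us-east-1 region is hardcoded just like in the regex version):

s3_to_https() {
  local path="${1#s3://}"      # strip the s3:// scheme
  local bucket="${path%%/*}"   # everything before the first slash
  local key="${path#*/}"       # everything after it
  echo "https://${bucket}.s3.us-east-1.amazonaws.com/${key}"
}

% s3_to_https s3://justice40-data/data-sources/census.zip
https://justice40-data.s3.us-east-1.amazonaws.com/data-sources/census.zip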

I'm thinking that for the non-software-engineer users, having URLs they can click on to download the data might be useful, especially if we're splitting the docs to serve different levels of technical experience.