usds / justice40-tool

A tool to identify disadvantaged communities due to environmental, socioeconomic and health burdens
https://screeningtool.geoplatform.gov/
Creative Commons Zero v1.0 Universal

Task: Review the level of effort and barriers to entry for a new contributor to onboard into the project #949

Closed sverchdotgov closed 2 years ago

sverchdotgov commented 2 years ago

Describe the task

Someone new to the project (like me, @sverchdotgov) runs through the end-to-end setup process of the project and documents all the pain points that we can improve on.

Acceptance Criteria

Additional context

sverchdotgov commented 2 years ago

So far:

One of the things I'd like to do is figure out how much of the stack needs to be spun up before you have something useful (i.e. how integrated everything is), so I'm going to start from deployment and move up through the data pipeline and then to the frontend. If that's backwards from how someone would normally do it, then maybe it'll yield something useful.

The only thing I'll note so far is that the instructions are Mac- and Windows-specific, with no mention of Linux, but maybe that's OK since a lot of what's documented is basic machine setup (like installing git).

sverchdotgov commented 2 years ago

Looking at https://github.com/usds/justice40-tool/blob/main/infrastructure/README.md, it seems like that documentation is more a set of notes useful for maintainers of the infrastructure than something curated for an outside contributor the way the other documentation is.

I don't think there's much needed here, maybe just a simple intro. The site is deployed on AWS using the Serverless framework, and there are some Serverless functions in there, but I don't know what they do yet.

So far, I see:

But maybe this is not the thing we expect people to care most about getting involved with. In that case, a simple link to the other parts of the stack might be enough (like "if you are interested in X, go here").

sverchdotgov commented 2 years ago

Moving on to the data pipeline here: https://github.com/usds/justice40-tool/blob/main/data/data-pipeline/README.md

Notes so far:

sverchdotgov commented 2 years ago

I think the next step here is to try to run everything, both the Docker setup and the local setup, and capture all the steps in a runner script. I think there's some opportunity to consolidate here and replace documentation with a wrapper.
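
As a rough sketch, the wrapper could be as small as this. The script name is made up; etl-run and score-run are the subcommands that show up later in this thread, and census-data-download is my guess at the step-1 subcommand from the README, so treat the whole thing as illustrative rather than tested:

#!/usr/bin/env bash
# run_pipeline.sh: hypothetical wrapper around the documented docker steps
set -euo pipefail

IMAGE=j40_data_pipeline
DATA_DIR="${PWD}/data/data-pipeline/data_pipeline/data"

step() {
  # run one data_pipeline.application subcommand inside the container
  docker run --rm -it -v "${DATA_DIR}:/data_pipeline/data" "${IMAGE}" \
    python3 -m data_pipeline.application "$@"
}

step census-data-download   # step 1: fetch census data (subcommand name assumed)
step etl-run                # step 2: run the ETL jobs
step score-run              # step 3: generate the score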

sverchdotgov commented 2 years ago

@vim-usds @esfoobar-usds The "runner" that I mentioned in slack was a response to the "list of commands" in this documentation: https://github.com/usds/justice40-tool/blob/main/data/data-pipeline/README.md#running-using-docker

Maybe not worth doing, but that's where the thought came from. I'm continuing to run through the steps. Docker compose worked fine and now I'm downloading the census data.

sverchdotgov commented 2 years ago

I think I'm blocked on the non-docker setup, specifically on brew install gdal. Digging through the logs, this appears to be the root issue:

Error while reading the URL: http://test.opendap.org/dap/data/nc/fnoc1.nc.dds?.
The OPeNDAP server returned the following message:
Service Unavailable.

I will move on from that for now and stick to the docker setup. @esfoobar-usds @vim-usds I'll show you when we pair later.

sverchdotgov commented 2 years ago

Current issue, on Step 3:

% docker run --rm -it -v ${PWD}/data/data-pipeline/data_pipeline/data:/data_pipeline/data j40_data_pipeline python3 -m data_pipeline.application score-run
2021-12-06 09:15:16,876 [data_pipeline.utils] INFO     Initializing all score data
2021-12-06 09:15:16,877 [data_pipeline.etl.score.etl_score] INFO     Loading data sets from disk.
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data-pipeline/data_pipeline/application.py", line 283, in <module>
    cli()
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/data-pipeline/data_pipeline/application.py", line 126, in score_run
    score_generate()
  File "/data-pipeline/data_pipeline/etl/runner.py", line 82, in score_generate
    score_gen.extract()
  File "/data-pipeline/data_pipeline/etl/score/etl_score.py", line 40, in extract
    self.ejscreen_df = pd.read_csv(
  File "/usr/local/lib/python3.8/dist-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 482, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 811, in __init__
    self._engine = self._make_engine(self.engine)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/c_parser_wrapper.py", line 51, in __init__
    self._open_handles(src, kwds)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/base_parser.py", line 222, in _open_handles
    self.handles = get_handle(
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/common.py", line 702, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: '/data-pipeline/data_pipeline/data/dataset/ejscreen_2019/usa.csv'

sverchdotgov commented 2 years ago

I did run step 2 to do the ETL, and reran it for ejscreen:

% docker run --rm -it -v ${PWD}/data/data-pipeline/data_pipeline/data:/data_pipeline/data j40_data_pipeline python3 -m data_pipeline.application etl-run -d ejscreen
2021-12-06 09:14:56,424 [data_pipeline.etl.sources.ejscreen.etl] INFO     Downloading EJScreen Data
2021-12-06 09:14:56,424 [data_pipeline.utils] INFO     Downloading https://edap-arcgiscloud-data-commons.s3.amazonaws.com/EJSCREEN2020/EJSCREEN_Tract_2020_USPR.csv.zip
2021-12-06 09:14:58,492 [data_pipeline.utils] INFO     Extracting /data-pipeline/data_pipeline/data/tmp/downloaded.zip
2021-12-06 09:14:59,165 [data_pipeline.etl.sources.ejscreen.etl] INFO     Transforming EJScreen Data
2021-12-06 09:15:01,858 [data_pipeline.etl.sources.ejscreen.etl] INFO     Saving EJScreen CSV
2021-12-06 09:15:02,937 [data_pipeline.utils] INFO     Removing EJSCREEN_Tract_2020_USPR.csv
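
One thing I notice comparing these two commands: the volume is mounted at /data_pipeline/data inside the container, but the logs show the pipeline reading and writing under /data-pipeline/data_pipeline/data. If those paths really don't line up, the etl-run output is written to the throwaway container filesystem (the containers run with --rm) rather than to the mounted host directory, which would explain why score-run can't find usa.csv. Assuming the image's code does live at /data-pipeline, something like this untested variant would line the paths up:

% docker run --rm -it -v ${PWD}/data/data-pipeline/data_pipeline/data:/data-pipeline/data_pipeline/data j40_data_pipeline python3 -m data_pipeline.application etl-run -d ejscreen
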
esfoobar-usds commented 2 years ago

So this is probably not as well described as it should be, but the Docker setup works out of the box as follows:
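
Going by the "Running using Docker" section of the README linked earlier, that presumably means a single compose invocation from the repo root:

% docker-compose up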

This will kick off all the necessary client and frontend tasks and, after a couple of hours, it will be able to render a local map at localhost:8000. That's it.

Running the Docker commands is optional and only meant to be done if you want to regenerate specific parts of the data or client after you've made some changes (or pulled some from GitHub).

The whole infrastructure folder should be deleted at this point; we're not using that approach anymore. The original plan was to deploy via AWS Lambda, but that was replaced by GitHub Actions.

I'll be in the zoom link until 2pm if you have any questions.

sverchdotgov commented 2 years ago

Thanks @esfoobar-usds! This all makes more sense.

The docs also lead to the infrastructure folder, and it sounds like they should instead lead to the GitHub Actions workflows directory.

sverchdotgov commented 2 years ago

In our discussion, these came up:

So I think we need to decide which pathways we want to invest in for newcomers to the project.

switzersc-usds commented 2 years ago

(dropping a note here that I've just made a PR to remove infrastructure: https://github.com/usds/justice40-tool/pull/996)

sverchdotgov commented 2 years ago

So I think a decision needs to be made that affects the definition of done for this issue.

Scope Question

How important is it for people to understand how to run individual stages locally? If we do care about this, I think there's some organization we can add to the docs to funnel people through a "quickstart" based on who they are. If we don't care about that flow, @esfoobar-usds has suggested just moving https://github.com/usds/justice40-tool/blob/main/data/data-pipeline/README.md#score-generation-and-comparison-workflow to a wiki and documenting the docker compose setup. Currently the docs funnel people to that page, which needs work either way.

More generally, which of the pathways below do we care most about supporting, in general? The docker compose pathway is well supported (it worked for me perfectly on the first try), but I want to acknowledge that the way everyone on the team actually does development follows a different path.

CC mainly @switzersc-usds and @esfoobar-usds, plus @vim-usds and @saran-ahluwalia, on these questions, and for feedback on whether what I wrote below is actually correct.

Current Usage Options

Note: I call something the "Primary Development Loop" if it's what members of the team are currently actively using for development.

sverchdotgov commented 2 years ago

Turns out the GDAL install issue was actually just caused by downtime of that endpoint; I think it's working now: https://github.com/usds/justice40-tool/issues/949#issuecomment-986983779. So I'm going to continue installing GDAL and finish the local setup. The docker setup is still broken for me, though: https://github.com/usds/justice40-tool/issues/949#issuecomment-986982799
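
As a sanity check on the downtime theory, something like this (a hypothetical spot-check, not from the docs) should now return a 200 instead of the earlier "Service Unavailable":

% curl -sI http://test.opendap.org/dap/data/nc/fnoc1.nc.dds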

sverchdotgov commented 2 years ago

Result of what we talked about in standup today:

The range of technical experience among the people who want to engage with this project is wide, which makes it a challenge to create documentation that serves the needs of all potential contributors.

We decided to take this approach for now:

sverchdotgov commented 2 years ago

Putting this horrible regex here for reference:

echo s3://justice40-data/data-sources/census.zip | sed 's/s3:\/\/\([^\/]*\)\/\(.*\)/https:\/\/\1.s3.us-east-1.amazonaws.com\/\2/'
https://justice40-data.s3.us-east-1.amazonaws.com/data-sources/census.zip
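
For anyone who'd rather not parse that sed, an equivalent using plain bash parameter expansion (a hypothetical helper producing the same output; the us-east-1 region is hardcoded just like in the regex version):

s3_to_https() {
  local path="${1#s3://}"      # strip the s3:// scheme
  local bucket="${path%%/*}"   # everything before the first slash
  local key="${path#*/}"       # everything after it
  echo "https://${bucket}.s3.us-east-1.amazonaws.com/${key}"
}

% s3_to_https s3://justice40-data/data-sources/census.zip
https://justice40-data.s3.us-east-1.amazonaws.com/data-sources/census.zip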

I'm thinking that for the non-software-engineer users, having URLs they can click on to download the data might be useful, especially if we're splitting the docs to serve different levels of technical experience.