usds / justice40-tool

A tool to identify disadvantaged communities due to environmental, socioeconomic and health burdens
https://screeningtool.geoplatform.gov/
Creative Commons Zero v1.0 Universal

As a developer or data scientist, I want to read documentation on how to get started with the Python notebooks in this repo so that I can play with the data and related code myself. #198

Open · switzersc-usds opened this issue 3 years ago

switzersc-usds commented 3 years ago

Description

We need documentation on our Python notebooks and data pipeline so that data scientists can read, use, and/or integrate with our notebooks and data.

Solution

Beginner-friendly documentation in this GitHub repo, geared towards data scientists and developers.

Describe alternatives you've considered

No documentation, which would be a poor decision for empowering the community.

Links to user research or other resources

Tasks

Definition of "Done"

switzersc-usds commented 3 years ago

Please add documentation to the score directory, either in the README.md or in a new docs subdirectory.

switzersc-usds commented 3 years ago

How can we make this more usable by EJ data experts, and how can we incorporate their feedback?

switzersc-usds commented 3 years ago

Public usability, and moving from data to conclusions as quickly as possible. Point-and-click methods will make this more accessible than Jupyter notebooks. There's a tension between the usability and the usefulness of the data. The client will be very interactive, but how can we improve this open source data analysis stage? What's the middle ground that makes this as usable as possible without getting into client roadmap development?

Ideas for usable components:

Ideas for use:

widal001 commented 3 years ago

Building on the suggestion from @clayton-aldern for making Jupyter notebooks interactive: I've done some work with Binder, which rebuilds your repo in a Docker container and makes the Python notebooks within it executable. A rebuild can take 5+ minutes, and could take longer given how much we have in this repo in addition to the IPython notebooks, but I believe it is an open source project -- so it's an alternative to Google if we want to avoid vendor lock-in. Here's a tutorial on how to leverage Binder for Python Jupyter notebooks.
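For example, launching this repo on the public mybinder.org service is typically just a matter of adding a badge to the README. A minimal sketch, assuming the default branch is `main`:

```markdown
<!-- Hypothetical README badge; the branch name is an assumption -->
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/usds/justice40-tool/main)
```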

The Alan Turing Institute, which contributes to the Binder project, also has some really great interactive guides on creating reproducible research that could be useful guidance for this issue as well.

clayton-aldern commented 3 years ago

Reflecting a bit more on the above, @switzersc-usds + @widal001: I'm wondering if there's a world in which we can cut out some steps for data scientists or EJ advocates who want to play around with the data/code here but don't necessarily want to run a notebook locally. I forgot about Binder! It's awesome (if perhaps a little clunky?). @widal001, do you know if you can push notebook changes in a Binder instance back to GitHub? One advantage of the Colab approach is that you can keep the notebooks checked in to version control. Anyway, maybe the broader point is that we could add some documentation around this kind of thing for folks who'd rather just explore in the browser -- offer a couple of options?
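For reference, Colab supports a similar launch-from-GitHub pattern. A sketch, where the branch and notebook path are placeholders rather than real paths in this repo:

```markdown
<!-- Hypothetical "Open in Colab" badge; branch and notebook path are placeholders -->
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/usds/justice40-tool/blob/main/path/to/notebook.ipynb)
```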

clayton-aldern commented 3 years ago

Re: #198 more generally, the other thing that comes immediately to mind on the documentation front would be a tutorial in the README.md file (i.e. the one in score) that illustrates some use cases of that directory for data scientists? More of a 'here's why you might want to play around in this section of the repo', as opposed to only 'here's how to do it'. Might also help narrow the scope of the score directory for open-source folks?

widal001 commented 3 years ago

+1 to the suggestion for explaining why someone might want to play with the code in the score directory. I've also found that documentation within the notebooks can be helpful as a mechanism for stating the assumptions that underlie a model, or for providing the context and guidance necessary to interpret its results.

In addition to stating the underlying assumptions, notebooks can be leveraged to inspire confidence in the data cleaning and modeling scripts by making select unit tests of the code more discoverable. There are a few ways of doing this, but ipytest is a reasonably mature package that lets you run pytest-style tests defined within the notebook. Another, simpler option would be to use the ! prefix within a cell to execute a shell command like !pytest, which would output the pytest results to the notebook. More discussion of these strategies can be found in this Stack Exchange thread.
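A minimal sketch of the ipytest pattern (the test and the `score` data here are hypothetical, purely to show the mechanics):

```python
# Notebook cell: run pytest-style tests inline with ipytest.
import ipytest
import pandas as pd

ipytest.autoconfig()  # let ipytest collect tests defined in this notebook

def test_scores_are_normalized():
    # Hypothetical assertion: computed scores should fall within [0, 1].
    df = pd.DataFrame({"score": [0.0, 0.5, 1.0]})
    assert df["score"].between(0, 1).all()

ipytest.run()  # executes the tests above and prints pytest's report in the cell

# The simpler alternative mentioned above: shell out to the existing test suite.
# !pytest
```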

RE: committing changes that people make to the interactive Jupyter notebooks -- this isn't possible through Binder, since all of the changes live in a Docker container that gets destroyed after someone ends the session. I view this as an advantage: it prevents accidental contributions to the repo that might be possible if edits to a Colab were allowed to be committed (even as a copy), and it ensures that all changes continue to follow the guidelines in CONTRIBUTING.md. That said, if it's possible to isolate changes from Colab to a branch, I can imagine a use case where that kind of less-structured contribution would be valuable.

rohitmusti commented 3 years ago

I might be misunderstanding the use case of playing around with the data, but I'm curious how many data advocates and researchers would prefer using a Docker container over just installing the libraries and downloading the data locally.

I am only raising the question to limit potential future documentation updates.