inmates

a tool for collating inmate rosters

This project is being conducted on behalf of Chicago Community Bond Fund.

Problem Space

Chicago Community Bond Fund is struggling to meet capacity. New donations have increased its ability to carry out its mission at a much larger scale as it grows its efforts state-wide.

One bottleneck is the set of county sites that publish inmate information. The current process is to manually check these sites to see whether they have new information, which is then used for advocacy purposes.

It is difficult for volunteers to track the current list of county sites to check.

It is even more difficult for a person to verify whether or not a given site has new information, and more difficult still to combine that manually collected information into actionable data.

Luckily, these are all spaces where an automated solution can greatly increase the efficiency of human efforts!

Goals

Contributing

There are many ways to contribute to any of the above goals!

The biggest current need is for code contributions to begin scraping the data from the various county websites. These are currently documented here.

This work is in its early stages, so expect more detailed process documentation to follow.

Each website will require its own scraper (called a "spider" in Scrapy, the Python web-scraping framework we're using) of varying complexity.

Setup

This project is written in Python and currently uses make as a build tool (see this post for getting make on Windows). If you would like to render a deployable artifact, you will need Docker. To start, though, a virtualenv will do. Run the following to build your venv and source its context:
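The exact target name lives in the project's Makefile; as a rough manual sketch (assuming a standard editable install, which is not necessarily what the Makefile target does):

# Rough manual equivalent; the Makefile's own target may differ.
python3 -m venv venv
source venv/bin/activate
pip install -e .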

Now you can run pip list to see that the inmates CLI tool was installed. Execute the inmates command to see available subcommands (inmates csv -c 'Roster Link' can be helpful).

Development

In service of the aforementioned goals, a scrapy.Spider for each county will be created. Nearly everything needed to develop such a scraper is provided within the project. Please see an overview of the project layout below (generated via make tree; use make commands to see all available commands):

inmates/
├── commissary
│   ├── adams.pdf
│   ├── ...
│   └── woodford.html
├── inmates
│   ├── cli.py
│   ├── commands
│   │   ├── ...
│   │   └── cmd_somecommand.py
│   ├── scraper
│   │   ├── items.py
│   │   ├── ...
│   │   ├── settings.py
│   │   └── spiders
│   │       ├── adams.py
│   │       ├── ...
│   │       └── woodford.py
│   └── utils.py
└── tests
    ├── fixtures
    │   ├── adams.json
    │   ├── ...
    │   └── woodford.json
    ├── test_adams.py
    ├── ...
    └── test_woodford.py

When scraping data from a roster, here are four components at play:

The 'sites to be scraped' are all housed in the "commissary/" directory, so there's less need to reach out to the World Wide Web. Spiders live in the "inmates/scraper/spiders/" directory, and there is to be one for every site in commissary/. To get started on a new spider, simply run the following:

make new-spider NAME=new

where "new" is the name of your NewSpider at "inmates/scraper/spiders/new.py". Use the FORCE=true flag if you'd like to overwrite an existing spider.

To run spiders, use the inmates collate subcommand. Results can be saved using the -o/--outdir option, which accepts a path to a directory where collected records are to be stored. An individual spider can be run using the -r/--roster option. A helpful approach while developing a spider's .parse method is to set "breakpoints": placing ipdb.set_trace(context=15) in the path of code to be executed will pause execution once the "breakpoint" is hit (the context keyword governs how much surrounding code is visible while execution is paused). With execution paused, defined variables can be probed and response parsing can be explored in the opened shell.
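Concretely, dropping such a breakpoint into a spider's .parse looks roughly like this (the class name and selectors are examples, not the contents of the real adams.py):

import ipdb
import scrapy


class AdamsSpider(scrapy.Spider):  # example only; the real adams.py may differ
    name = "adams"

    def parse(self, response):
        # Execution pauses here; probe `response` and try selectors in the shell.
        ipdb.set_trace(context=15)
        for row in response.css("table tr"):
            yield {"name": row.css("td::text").get()}

Running just that spider, e.g. with inmates collate -r adams (assuming the roster is addressed by its spider name), will open the shell once the roster's response reaches parse.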

Deployment

As of issue #11, spiders are set via CI to crawl on a cron cadence and deposit results in an AWS S3 bucket. The following command can be used at runtime to invoke live crawling for all spiders:

make scraper-run

The default behavior of this command is to output only to stdout. If output is to be collected, set the $LIVESITE_PARSED_OUTPUT_DIR environment variable to a directory where results will be deposited.
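For example, to collect results locally (the directory name is illustrative, and this assumes the variable is read at invocation time):

LIVESITE_PARSED_OUTPUT_DIR=./scraped-output make scraper-run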