
Tyler Technologies Odyssey scraper and parser

This is a scraper to collect and process public case records from the Tyler Technologies Odyssey court records system. If you are a dev or want to file an Issue, please read CONTRIBUTING.

Local setup

Install toolchain

  1. Clone this repo and navigate to it.
    • git clone https://github.com/open-austin/indigent-defense-stats
    • cd indigent-defense-stats
  2. Install pyenv if not already installed (Linux, macOS, or Windows)
  3. Run pyenv install to get the right Python version (run with no arguments, pyenv installs the version pinned in the repo's .python-version file)
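To confirm pyenv resolved the interpreter you expect, a quick generic check (plain Python, nothing project-specific) is:

```python
# Print the running interpreter's version and location. The version should
# match the one pinned in the repo's .python-version file, and the path
# should point at a pyenv-managed installation rather than the system Python.
import sys

print(sys.version)     # full version string of the running interpreter
print(sys.executable)  # filesystem path of the interpreter
```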

Setup venv

First, you'll need to create a virtual environment; the exact command differs depending on your OS.

On Linux/macOS

python -m venv .venv --prompt ids # (you can replace `ids` with any name you want)

On Windows

python -m venv .venv --prompt ids # (run from the repo root; you can replace `ids` with any name you want)

Next, you'll need to "activate" the venv. You'll need to run the activation command every time you work in the codebase, and you'll also want to point your IDE at the venv's Python interpreter, since it will likely default to wherever python resolves to on your system path. The specific command you run depends on both your OS and shell.

| Platform | Shell      | Command to activate virtual environment |
|----------|------------|------------------------------------------|
| POSIX    | bash/zsh   | `$ source .venv/bin/activate`            |
|          | fish       | `$ source .venv/bin/activate.fish`       |
|          | csh/tcsh   | `$ source .venv/bin/activate.csh`        |
|          | PowerShell | `$ .venv/bin/Activate.ps1`               |
| Windows  | cmd.exe    | `C:\> .venv\Scripts\activate.bat`        |
|          | PowerShell | `PS C:\> .venv\Scripts\Activate.ps1`     |

(These paths assume the `.venv` directory created above.)

Source: https://docs.python.org/3/library/venv.html#how-venvs-work

Note: Again, you'll need to activate the venv every time you want to work in the codebase.
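To confirm activation worked, one quick check (plain Python, not specific to this project) is to compare sys.prefix against sys.base_prefix:

```python
# Inside an active venv, sys.prefix points at the venv directory,
# while sys.base_prefix points at the interpreter the venv was made from.
import sys

print(sys.prefix != sys.base_prefix)  # True when a virtual environment is active
```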

If the above doesn't work, try these instructions for creating and activating a virtual environment:

  1. Navigate to your project directory: cd [insert file path]
  2. Create a virtual environment: python -m venv venv
  3. Activate the virtual environment: .\venv\Scripts\activate.bat

Install python dependencies

Using pip, install the project dependencies.

pip install -r requirements.txt

Running the CLI

@TODO - this section needs to be updated.

  1. Set parameters for the main command:
    • counties = The counties listed in the county CSV (resources/texas_county_data.csv). Set the "scrape" column to "yes" for each county you want to include.
    • start_date = The first date you want to scrape for case data. Update in scraper.
    • end_date = The last date you want to scrape for case data. Update in scraper.
  2. Run the handler.
    • python src/orchestrator

Structure of Code

Flowchart: Relationships Between Functions and Directories

flowchart TD
    orchestrator{"src/orchestrator (class): <br> orchestrate (function)"} --> county_db[resources/texas_county_data.csv]
    county_db  --> |return counties where 'scrape' = 'yes'| orchestrator
    orchestrator -->|loop through these counties <br> and run these four functions| scraper(1. src/scraper: scrape)
    scraper --> parser(2. src/parser: parse)
    scraper --> |create 1 HTML per case| data_html[data/county/case_html/case_id.html]
    parser --> pre2017(src/parser/pre2017)
    parser --> post2017(src/parser/post2017)
    pre2017 --> cleaner[3. src/cleaner: clean]
    post2017 --> cleaner
    parser --> |create 1 JSON per case| data_json[data/county/case_json/case_id.json]
    cleaner --> |look for charge in db<br>and normalize it to uccs| charge_db[resources/umich-uccs-database.json]
    charge_db --> cleaner
    cleaner --> updater(4. src/updater: update)
    cleaner --> |create 1 JSON per case| data_json_cleaned[data/county/case_json_cleaned/case_id.json]
    updater --> |send final cleaned JSON to CosmosDB container| CosmosDB_container[CosmosDB container]
    CosmosDB_container --> visualization{live visualization}
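
To make the flow concrete, here is a minimal Python sketch of the loop the flowchart describes. The stub functions stand in for the real implementations in src/scraper, src/parser, src/cleaner, and src/updater, and the CSV column names ("county", "scrape") are assumptions based on the diagram, so treat this as an illustration rather than the project's actual API:

```python
# Illustrative sketch only: the function names mirror the flowchart's four
# stages, but the real signatures and module layout may differ.
import csv


def scrape(county: str, start_date: str, end_date: str) -> None:
    ...  # 1. fetch case pages -> data/<county>/case_html/<case_id>.html


def parse(county: str) -> None:
    ...  # 2. HTML -> data/<county>/case_json/<case_id>.json (pre/post-2017 parsers)


def clean(county: str) -> None:
    ...  # 3. normalize charges against resources/umich-uccs-database.json


def update(county: str) -> None:
    ...  # 4. send the cleaned JSON to the CosmosDB container


def orchestrate(start_date: str, end_date: str) -> None:
    with open("resources/texas_county_data.csv", newline="") as f:
        # 'county' and 'scrape' are assumed column names, per the flowchart
        counties = [
            row["county"]
            for row in csv.DictReader(f)
            if row.get("scrape", "").strip().lower() == "yes"
        ]
    for county in counties:
        scrape(county, start_date, end_date)
        parse(county)
        clean(county)
        update(county)


if __name__ == "__main__":
    orchestrate("2024-01-01", "2024-12-31")  # example date range
```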