sc3 / cookcountyjail

A Django app that tracks the population of Cook County Jail over time and summarizes trends.
http://cookcountyjail.recoveredfactory.net/api/1.0/?format=json

Implement a V2 Database loader that uses the Raw Inmate Data #445

Open nwinklareth opened 10 years ago

nwinklareth commented 10 years ago

This is related to #395.

The first two versions of the Cook County Jail Scraper stored information directly in the V1.0 API database. The third version of the Scraper will no longer do that. Loading a database will therefore require writing a loader that fetches the raw inmate data from http://cookcountyjail.recoveredfactory.net/raw_inmate_data, where it is stored by day in CSV format, and maps it into the schema of the database.

The V2 loader needs to be idempotent, meaning that it can load the same file multiple times and only one set of changes will be stored in the database. Given this property, a loader could simply fetch all of the data files and keep reloading the database. While this would work, it is a very wasteful and costly way to implement the functionality. Loaders should therefore remember the last data file they loaded and only load files created after it.
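As a rough illustration of the two ideas above (idempotent loading, and skipping files that were already loaded), here is a minimal sketch. It is not the project's implementation: the `booking_id` key, the dict standing in for the database, and the function names `snapshots_to_load` and `load_snapshot` are all assumptions made for the example.

```python
from datetime import date


def snapshots_to_load(available_dates, last_loaded):
    """Return the snapshot dates newer than the last one loaded, oldest first.

    `last_loaded` is None when nothing has been loaded yet.
    """
    return sorted(
        d for d in available_dates if last_loaded is None or d > last_loaded
    )


def load_snapshot(store, rows):
    """Upsert each row by a hypothetical 'booking_id' key.

    `store` is a plain dict standing in for the database. Because rows are
    keyed rather than appended, loading the same file twice leaves the
    store unchanged.
    """
    for row in rows:
        store[row["booking_id"]] = dict(row)
```

Keying each record on a stable identifier is what makes the reload harmless; in a real Django loader the same effect would come from an update-or-create style write rather than a dict assignment.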

The Scraper will not signal when it has finished scraping, nor is its finishing time fixed, so the only way to detect whether a new data file is available is through polling. The Scraper starts at 8:30 am CST. To reduce the load on the server, limit polling requests to one every 15 minutes and, once processing has finished, do not poll again until after 10:30 am CST the next day.
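The polling rule above could be expressed roughly as follows. This is only a sketch: the function name `next_poll` and the `loaded_today` flag are hypothetical, and the datetimes are assumed to be naive values already expressed in the jail's local (CST) time.

```python
from datetime import datetime, time, timedelta

POLL_INTERVAL = timedelta(minutes=15)  # minimum gap between polling requests
RESUME_AT = time(10, 30)               # earliest poll time on the following day


def next_poll(now, loaded_today):
    """Return the earliest time the loader may poll again.

    `now` is a naive datetime assumed to be in CST. If today's file has
    already been processed, wait until 10:30 am the next day; otherwise
    try again after the 15 minute interval.
    """
    if loaded_today:
        return datetime.combine(now.date() + timedelta(days=1), RESUME_AT)
    return now + POLL_INTERVAL
```

For example, a loader that finds no new file at 11:00 am would try again at 11:15 am, while one that has just finished processing the day's file would sleep until 10:30 am the following day.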

For a number of reasons there is no guarantee of daily snapshots. If a few snapshots are missed there will be a gap in the daily snapshots, with the Scraper recording only the last date found. This snapshot will contain booked inmates from the missed days, but not all of them. This gap only affects the lookup logic for raw inmate data and should have no effect on normal processing.

Two caveats

There is an open question of whether the loader does inmate discharging. If it does handle discharging, then it needs to take the set of active inmates in the database and subtract the set of inmates in the data file. That difference, the active inmates who no longer appear in the data file, is the set of discharged inmates.

If it does not, then the loader must signal when it has finished processing a single file so that discharging can occur; this cycle repeats until all data files have been processed. The detection logic is the same.
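Whichever component ends up doing the discharging, the detection step is a set difference. A minimal sketch, assuming inmates are compared by some stable identifier (the function and parameter names here are hypothetical):

```python
def discharged_inmates(active_ids, snapshot_ids):
    """Return the ids that are active in the database but absent from the file.

    Both arguments are iterables of inmate identifiers; the identifiers
    themselves are an assumption, since the issue does not name a key.
    """
    return set(active_ids) - set(snapshot_ids)
```

Note the direction of the difference: inmates who appear in the data file but not in the database are new bookings, not discharges, so they must not be included in the result.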