sc3 / cookcountyjail

A Django app that tracks the population of Cook County Jail over time and summarizes trends.
http://cookcountyjail.recoveredfactory.net/api/1.0/?format=json

should we implement a concurrent model for v2 API? #421

Closed bepetersn closed 10 years ago

bepetersn commented 10 years ago

It seems to me the answer depends on what v2 does. The main benefit of the concurrent model is network-related stuff. On the one hand, v2 API is basically just a database with a mouth. On the other hand, I'm not sure.

bepetersn commented 10 years ago

The way that I thought of this was in considering what these old logging functions (copied over from v1 API) still mean in the context of a separate entity API, basically #422. Sure, the API could send back responses, but does that mean it needs a queue to handle this interface? Otherwise, ... it would take a rather long time to respond, in all likelihood, right? From the scraper's perspective, this isn't necessarily a big deal, since it doesn't block for stuff like that.

Or, if we don't want to do this, is logging just not possible for the database aspect of our project? Imagine what happens if the API gets an error, then. How does the scraper even know? I feel like we've introduced so many edge cases, it makes me dizzy. @wilbertom

nwinklareth commented 10 years ago

For the specific case of POSTs that update the db as the scraper operates: with the new scraper model of archiving the data, the POST API calls go away, so there is no need for concurrency to handle this case and no nasty edge case to worry about. Hurray.

In the general case, we basically fetch data, do a bit of processing on it, and return it to the caller, so no specialized concurrent processing a la the Scraper is needed. That means no concurrent processing is needed for the 2.0 Web App.

nwinklareth commented 10 years ago

The process that updates the 2.0 database may benefit from concurrent programming. In the CPython world, concurrent processing works best when you can do computational processing in one task while the machine is doing I/O for another. Here, I/O occurs when input is read from the archive file, be it on disk or over the network, and when the database is read or written. Whenever I/O is being performed, you would like to switch to another task and do some computation.

Additionally, we know that there is no dependency between inmate entries. In fact, we could process all of them in parallel if we wanted to. We also know that processing an inmate's entry has a serial part and a parallel part. The serial part has to do with the Stay portion, and the logic looks like this:

  1. If the Stay does not exist, then go to step 3
  2. Check if the discharged field is set; if it is, clear it, save the entry, and done
  3. If the person does not exist, then create it
  4. Create a stay entry using the person id, and done
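
The serial steps above can be sketched roughly as follows. This is only an illustration, not the project's code: `InMemoryStore`, `Stay`, `resolve_stay`, `booking_id`, and `person_hash` are hypothetical names, and plain dicts stand in for the Django models. It also assumes that an existing Stay whose discharged field is not set needs no change.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, Optional


@dataclass
class Stay:
    id: int
    person_id: int
    discharged: Optional[str] = None  # None while the inmate is still in jail


class InMemoryStore:
    """Hypothetical stand-in for the Person and Stay tables."""

    def __init__(self):
        self.persons: Dict[str, int] = {}  # person hash -> person id
        self.stays: Dict[str, Stay] = {}   # booking id -> Stay
        self.ids = count(1)                # shared id generator for the sketch


def resolve_stay(store: InMemoryStore, booking_id: str, person_hash: str) -> int:
    """Return the stay id for an inmate entry, creating rows as needed."""
    stay = store.stays.get(booking_id)
    if stay is not None:
        # Step 2: the inmate is present again, so clear any discharge marker.
        if stay.discharged is not None:
            stay.discharged = None  # a real model would call save() here
        return stay.id
    # Step 3: create the person if it does not exist yet.
    person_id = store.persons.get(person_hash)
    if person_id is None:
        person_id = next(store.ids)
        store.persons[person_hash] = person_id
    # Step 4: create the stay entry using the person id.
    stay = Stay(id=next(store.ids), person_id=person_id)
    store.stays[booking_id] = stay
    return stay.id
```

The function returns the stay id in every branch, since that id is what the later history tasks need.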

Now that we have the stay id, the history processing tasks can be done in parallel, with each task being passed the stay id.

The structure of these history tasks is identical:

  1. Fetch the latest history entry associated with the stay id
  2. If it is identical to the information in the inmate entry, done
  3. Fetch the base model entry; if it does not exist, then create the entry in the base model
  4. Create a new history entry
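
Because every history task has this same shape, one generic routine could serve all of them. A minimal sketch, assuming a hypothetical `HistoryTable` class that holds one base model and its history in memory (the real models are Django ones and are not shown here):

```python
from typing import Dict, List, Optional, Tuple


class HistoryTable:
    """Hypothetical stand-in for one base model plus its history table."""

    def __init__(self):
        self.base: Dict[str, int] = {}            # value -> base entry id
        self.history: List[Tuple[int, int]] = []  # (stay id, base id), oldest first

    def latest(self, stay_id: int) -> Optional[int]:
        """Return the base id of the newest history entry for this stay."""
        for sid, base_id in reversed(self.history):
            if sid == stay_id:
                return base_id
        return None


def update_history(table: HistoryTable, stay_id: int, value: str) -> bool:
    """Apply the four steps; return True if a new history entry was created."""
    latest_base_id = table.latest(stay_id)   # step 1
    base_id = table.base.get(value)
    if latest_base_id is not None and latest_base_id == base_id:
        return False                          # step 2: nothing changed
    if base_id is None:                       # step 3: create base entry
        base_id = len(table.base) + 1
        table.base[value] = base_id
    table.history.append((stay_id, base_id))  # step 4
    return True
```

Calling `update_history` twice with the same value for the same stay creates only one history entry, which is the point of step 2.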

In this design all database I/O is done by a single task. The reason is that you only want one process working the disk head and the database cursors and buffers. All other model processing, the computational bits, is done by other tasks while the database task is blocked doing I/O. This maximizes I/O throughput. The implication of this constraint is that we either make all of the other tasks event based or we have lots of small objects that each do the next step in the process. I don't know which one will have better performance, nor do I know which approach will be easier for people to understand. Thoughts?
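
One possible shape for the single database task, sketched with the standard library's `threading` and `queue` modules. This is an assumption about how it could look, not a decision: `DatabaseTask`, `process_inmate`, and `run_demo` are hypothetical names, a dict stands in for the database, and worker threads stand in for whatever task mechanism we end up choosing.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


class DatabaseTask:
    """Owns the (fake) database; the only thread that ever touches it."""

    def __init__(self):
        self._requests = queue.Queue()
        self._rows = {}          # stand-in for the real database
        self.touched_by = set()  # records which threads ran database code
        self._thread = threading.Thread(target=self._run, name="db-task", daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            item = self._requests.get()
            if item is None:  # shutdown marker
                return
            func, reply = item
            self.touched_by.add(threading.current_thread().name)
            try:
                reply.put((True, func(self._rows)))
            except Exception as exc:
                reply.put((False, exc))

    def call(self, func):
        """Run func(rows) on the database thread and return its result."""
        reply = queue.Queue(maxsize=1)
        self._requests.put((func, reply))
        ok, value = reply.get()
        if not ok:
            raise value
        return value

    def close(self):
        self._requests.put(None)
        self._thread.join()


def process_inmate(db, entry):
    # Computational part runs in the worker thread.
    key = entry["booking_id"]
    cleaned = entry["name"].strip().upper()

    def write(rows):
        # Database part runs only on the db-task thread.
        rows[key] = cleaned
        return rows[key]

    return db.call(write)


def run_demo():
    db = DatabaseTask()
    entries = [{"booking_id": f"B{i}", "name": f" inmate {i} "} for i in range(20)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(lambda e: process_inmate(db, e), entries))
    row_count = db.call(len)
    db.close()
    return results, db.touched_by, row_count
```

Here each worker blocks while waiting for its reply, which is closer to the "lots of small steps" option than to a fully event-based one. An event-based variant would pass a callback instead of waiting on the reply queue.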

It should be easy to test this under different settings to see if the concurrent processing approach is more effective or not. In fact it is an interesting experiment.