open-austin / indigent-defense-stats

A web scraper for collecting and processing public case records from sites using Tyler Technologies' Odyssey court records database software.
MIT License

Add logic for separating old versions of records, consolidating different calendar dates. #77

Closed: zimzoom closed this issue 4 months ago

zimzoom commented 1 year ago

Sometimes court records are updated. Right now, the scraper creates a hash of each HTML file in order to detect whether the file has changed since the last time it was scraped, but it never actually checks that hash.

We want to keep old versions of the records, differentiating each version by a field called "revision id". The first version of a record has revision id = 1, and the latest version is the one with the highest revision id.
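For concreteness, here is a minimal sketch of what versioned records could look like and how the latest version would be selected; the field names (`case_number`, `html_hash`, `revision_id`) are assumptions, not the project's actual schema:

```python
# Hypothetical stored versions of one case; field names are illustrative.
versions = [
    {"case_number": "C-2023-001", "html_hash": "ab12...", "revision_id": 1},
    {"case_number": "C-2023-001", "html_hash": "cd34...", "revision_id": 2},
]

def latest_version(versions: list[dict]) -> dict:
    """Return the version with the highest revision id."""
    return max(versions, key=lambda v: v["revision_id"])
```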

Add database migration:

Add the following logic to the scraper (a sketch follows the note below):

  1. if it's a brand new case number, store it with revision id = 1
  2. if it's an old case number but a new hash, store it with the next revision id (previous + 1)
  3. if it's an old case number and the same hash, don't store the record

(Note: the above logic should also solve the problem of multiple court calendar dates pointing to the same record, which currently causes all of them to be saved. If that problem remains, create a new issue for it.)
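A minimal sketch of that three-rule decision, assuming hypothetical persistence helpers `get_versions(case_number)` and `save(record)` rather than the repo's actual storage layer:

```python
import hashlib

def upsert_case(case_number: str, html: str, get_versions, save) -> None:
    """Store a scraped record according to the three rules above.

    get_versions and save are hypothetical helpers:
    get_versions(case_number) -> list of previously stored versions,
    save(record) -> persists one version.
    """
    new_hash = hashlib.sha256(html.encode("utf-8")).hexdigest()
    versions = get_versions(case_number)

    if not versions:
        # Rule 1: brand new case number.
        save({"case_number": case_number, "html_hash": new_hash, "revision_id": 1})
        return

    latest = max(versions, key=lambda v: v["revision_id"])
    if latest["html_hash"] == new_hash:
        # Rule 3: same case, same content; nothing to store.
        return

    # Rule 2: same case, new content; bump the revision id.
    save({
        "case_number": case_number,
        "html_hash": new_hash,
        "revision_id": latest["revision_id"] + 1,
    })
```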

tpadmanabhan commented 7 months ago

Add Joshua L as an assignee

nicolassaw commented 4 months ago

Potential Steps to Include in Parsing Architecture?

nicolassaw commented 4 months ago

I recommended adding a step that hashes the underlying HTML at the parser stage to support building this architecture.

https://github.com/open-austin/indigent-defense-stats/pull/87
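For illustration only, that parse-step hash could be produced with the standard library's hashlib; attaching it under a field like `html_hash` is an assumption:

```python
import hashlib

def hash_html(html: str) -> str:
    """Produce a stable fingerprint of the raw page for change detection."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

# At the parser step, the hash could travel with the parsed output, e.g.:
# parsed_record["html_hash"] = hash_html(raw_html)
```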

nicolassaw commented 4 months ago

The checked items above have been implemented in the new branch "new-main", which is to be merged into main.

See the module called "updater", whose purpose is to evaluate the HTML hash, update the CosmosDB container, and handle versioning.
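A hedged sketch of what such an updater step might look like with the azure-cosmos Python SDK; the container layout, document fields, and id scheme here are assumptions, not the code in "new-main":

```python
from azure.cosmos import CosmosClient

# Hypothetical setup:
# client = CosmosClient(COSMOS_URL, credential=COSMOS_KEY)
# container = client.get_database_client("cases-db").get_container_client("cases")

def update_case(container, case_number: str, new_hash: str, parsed: dict) -> None:
    """Evaluate the HTML hash and write a new version only when it changed.

    Assumes documents carry case_number, html_hash, and revision_id fields
    (hypothetical names).
    """
    existing = list(container.query_items(
        query="SELECT * FROM c WHERE c.case_number = @cn",
        parameters=[{"name": "@cn", "value": case_number}],
        enable_cross_partition_query=True,
    ))
    latest = max(existing, key=lambda d: d["revision_id"], default=None)
    if latest and latest["html_hash"] == new_hash:
        return  # unchanged since the last scrape; store nothing
    revision = latest["revision_id"] + 1 if latest else 1
    container.upsert_item({
        "id": f"{case_number}-{revision}",  # illustrative id scheme
        "case_number": case_number,
        "html_hash": new_hash,
        "revision_id": revision,
        **parsed,
    })
```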

The problem of JSON and HTML files piling up in their folders from one scraping/parsing session to the next was solved by overwriting these files in the respective folder on each run.
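A small sketch of that overwrite behavior, assuming artifacts are written with pathlib (paths and names hypothetical):

```python
from pathlib import Path

def write_artifact(folder: Path, name: str, content: str) -> None:
    """Write mode truncates existing files, so each session overwrites the last."""
    folder.mkdir(parents=True, exist_ok=True)
    (folder / name).write_text(content, encoding="utf-8")
```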

The new "orchestrator" module's role is to manage running the different counties one after another.
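A minimal sketch of that orchestration pattern; the county list and the per-county `run_county` callable are hypothetical placeholders:

```python
def run_all(counties: list[str], run_county) -> None:
    """Run each county's scrape -> parse -> update pipeline sequentially.

    run_county is a hypothetical callable wrapping one county's pipeline.
    """
    for county in counties:
        try:
            run_county(county)
        except Exception as exc:
            # Keep going so one county's failure doesn't block the rest.
            print(f"{county} failed: {exc}")

# run_all(["hays", "travis"], run_county)
```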

Hash, case number, and date scraped have all been added, either in individual tickets or in the new version (typically at the parser step).

The next steps are to be determined by the production version on the VM and can go into a new ticket (the same goes for the unchecked visualization item), but otherwise the architecture is set and this ticket can be closed.