open-austin / indigent-defense-stats

A web scraper for collecting and processing public case records from sites using Tyler Technologies' Odyssey court records database software.
MIT License

Add logic for separating old versions of records, consolidating different calendar dates. #77

Closed: zimzoom closed this issue 4 months ago

zimzoom commented 1 year ago

Sometimes court records are updated. Right now, the scraper creates a hash of each HTML file in order to detect whether the file has changed since the last time it was scraped, but it never actually checks that hash.

We want to keep old versions of the records, differentiating each version by a field called "revision id". The first version of a record has revision id = 1, and the latest version is the one with the highest revision id.
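For concreteness, here is a minimal sketch of what versioned records could look like and how the latest version would be selected; the field names (`case_number`, `html_hash`, `revision_id`) are assumptions, not the project's actual schema:

```python
# Hypothetical stored versions of one case; field names are illustrative.
versions = [
    {"case_number": "C-2023-001", "html_hash": "ab12...", "revision_id": 1},
    {"case_number": "C-2023-001", "html_hash": "cd34...", "revision_id": 2},
]

def latest_version(versions: list[dict]) -> dict:
    """Return the version with the highest revision id."""
    return max(versions, key=lambda v: v["revision_id"])
```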

Add database migration:

Add the following logic to the scraper (a sketch follows the note below):

  1. if it's a brand new case number, store it with revision id = 1
  2. if it's an old case number but a new hash, store it with the next revision id (previous + 1)
  3. if it's an old case number and the same hash, don't store the record

(Note: the above logic should also solve the problem of multiple court calendar dates pointing to the same record, which currently causes all of them to be saved. If that problem remains, create a new issue for it.)
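A minimal sketch of that three-rule decision, assuming hypothetical persistence helpers `get_versions(case_number)` and `save(record)` rather than the repo's actual storage layer:

```python
import hashlib

def upsert_case(case_number: str, html: str, get_versions, save) -> None:
    """Store a scraped record according to the three rules above.

    get_versions and save are hypothetical helpers:
    get_versions(case_number) -> list of previously stored versions,
    save(record) -> persists one version.
    """
    new_hash = hashlib.sha256(html.encode("utf-8")).hexdigest()
    versions = get_versions(case_number)

    if not versions:
        # Rule 1: brand new case number.
        save({"case_number": case_number, "html_hash": new_hash, "revision_id": 1})
        return

    latest = max(versions, key=lambda v: v["revision_id"])
    if latest["html_hash"] == new_hash:
        # Rule 3: same case, same content; nothing to store.
        return

    # Rule 2: same case, new content; bump the revision id.
    save({
        "case_number": case_number,
        "html_hash": new_hash,
        "revision_id": latest["revision_id"] + 1,
    })
```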

tpadmanabhan commented 7 months ago

Add Joshua L as an assignee

nicolassaw commented 4 months ago

Potential Steps to Include in Parsing Architecture?

nicolassaw commented 4 months ago

I recommended adding a step that hashes the underlying HTML at the parser stage to support building this architecture.

https://github.com/open-austin/indigent-defense-stats/pull/87
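For illustration only, that parse-step hash could be produced with the standard library's hashlib; attaching it under a field like `html_hash` is an assumption:

```python
import hashlib

def hash_html(html: str) -> str:
    """Produce a stable fingerprint of the raw page for change detection."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

# At the parser step, the hash could travel with the parsed output, e.g.:
# parsed_record["html_hash"] = hash_html(raw_html)
```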

nicolassaw commented 4 months ago

The checked items above have been implemented in the new branch "new-main", which is to be merged into main.

See the module called "updater", whose purpose is to evaluate the HTML hash, update the CosmosDB container, and handle versioning.
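A hedged sketch of what such an updater step might look like with the azure-cosmos Python SDK; the container layout, document fields, and id scheme here are assumptions, not the code in "new-main":

```python
from azure.cosmos import CosmosClient

# Hypothetical setup:
# client = CosmosClient(COSMOS_URL, credential=COSMOS_KEY)
# container = client.get_database_client("cases-db").get_container_client("cases")

def update_case(container, case_number: str, new_hash: str, parsed: dict) -> None:
    """Evaluate the HTML hash and write a new version only when it changed.

    Assumes documents carry case_number, html_hash, and revision_id fields
    (hypothetical names).
    """
    existing = list(container.query_items(
        query="SELECT * FROM c WHERE c.case_number = @cn",
        parameters=[{"name": "@cn", "value": case_number}],
        enable_cross_partition_query=True,
    ))
    latest = max(existing, key=lambda d: d["revision_id"], default=None)
    if latest and latest["html_hash"] == new_hash:
        return  # unchanged since the last scrape; store nothing
    revision = latest["revision_id"] + 1 if latest else 1
    container.upsert_item({
        "id": f"{case_number}-{revision}",  # illustrative id scheme
        "case_number": case_number,
        "html_hash": new_hash,
        "revision_id": revision,
        **parsed,
    })
```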

The problem of JSON and HTML files piling up in their folders from one scraping/parsing session to the next was solved by overwriting these files in the respective folder on each run.
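A small sketch of that overwrite behavior, assuming artifacts are written with pathlib (paths and names hypothetical):

```python
from pathlib import Path

def write_artifact(folder: Path, name: str, content: str) -> None:
    """Write mode truncates existing files, so each session overwrites the last."""
    folder.mkdir(parents=True, exist_ok=True)
    (folder / name).write_text(content, encoding="utf-8")
```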

The new "orchestrator" module's role is to manage running the different counties one after another.
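A minimal sketch of that orchestration pattern; the county list and the per-county `run_county` callable are hypothetical placeholders:

```python
def run_all(counties: list[str], run_county) -> None:
    """Run each county's scrape -> parse -> update pipeline sequentially.

    run_county is a hypothetical callable wrapping one county's pipeline.
    """
    for county in counties:
        try:
            run_county(county)
        except Exception as exc:
            # Keep going so one county's failure doesn't block the rest.
            print(f"{county} failed: {exc}")

# run_all(["hays", "travis"], run_county)
```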

Hash, case number, and date scraped have all been added, either in individual tickets or in the new version (typically at the parser step).

The next steps are to be determined by the production version on the VM and can go into a new ticket (the same goes for the unchecked visualization item), but otherwise the architecture is set and this ticket can be closed.