zimzoom closed this issue 4 months ago
I recommended adding a step that hashes the underlying HTML at the parser step to support building out this architecture.
https://github.com/open-austin/indigent-defense-stats/pull/87
The checked boxes above have been implemented in the new branch "new-main", which is to be merged into main.
See the module called "updater" whose purpose is to evaluate the HTML hash, update the CosmosDB container, and handle versioning.
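The hash evaluation the updater performs can be sketched roughly as follows. This is a minimal illustration, not the actual module: the function names and the plain dict standing in for the CosmosDB container are assumptions for the example.

```python
import hashlib

def html_hash(html: str) -> str:
    """Hash the raw HTML so a changed record can be detected later."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def needs_update(container: dict, case_number: str, html: str) -> bool:
    """Return True when the case is new or its stored hash differs.

    `container` is an in-memory dict standing in for the CosmosDB
    container, keyed by case number (hypothetical shape for this sketch).
    """
    stored = container.get(case_number)
    return stored is None or stored["html_hash"] != html_hash(html)
```

The updater would then only write to the container (and bump the version) when `needs_update` returns True, skipping unchanged records.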
The problem of JSON and HTML files accumulating in the folder from one scraping/parsing session to the next was solved by overwriting those files in their respective folders on each run.
The new orchestrator module manages running the different counties one after another.
Hash, case number, and date scraped have all been added, either in individual tickets or in the new version (typically at the parser step).
The next steps depend on the production version running on the VM and can go into a new ticket (as can the unchecked visualization issue); otherwise, the architecture is set and this ticket can be closed.
Sometimes court records are updated. Right now, the scraper creates a hash of the HTML file in order to detect whether the file has changed since the last time it was scraped, but it doesn't actually check the hash.
We want to keep old versions of the records, differentiating each version by a field called "revision id". The latest version of a record is then the one with the highest "revision id"; the first version of a record has revision id = 1.
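Under that scheme, selecting the current version of a record is just a max over the revision ids. A minimal sketch (the field name `revision_id` and the list-of-dicts shape are assumptions for illustration):

```python
def latest_revision(records: list[dict]) -> dict:
    """Pick the newest version of a record: the one with the
    highest revision_id. The first version has revision_id == 1."""
    return max(records, key=lambda r: r["revision_id"])
```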
Add database migration:
Add the following logic to the scraper:
(Note: the above logic should also solve the problem of multiple court calendar dates pointing to the same record, which currently causes all of them to be saved. If that problem remains, create a new issue for it.)
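Putting the described behavior together, the scraper-side check might look like the sketch below: hash the HTML, compare against the latest stored revision, and append a new revision only on change. This is one possible shape under stated assumptions (an in-memory dict in place of the database, and hypothetical field names), not the project's actual implementation.

```python
import hashlib

def save_if_changed(container: dict, case_number: str, html: str, record: dict) -> bool:
    """Append a new revision only when the HTML hash differs from the
    latest stored one. `container` maps case_number -> list of revisions
    (a stand-in for the real database). Returns True if a revision was written.
    """
    new_hash = hashlib.sha256(html.encode("utf-8")).hexdigest()
    revisions = container.setdefault(case_number, [])
    if revisions and revisions[-1]["html_hash"] == new_hash:
        # Unchanged since the last scrape; this also dedupes multiple
        # calendar dates that point at the same underlying record.
        return False
    next_id = revisions[-1]["revision_id"] + 1 if revisions else 1
    revisions.append({**record, "html_hash": new_hash, "revision_id": next_id})
    return True
```

Repeated scrapes of identical HTML become no-ops, while a changed record gets the next revision id rather than overwriting history.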