harvard-lil / h2o

H2O is a web app for creating and reading open educational resources, primarily in the legal field
https://opencasebook.org
GNU Affero General Public License v3.0

Correct case ingestion to never pull an existing case from upstream #2006

Closed lizadaly closed 1 year ago

lizadaly commented 1 year ago

I made some erroneous assumptions about how the case ingestion code worked before some recent updates, and introduced a problem where H2O now pulls new copies of cases it already has as existing LegalDocuments. When I went to patch the logic, I realized that the status quo had been that cases were always checked in CAP but rarely pulled, because the date field being checked was effective_date, which maps to decision_date, a value I suspect is meant to be immutable. (It's possible this was added to catch cases where the decision date was corrected, but the code wasn't commented either way and there were no unit tests that touched that logic.)

Our conclusion is:

So this PR removes any check against the upstream provider when the legal document already exists; in that case, it uses our most recent version.
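As a rough illustration of the behavior described above (not the actual H2O code), here is a minimal sketch. `LegalDoc`, `get_or_ingest`, the `store` dict, and `fetch_upstream` are all hypothetical names standing in for the real model and ingestion path.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class LegalDoc:
    # Hypothetical stand-in for H2O's LegalDocument model.
    source_ref: str
    version: int
    content: str


def get_or_ingest(
    source_ref: str,
    store: Dict[str, List[LegalDoc]],
    fetch_upstream: Callable[[str], LegalDoc],
) -> LegalDoc:
    """Return the newest local copy if one exists; otherwise pull once from upstream."""
    existing = store.get(source_ref)
    if existing:
        # Already ingested: do not contact the upstream provider at all.
        return max(existing, key=lambda doc: doc.version)
    # First time we see this case: pull it and remember it.
    doc = fetch_upstream(source_ref)
    store.setdefault(source_ref, []).append(doc)
    return doc
```

The point of the sketch is only that the upstream call sits entirely behind the "do we already have it" branch, so an existing case can never trigger a fresh pull.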

I'm open to restoring the check on effective_date, but I'd feel better about doing that if we could document why it might exist, and if we expect those dates to actually be changed upstream.

lizadaly commented 1 year ago

Will want to test a bit on staging before deploying.

jcushman commented 1 year ago

I haven't looked closely at any of this, but weighing in anyway. :) Confirming that decision_date in the CAP API is the date a case was decided, not an update date, and wouldn't make sense as a date check.

The way I'd recommend a downstream database pull updates from CAP is to use last_updated, much as browsers use a cache request header: when you run the update script, you can query for something like cases?id__in=1,2,3&last_updated__gte=2023-04-01 to get the cases on your list that have changed at all in the last month. (Have not checked API syntax, but something like that.) Then you take that list of potentially changed cases, check each one for whether the fields you care about have actually changed from your locally stored version, and if so do whatever it is you do with the updated data.
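A minimal sketch of that two-step approach, for discussion only. The parameter names `id__in` and `last_updated__gte` are taken from the comment above and are unverified against the real CAP API; `build_update_query` and `changed_fields` are hypothetical helpers, and no network request is made here.

```python
from datetime import date
from typing import Any, Dict, Iterable, List
from urllib.parse import urlencode


def build_update_query(case_ids: Iterable[int], since: date) -> str:
    """Build the 'what changed since' query string (unverified API syntax)."""
    params = {
        "id__in": ",".join(str(i) for i in case_ids),
        "last_updated__gte": since.isoformat(),
    }
    # safe="," keeps the commas in the id list readable instead of %2C.
    return "cases?" + urlencode(params, safe=",")


def changed_fields(
    local: Dict[str, Any], remote: Dict[str, Any], fields: Iterable[str]
) -> List[str]:
    """Return the fields we care about whose values differ between the two copies."""
    return [f for f in fields if local.get(f) != remote.get(f)]
```

For example, `build_update_query([1, 2, 3], date(2023, 4, 1))` yields `cases?id__in=1,2,3&last_updated__gte=2023-04-01`. A case returned by that query is only a candidate; `changed_fields` decides whether anything we store actually differs.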

lizadaly commented 1 year ago

Thanks, that's helpful. The previously used code had checks like:

most_recent_doc.effective_date <= most_recent_saved_doc.effective_date

https://github.com/harvard-lil/h2o/blob/develop/web/main/views.py#L2680-L2683

where effective_date for CAP content is definitely decision_date, so it seems there was a time when effective_date was a comparable timestamp, but it hasn't been for a while.
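To make the problem concrete, here is a small sketch with a hypothetical `Doc` class and made-up dates. It only shows that the comparison gives the same answer whether or not the upstream content changed, as long as the decision date stays fixed; it does not reproduce the surrounding view logic.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    # Hypothetical stand-in; for CAP content effective_date holds decision_date.
    effective_date: date
    content: str


def date_check(most_recent_doc: Doc, most_recent_saved_doc: Doc) -> bool:
    # Same comparison as the quoted line above.
    return most_recent_doc.effective_date <= most_recent_saved_doc.effective_date


saved = Doc(date(1990, 5, 1), "original text")
upstream_unchanged = Doc(date(1990, 5, 1), "original text")
upstream_corrected = Doc(date(1990, 5, 1), "corrected text")

# The check cannot tell these two upstream states apart.
same_result = date_check(upstream_unchanged, saved) == date_check(upstream_corrected, saved)
```

Here `same_result` is `True`, which is why a field like last_updated is the better signal for detecting upstream changes.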