add first draft of CL search

teovin commented 3 months ago

This is a WIP PR for the CL case XML -> HTML conversion integration.

Things that were done:

Activated the CourtListener legal document source in Django admin.
Added logic to grab the xml_harvard field from the opinions endpoint if the cluster has a filepath_json_harvard; otherwise the html field is used, with plain_text as the last resort.
Enabled advanced search for CourtListener source by adding additional params logic to the source calls.
Added Jack's XML-to-HTML conversion script.
Ran the script against 4000 clusters/cases grabbed from CL API (2000 with source U, 2000 with source CU). Source descriptions here.
Made a change to the conversion script to handle cases where elements might be missing type and id attributes in the source xml.
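
For illustration, a minimal sketch of tolerating missing `type` and `id` attributes when reading the source XML. This is not the actual conversion script: the function name and the choice to map `type` to an HTML `class` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def element_to_html_attrs(el: ET.Element) -> dict:
    """Build HTML attributes for a source element, tolerating missing type/id.

    Hypothetical helper: mapping `type` to `class` is an assumption,
    not necessarily what the real conversion script does.
    """
    attrs = {}
    el_type = el.get("type")  # Element.get returns None when the attribute is absent
    el_id = el.get("id")
    if el_type:
        attrs["class"] = el_type
    if el_id:
        attrs["id"] = el_id
    return attrs
```

Using `el.get(...)` instead of `el.attrib[...]` avoids a `KeyError` on elements that lack either attribute.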

A few bug fixes were made:

Handled cases where clusters have no citations. This was causing the search to error.
Updated the format in which citations are added to the legal doc. Previously they were added as JSON rather than as a list, which didn't match how we display citations for other legal doc sources. Sample diff:

Updated the effectiveDate handling to prevent errors thrown when the CL API returns a date string longer than 25 characters. I saw this for some clusters whose time offset includes seconds.
Updated the search result ids to use the cluster_ids. Because the ids and the cluster_ids do not match in the search endpoint results, the subsequent clusters endpoint call by id was erroring out.
Updated the opinions endpoint calls to use opinion ids (grabbed from cluster endpoint response sub_opinions field) since we need to look at all sub_opinions to construct the Harvard xml. Previously the search result id was being used.
I saw some cases where there was no xml_harvard data and no html, so I defaulted to using the plain_text field of the opinion.
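
A rough sketch of two of the fixes above (citations as a list, and the 25-character date limit). The function names are hypothetical, the `volume`/`reporter`/`page` keys are an assumption about the shape of a citation object, and truncation is only one possible way to keep the date within the limit; the PR may handle it differently.

```python
from typing import List, Optional

MAX_DATE_LEN = 25  # limit mentioned in the description above


def format_citations(cluster: dict) -> List[str]:
    """Return citations as a list of strings; empty list when there are none.

    Assumes each citation is a dict with volume/reporter/page keys.
    """
    citations = cluster.get("citations") or []  # covers a missing key and None
    return [
        f"{c.get('volume')} {c.get('reporter')} {c.get('page')}"
        for c in citations
    ]


def clamp_effective_date(value: Optional[str]) -> Optional[str]:
    """Trim a date string to MAX_DATE_LEN characters, dropping offset seconds."""
    if not value:
        return value
    return value[:MAX_DATE_LEN]
```

For a timestamp whose offset includes seconds, such as `1899-01-01T00:00:00-07:52:58`, trimming to 25 characters leaves `1899-01-01T00:00:00-07:52`.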

Things to consider:

I mapped the cluster call response json to the metadata field like we do for other search sources. One thing I see CAP does is add the footnote regexes to metadata. Is this needed for CL?
I noticed we are hiding some elements like .parties, .decisiondate and .docketnumber in case-text class. What's the reasoning behind this?
I saw that some opinions don't have either of the content fields (xml_harvard, plain_text). Think about what to do in this case. Can we fall back on other html fields? Or disable importing for those documents?
Any edge cases that I should consider? Do some more testing around those.
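
To make the missing-content question concrete, here is a minimal sketch of the field selection with an explicit "nothing usable" result, which a caller could use to disable importing. The function name and return shape are hypothetical; only the field names come from this PR.

```python
from typing import Optional, Tuple

# Preference order described in this PR; more fields could be appended here.
CONTENT_FIELDS = ("xml_harvard", "html", "plain_text")


def pick_content(opinion: dict, has_harvard_file: bool) -> Tuple[Optional[str], Optional[str]]:
    """Return (field_name, content) for the first usable field, or (None, None)."""
    for field in CONTENT_FIELDS:
        # Only trust xml_harvard when the cluster has a filepath_json_harvard.
        if field == "xml_harvard" and not has_harvard_file:
            continue
        value = opinion.get(field)
        if value and value.strip():
            return field, value
    # No usable content: the caller decides whether to skip or block the import.
    return None, None
```

Returning the field name alongside the content also makes it easy to see which fallback was hit while testing edge cases.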

Sample converted legal doc (chopped):

This is what it would look like if the elements I mentioned above weren't set to `display: none`.

Here is a case that both CAP and CL return, and how it looks when imported from each (both chopped):

CAP (with display: none removed from .case-text .syllabus):

CourtListener (with display: none removed from elements in headmatter):

teovin commented 2 months ago

I just did a quick pass for code style, and LGTM! Left two tiny suggestions 🙂

I also took the liberty of adding Jack as a reviewer, who I expect might be more equipped than me to address your more detailed questions 🙂

Thank you Becky, I addressed your suggestions in my last commit. And I will work on any changes that Jack might suggest, especially those around the questions I had as you mentioned.

jcushman commented 2 months ago

> I noticed we are hiding some elements like .parties, .decisiondate and .docketnumber in case-text class. What's the reasoning behind this?

We want to render the top part of the head matter ourselves, rather than use the info printed in the book -- that lets us provide more consistent formatting between cases published in different books. Check out cap_header.html for where that's done. My guess is you have to adapt that business logic to also work with CL.

So some fields are hidden because the custom header makes them redundant. I wasn't part of this, but I'm guessing we're hiding other fields like syllabus and parties simply for user preference. As long as we're rendering the same as cases fetched from the CAP API, let's not revisit that decision for now.

jcushman commented 2 months ago

I haven't checked whether you're doing this yet -- I think we'll want to record which CourtListener field was used to populate the case. For example, I'm pretty sure that if we do need footnote_regexes, we only need them when xml_harvard was the source.
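
Something along these lines might work, as a sketch only. The `content_source_field` key and the function name are made up for illustration, and the footnote regexes are passed in rather than defined here.

```python
from typing import List, Optional


def build_metadata(cluster: dict, source_field: Optional[str], footnote_regexes: List[str]) -> dict:
    """Copy the cluster json into metadata and record which field supplied the content."""
    metadata = dict(cluster)  # shallow copy so the API response is not mutated
    metadata["content_source_field"] = source_field  # hypothetical key name
    # Footnote regexes are only relevant when the content came from xml_harvard.
    if source_field == "xml_harvard":
        metadata["footnote_regexes"] = footnote_regexes
    return metadata
```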

jcushman commented 2 months ago

This looks great -- I think with updates it'll be good to test on stage.

jcushman commented 2 months ago

... but we might want a feature flag since xml conversion isn't ready yet.
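
A minimal sketch of what such a flag could look like, assuming an environment variable. The name `CL_XML_CONVERSION_ENABLED` and both helpers are hypothetical; the project may prefer a Django setting or an existing flag mechanism instead.

```python
import os
from typing import Mapping, Optional, Tuple

FLAG_NAME = "CL_XML_CONVERSION_ENABLED"  # hypothetical flag name


def xml_conversion_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the flag is set to a truthy value (off by default)."""
    env = os.environ if env is None else env
    return env.get(FLAG_NAME, "").strip().lower() in {"1", "true", "yes"}


def candidate_fields(env: Optional[Mapping[str, str]] = None) -> Tuple[str, ...]:
    """Skip xml_harvard entirely while the flag is off."""
    if xml_conversion_enabled(env):
        return ("xml_harvard", "html", "plain_text")
    return ("html", "plain_text")
```

With the flag off, imports would fall back to html and plain_text only, so stage can be tested before the conversion is ready.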

teovin commented 2 months ago

> We want to render the top part of the head matter ourselves, rather than use the info printed in the book -- that lets us provide more consistent formatting between cases published in different books. Check out cap_header.html for where that's done. My guess is you have to adapt that business logic to also work with CL.
>
> So some fields are hidden because the custom header makes them redundant. I wasn't part of this, but I'm guessing we're hiding other fields like syllabus and parties simply for user preference. As long as we're rendering the same as cases fetched from the CAP API, let's not revisit that decision for now.

I added a template for CourtListener, modeled on cap_header.html. One change I made to both was to remove the div with legal_doc.get_title, since the get_title method didn't exist and so the div wasn't rendering anything.

harvard-lil / h2o