johnhawkinson commented 6 years ago

There should be a mechanism to return the CL ~document~ docket URL so that, e.g., @Big_Cases can put it in the DocumentCloud metadata, giving each document a link to the entire docket.

jabagawee commented 6 years ago

Obtaining the docket URL

When uploading a document, there are three possible cases:

1. need_fake_case == True and need_fake_entry == True,
2. need_fake_case == False and need_fake_entry == True, or
3. need_fake_case == False and need_fake_entry == False.

Note that the code does not allow for need_fake_case to be True and need_fake_entry to be False, which makes sense because a missing case implies that there is no place for a docket entry to exist and we will need to make a fake one.
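To make the three cases concrete, here is a minimal sketch of how the two flags map onto them. The enum, its member names, and `classify_upload` are hypothetical names made up for illustration; they are not part of recapupload.

```python
from enum import Enum


class UploadCase(Enum):
    """Hypothetical labels for the three cases listed above."""
    NEW_DOCKET = 1    # case 1: the docket is missing, so the entry is too
    NEW_ENTRY = 2     # case 2: the docket exists, the entry does not
    NEW_PDF_ONLY = 3  # case 3: docket and entry both exist already


def classify_upload(need_fake_case: bool, need_fake_entry: bool) -> UploadCase:
    """Map the two flags onto one of the three valid cases."""
    if need_fake_case and not need_fake_entry:
        # A missing case cannot already contain a docket entry.
        raise ValueError("need_fake_case=True requires need_fake_entry=True")
    if need_fake_case:
        return UploadCase.NEW_DOCKET
    if need_fake_entry:
        return UploadCase.NEW_ENTRY
    return UploadCase.NEW_PDF_ONLY
```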

In case 3, we have a docket id. If we were willing to make another round-trip call to the CourtListener API, we could call /dockets/{id} and parse out the absolute_url field to return the docket URL.
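As a rough sketch of that last step (not the actual recapupload code): once the JSON from /dockets/{id} is in hand, turning absolute_url into a full link could look like the following. The function name, the `CL_BASE` constant, and the assumption that absolute_url may be a site-relative path are mine; the HTTP request itself is left out.

```python
from typing import Optional
from urllib.parse import urljoin

# Assumed site root; only used if absolute_url turns out to be a relative path.
CL_BASE = "https://www.courtlistener.com"


def docket_url_from_json(docket: dict, base: str = CL_BASE) -> Optional[str]:
    """Return the full docket URL from a decoded docket object, or None."""
    path = docket.get("absolute_url")
    if not path:
        return None
    # urljoin leaves an already-absolute URL untouched and prefixes a
    # site-relative path with the base.
    return urljoin(base, path)
```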

For the other two cases, we don't have the docket id. We would need to perform a blocking wait until the fake entry is successfully created. This would mean reading the Processing Queue object returned from the docket upload and then polling it until the docket field has the docket id filled out, which we can then use to call the /dockets/{id} endpoint and continue as above.
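A minimal sketch of what that polling could look like, assuming the caller supplies a function that re-fetches the Processing Queue object as a dict. `wait_for_docket_id`, its parameters, and the default timeout and interval are all placeholders, not anything CourtListener or recapupload defines.

```python
import time
from typing import Any, Callable, Optional


def wait_for_docket_id(
    fetch_queue_item: Callable[[], dict],
    timeout: float = 10.0,   # placeholder value, in seconds
    interval: float = 0.5,   # placeholder value, in seconds
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[Any]:
    """Poll until the queue item's docket field is filled, or give up.

    Returns the value of the docket field, or None if it is still empty
    when the timeout expires.
    """
    deadline = clock() + timeout
    while True:
        item = fetch_queue_item()
        docket = item.get("docket")
        if docket is not None:
            return docket
        if clock() + interval > deadline:
            # Another sleep would overshoot the deadline, so stop here.
            return None
        sleep(interval)
```

Passing `sleep` and `clock` in as parameters is only there so the loop can be exercised without real waiting; a caller would normally leave them at their defaults.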

Obtaining the document URL

If we wanted to obtain the document URL, we would have to poll the Processing Queue object as above after the PDF upload and then query the API after we get the recap_document field filled out. I'm not sure if we actually want this, because https://github.com/bdheath/Big-Cases/issues/4 seems to ask only for the docket URL.

Implementation

We're lucky that the return type of RecapUpload.__init__() is NoneType right now, which means we can make changes without affecting any callers, but as we saw above, adding docket/document URLs to the output would incur some pretty heavy time costs to running the method (waiting for CourtListener and RECAP to finish processing).

I recommend adding a separate method to grab the URLs that would do the blocking calls to CourtListener's API. That way, only callers that want the URLs would pay the extra cost of that wait, without polluting the main entrypoint method.

Thoughts? I'm happy to write that out to see what it would look like.

mlissner commented 6 years ago

So far, the performance of these APIs is usually within about a second or two. Maybe blocking isn't so bad. What if we gave it 10s, say, and then just gave up if the processing wasn't done by then?

johnhawkinson commented 6 years ago

> In case 3, we have a docket id. If we were willing to make another round-trip call to the CourtListener API, we could call /dockets/{id} and parse out the absolute_url field to return the docket URL.

That's not the only solution. There's a good argument for a CL endpoint that returns the docket page directly to the user given a court and pacer_case_id. This was the IA model and it was very powerful and the lack of it is a continued and frequent source of frustration for me as a user.

That is, given a filename of gov.uscourts.cand.308136.1746.0.pdf which identifies court and pacer_case_id, it was easy to construct the docket page URL: http://www.archive.org/download/gov.uscourts.cand.308136.

(Or depending on your preference, http://archive.org/download/gov.uscourts.cand.308136/gov.uscourts.cand.308136.docket.html).
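To illustrate how mechanical that construction is, here is a small sketch that pulls the court and pacer_case_id out of such a filename and builds the two IA URLs quoted above. The function names and the regular expression are my own, and the code does not check whether those URLs still resolve.

```python
import re
from typing import Tuple

# Matches the leading "gov.uscourts.<court>.<pacer_case_id>." part of a
# RECAP-style filename; whatever follows is ignored.
_PREFIX = re.compile(r"^gov\.uscourts\.(?P<court>[^.]+)\.(?P<pacer_case_id>\d+)\.")


def parse_court_and_case(filename: str) -> Tuple[str, str]:
    """Return (court, pacer_case_id) from a gov.uscourts.* filename."""
    match = _PREFIX.match(filename)
    if match is None:
        raise ValueError(f"not a gov.uscourts filename: {filename!r}")
    return match.group("court"), match.group("pacer_case_id")


def ia_docket_urls(filename: str) -> Tuple[str, str]:
    """Build the IA item URL and the docket.html URL for a filename."""
    court, pacer_case_id = parse_court_and_case(filename)
    item = f"gov.uscourts.{court}.{pacer_case_id}"
    return (
        f"http://www.archive.org/download/{item}",
        f"http://archive.org/download/{item}/{item}.docket.html",
    )
```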

CL and recap-chrome make this hard because:

The RECAP extension's default filename is now "Lawyer-style" which throws away the pacer_case_id in favor of the docket number (and transforms court): N.D.Cal._3‑17‑cv‑00939_1746_0.pdf (This isn't an issue for recapupload).
The CL docket URL is something like https://www.courtlistener.com/docket/4609586/waymo-llc-v-uber-technologies-inc/?filed_after=&filed_before=&entry_gte=1746&order_by=asc which has keys we don't know (also there is a pagination problem relating to server architecture and caching/pregenerated file choices).

But given that the PDF is available at https://www.courtlistener.com/recap/gov.uscourts.cand.308136.1746.0.pdf the docket should also be available at https://www.courtlistener.com/recap/gov.uscourts.cand.308136.

I thought I had a CL Issue open for this, but I can't seem to find it. (I touch on the "url hacking" issue briefly in https://github.com/freelawproject/recap/issues/61#issuecomment-350947254 and again in https://github.com/freelawproject/courtlistener/issues/783#issuecomment-354473385, and I come close in https://github.com/freelawproject/recap/issues/195.) Somebody stop me before I file a new issue...

> For the other two cases, we don't have the docket id.

Case 2 is the common case — only the entry is new. Case 1 (a new docket) and case 3 (new PDF only) are pretty rare.

In the Big Cases context, failing to link for cases 1 and 3 would probably be fine (this is not super-helpful).

> We would need to perform a blocking wait until the fake entry is successfully created. This would mean reading the Processing Queue object returned from the docket upload and then polling it [...] I'm not sure if we actually want this, because bdheath/Big-Cases#4 seems to ask only for the docket URL.

We don't, sorry.

The title of this issue is correct, and I misspoke in the text of the first comment (see edit). Links to CL document pages are pretty uninteresting, and arguably counterproductive. The link to the CL docket page is valuable and desirable.

> Implementation
>
> We're lucky that the return type of RecapUpload.__init__() is NoneType right now, which means we can make changes without affecting any callers,

There is only one caller, so changing the API is just fine. We don't have a Makefile problem.

> I recommend adding a separate method to grab the URLs that would do the blocking calls to CourtListener's API. That way, only callers that want the URLs would pay the extra cost of that wait, without polluting the main entrypoint method.
>
> Thoughts? I'm happy to write that out to see what it would look like.

There's no compelling use case for this right now. You can file a separate issue for it and make the case for it, but it doesn't seem worth much effort and definitely not worth blocking the Big_Cases bot for thousands of milliseconds (also known as "seconds").

Let's keep this issue focused on returning docket URLs, not document URLs.

jabagawee commented 6 years ago

> That's not the only solution. There's a good argument for a CL endpoint that returns the docket page directly to the user given a court and pacer_case_id. This was the IA model and it was very powerful and the lack of it is a continued and frequent source of frustration for me as a user.

Isn't there such an endpoint that we are calling right now? Lines 147-154 seem to call a CourtListener endpoint with a court and pacer_case_id and retrieve a docket object, which contains the absolute URL. If you mean that CL should make docket URLs more guessable, that sounds like a much larger feature request that might take some time from them.

Since Case 2 is the common case, this means that we should have access to the docket object at line 154 (the same HTTP request that I linked earlier, which should be a successful call). This is really good news! From there, it seems simple to grab the absolute_url from that response and return it. I'll have a proposed solution soon.

johnhawkinson commented 6 years ago

> Isn't there such an endpoint that we are calling right now?

As I said, "returns the docket page directly to the user". The /dockets API endpoint returns structured information that contains a URL; it does not return a 302 redirect or HTML.

> If you mean that CL should make docket URLs more guessable, that sounds like a much larger feature request that might take some time from them.

Also true. Hence the comment that I thought I had a CL Issue about this.

> From there, it seems simple to grab the absolute_url from that response and return it.

Yeah. Like I tried to suggest in https://github.com/bdheath/Big-Cases/issues/4#issuecomment-361733180, the only reason this hasn't been done is that I'm not sure how Brad feels about it, and there's not much point in doing it if he doesn't want to use it. Also, thinking about how to structure the API is more work than implementing it.

johnhawkinson commented 6 years ago

> I thought I had a CL Issue open for this, but I can't seem to find it. (I touch on the "url hacking" issue briefly in https://github.com/freelawproject/recap/issues/61#issuecomment-350947254 and again in https://github.com/freelawproject/courtlistener/issues/783#issuecomment-354473385, and I come close in https://github.com/freelawproject/recap/issues/195.) Somebody stop me before I file a new issue...

Ah, right! It's https://github.com/freelawproject/courtlistener/issues/771 with some added background in https://github.com/freelawproject/recap/issues/194.

johnhawkinson / recapupload

Return CL docket URL #2
