Open johnhawkinson opened 6 years ago
When uploading a document, there are three possible cases:
1. need_fake_case == True and need_fake_entry == True,
2. need_fake_case == False and need_fake_entry == True, or
3. need_fake_case == False and need_fake_entry == False.
Note that the code does not allow for need_fake_case to be True and need_fake_entry to be False, which makes sense because a missing case implies that there is no place for a docket entry to exist and we will need to make a fake one.
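To make the numbering concrete, here is a minimal sketch of the case logic. The function name classify_upload is hypothetical and is not part of recapupload; only the two flag names come from the discussion above.

```python
def classify_upload(need_fake_case: bool, need_fake_entry: bool) -> int:
    """Map the two flags onto the case numbers 1, 2, 3 used in this thread.

    Hypothetical helper for illustration only.
    """
    # A missing case with an existing entry is the combination the code rules out.
    if need_fake_case and not need_fake_entry:
        raise ValueError("a missing case implies a missing docket entry")
    if need_fake_case:
        return 1  # new docket and new entry
    if need_fake_entry:
        return 2  # existing docket, new entry
    return 3      # existing docket and entry, new PDF only
```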
In case 3, we have a docket id. If we were willing to make another round-trip call to the CourtListener API, we could call /dockets/{id} and parse out the absolute_url field to return the docket URL.
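A rough sketch of that lookup, with the HTTP call itself left to the caller. The /api/rest/v3/ prefix and the assumption that absolute_url is a site-relative path are mine and should be checked against the CourtListener API docs; both function names are hypothetical.

```python
from urllib.parse import urljoin

CL_BASE = "https://www.courtlistener.com"  # assumed site root


def docket_api_url(docket_id: int) -> str:
    """Build the API URL for one docket (API version prefix is an assumption)."""
    return f"{CL_BASE}/api/rest/v3/dockets/{docket_id}/"


def docket_page_url(docket_json: dict) -> str:
    """Turn the absolute_url field of a docket object into a full page URL.

    urljoin also copes with the field already being a full URL.
    """
    return urljoin(CL_BASE, docket_json["absolute_url"])
```

The caller would GET docket_api_url(docket_id) with its usual credentials, decode the JSON body, and pass it to docket_page_url.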
For the other two cases, we don't have the docket id. We would need to perform a blocking wait until the fake entry is successfully created. This would mean reading the Processing Queue object returned from the docket upload and then polling it until the docket field has the docket id filled out, which we can then use to call the /dockets/{id} endpoint and continue as above.
If we wanted to obtain the document URL, we would have to poll the Processing Queue object as above after the PDF upload and then query the API after we get the recap_document field filled out. I'm not sure if we actually want this, because https://github.com/bdheath/Big-Cases/issues/4 seems to ask only for the docket URL.
We're lucky that the return type of RecapUpload.__init__() is NoneType right now, which means we can make changes without affecting any callers, but as we saw above, adding docket/document URLs to the output would incur some pretty heavy time costs to running the method (waiting for CourtListener and RECAP to finish processing).
I recommend adding a separate method to grab the URLs that would do the blocking calls to CourtListener's API. That way, only callers that want the URLs would pay the extra cost of that wait, without slowing down the main entrypoint method.
Thoughts? I'm happy to write that out to see what it would look like.
So far, the performance of these APIs is usually within about a second or two. Maybe blocking isn't so bad. What if we gave it 10s, say, and then just gave up if the processing wasn't done by then?
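For what it's worth, a bounded wait like that could look roughly like the sketch below. Everything here is an assumption for illustration: wait_for_docket is a hypothetical name, fetch_queue_item stands in for whatever call re-reads the Processing Queue object, and the returned value is simply whatever the docket field holds once it is filled in.

```python
import time
from typing import Any, Callable, Optional


def wait_for_docket(
    fetch_queue_item: Callable[[], dict],
    timeout: float = 10.0,
    interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Any]:
    """Poll until the queue item's docket field is set, or give up.

    Returns the docket field's value, or None if the deadline passes first.
    """
    deadline = clock() + timeout
    while True:
        item = fetch_queue_item()
        docket = item.get("docket")
        if docket is not None:
            return docket
        if clock() >= deadline:
            return None  # processing was not done in time; caller skips the link
        sleep(interval)
```

Passing clock and sleep in as parameters is only there so the loop can be exercised without real waiting.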
> In case 3, we have a docket id. If we were willing to make another round-trip call to the CourtListener API, we could call /dockets/{id} and parse out the absolute_url field to return the docket URL.
That's not the only solution. There's a good argument for a CL endpoint that returns the docket page directly to the user given a court and pacer_case_id. This was the IA model and it was very powerful and the lack of it is a continued and frequent source of frustration for me as a user.
That is, given a filename of gov.uscourts.cand.308136.1746.0.pdf, which identifies court and pacer_case_id, it was easy to construct the docket page URL: http://www.archive.org/download/gov.uscourts.cand.308136.
(Or, depending on your preference, http://archive.org/download/gov.uscourts.cand.308136/gov.uscourts.cand.308136.docket.html).
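As a small illustration of how mechanical that construction was, here is a sketch that derives the IA item name and docket URL from such a filename. The function names are hypothetical; the filename layout and URL pattern are the ones shown above.

```python
def ia_item_name(filename: str) -> str:
    """Return the gov.uscourts.<court>.<pacer_case_id> prefix of a RECAP filename."""
    parts = filename.split(".")
    # Expected layout: gov.uscourts.<court>.<pacer_case_id>.<doc>.<attachment>.pdf
    if len(parts) < 4 or parts[:2] != ["gov", "uscourts"]:
        raise ValueError(f"not a RECAP-style filename: {filename!r}")
    return ".".join(parts[:4])


def ia_docket_url(filename: str) -> str:
    """Build the Internet Archive docket URL for the case a file belongs to."""
    return "http://www.archive.org/download/" + ia_item_name(filename)
```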
CL and recap-chrome make this hard because:
- [...] pacer_case_id in favor of the docket number (and transforms court): N.D.Cal._3‑17‑cv‑00939_1746_0.pdf (This isn't an issue for recapupload).
- https://www.courtlistener.com/docket/4609586/waymo-llc-v-uber-technologies-inc/?filed_after=&filed_before=&entry_gte=1746&order_by=asc which has keys we don't know (also there is a pagination problem relating to server architecture and caching/pregenerated file choices).
But given that the PDF is available at https://www.courtlistener.com/recap/gov.uscourts.cand.308136.1746.0.pdf the docket should also be available at https://www.courtlistener.com/recap/gov.uscourts.cand.308136.
I thought I had a CL Issue open for this, but I can't seem to find it. (I just barely touch on the "url hacking" issue in https://github.com/freelawproject/recap/issues/61#issuecomment-350947254 and again in https://github.com/freelawproject/courtlistener/issues/783#issuecomment-354473385; I come close in https://github.com/freelawproject/recap/issues/195.) Somebody stop me before I file a new issue...
> For the other two cases, we don't have the docket id.
Case 2 is the common case — only the entry is new. Case 1 (a new docket) and case 3 (new PDF only) are pretty rare.
In the Big Cases context, failing to link for cases 1 and 3 would probably be fine (this is not super-helpful).
> We would need to perform a blocking wait until the fake entry is successfully created. This would mean reading the Processing Queue object returned from the docket upload and then polling it [...] I'm not sure if we actually want this, because bdheath/Big-Cases#4 seems to ask only for the docket URL.
We don't, sorry.
The title of this issue is correct, and I misspoke in the text of the first comment (see edit). Links to CL document pages are pretty uninteresting, and arguably counterproductive. The link to the CL docket page is valuable and desirable.
> Implementation We're lucky that the return type of RecapUpload.__init__() is NoneType right now, which means we can make changes without affecting any callers,
There is only one caller, so changing the API is just fine. We don't have a Makefile problem.
> I recommend adding a separate method to grab the URLs that would do the blocking calls to CourtListener's API. That way, any callers would pay the extra cost of that wait without polluting the main entrypoint method. Thoughts? I'm happy to write that out to see what it would look like.
There's no compelling use case for this right now. You can file a separate issue for it and make the case for it, but it doesn't seem worth much effort and definitely not worth blocking the Big_cases bot for thousands of milliseconds (also known as "seconds").
Let's keep this issue focused on returning docket URLs, not document URLs.
> That's not the only solution. There's a good argument for a CL endpoint that returns the docket page directly to the user given a court and pacer_case_id. This was the IA model and it was very powerful and the lack of it is a continued and frequent source of frustration for me as a user.
Isn't there such an endpoint that we are calling right now? Lines 147-154 seem to call a CourtListener endpoint with a court and pacer_case_id and retrieve a docket object, which contains the absolute URL. If you mean that CL should make docket URLs more guessable, that sounds like a much larger feature request that might take some time from them.
Since Case 2 is the common case, this means that we should have access to the docket object at line 154 (the same HTTP request that I linked earlier, which should be a successful call). This is really good news! From there, it seems simple to grab the absolute_url from that response and return it. I'll have a proposed solution soon.
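Roughly what I have in mind, as a sketch only. I am assuming the lookup returns a paginated body with a results list whose items carry absolute_url; the function name docket_url_from_lookup and that response shape are assumptions, not something taken from the existing code.

```python
from typing import Optional
from urllib.parse import urljoin

CL_BASE = "https://www.courtlistener.com"  # assumed site root


def docket_url_from_lookup(response_json: dict) -> Optional[str]:
    """Return the docket page URL from a court/pacer_case_id lookup, if any.

    Returns None when no docket matched (the new-docket case), so the
    caller can simply skip the link instead of waiting.
    """
    results = response_json.get("results") or []
    if not results:
        return None
    return urljoin(CL_BASE, results[0]["absolute_url"])
```

This adds no extra round trip, since it only reads a response we already have.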
> Isn't there such an endpoint that we are calling right now?
As I said, "returns the docket page directly to the user".
The /dockets API endpoint returns structured information that contains a URL; it does not return a 302 redirect or HTML.
> If you mean that CL should make docket URLs more guessable, that sounds like a much larger feature request that might take some time from them.
Also true. Hence the comment that I thought I had a CL Issue about this.
> From there, it seems simple to grab the absolute_url from that response and return it.
Yeah. Like I tried to suggest in https://github.com/bdheath/Big-Cases/issues/4#issuecomment-361733180, the only reason it hasn't been done is that I'm not sure how Brad feels about it, and there's not much point in doing it if he doesn't want to use it. Also, thinking about how to structure the API is more work than implementing it.
> I thought I had a CL Issue open for this, but I can't seem to find it. (I just barely touch on the "url hacking" issue briefly in https://github.com/freelawproject/recap/issues/61#issuecomment-350947254 and again in https://github.com/freelawproject/courtlistener/issues/783#issuecomment-354473385; I come close in https://github.com/freelawproject/recap/issues/195; ) Somebody stop me before I file a new issue...
Ah, right! It's https://github.com/freelawproject/courtlistener/issues/771 with some added background in https://github.com/freelawproject/recap/issues/194.
There should be a mechanism to return the CL ~document~ docket URL so that, e.g., @Big_Cases can put it in the DocumentCloud metadata so that documents can have a link to the entire docket.