Codecov report: patch coverage 30.00%; project coverage change −0.98% :warning:. Comparison is base (e5a2943) 69.72% compared to head (307f5e0) 68.75%. Report is 5 commits behind head on develop.
(Followup: I am at least a little full of beans about the CNN capture timing.)
I tested these database migrations locally against a prod-like database: 3:28.18 with the data migration, 2:56.67 without it. Will be slower in prod.... but I think we go with it.
This PR uses the instance of the Scoop REST API added in https://github.com/harvard-lil/perma/pull/3381 to make Perma Links using Scoop, behind the global feature flag added in https://github.com/harvard-lil/perma/pull/3382.
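To illustrate the gating described above, here is a minimal sketch of routing captures behind a global feature flag. All names here (`choose_capture_engine`, `run_capture`, the `"scoop-api"` label) are hypothetical stand-ins, not the actual Perma code:

```python
# Hypothetical sketch: dispatch a capture based on a global feature flag.
# Function and engine names are illustrative, not Perma's real identifiers.

def choose_capture_engine(scoop_flag_enabled: bool) -> str:
    """Return which engine should run the capture."""
    return "scoop-api" if scoop_flag_enabled else "perma"

def run_capture(url: str, scoop_flag_enabled: bool) -> dict:
    engine = choose_capture_engine(scoop_flag_enabled)
    # In the real PR, the Scoop branch would call the Scoop REST API
    # instance added in perma#3381; here we just record the decision.
    return {"url": url, "engine": engine}
```

With the flag off, everything continues down the existing Perma path, which is why the status quo barely changes.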
Tests? Nope
It is untested, but feels relatively solid: I tried to code with an eye towards error-handling and edge cases, rather than focusing on the happy path.
I decided to fail hard if the Scoop REST API doesn't return data in exactly the format we're expecting, rather than being flexible: hopefully once we have tests, that will help make sure we adapt appropriately to changes.
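As a sketch of that fail-hard approach: validate the response strictly and raise rather than limping along. The key names below are assumptions for illustration, not the real Scoop REST API schema:

```python
import json

# Illustrative only -- not the actual Scoop REST API response schema.
REQUIRED_KEYS = {"status", "id_capture"}

class ScoopAPIFormatError(Exception):
    """Raised when a Scoop REST API response doesn't match expectations."""

def parse_scoop_response(raw: str) -> dict:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ScoopAPIFormatError(f"Response is not valid JSON: {e}")
    if not isinstance(data, dict):
        raise ScoopAPIFormatError("Expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        # Fail hard rather than guessing: an upstream format change
        # should surface immediately, not silently degrade captures.
        raise ScoopAPIFormatError(f"Missing expected keys: {sorted(missing)}")
    return data
```

The upside of this strictness is exactly what's noted above: once tests exist, a schema change on the Scoop side breaks loudly in one place.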
Does it change the status quo?
It changes the status quo a very little bit, even with the feature flag off, because it adds new fields:
- `Link.captured_by_software`
- `Link.captured_by_browser`
- `CaptureJob.engine`
- `CaptureJob.scoop_start_time`
- `CaptureJob.scoop_end_time`
- `CaptureJob.scoop_logs`
- `CaptureJob.scoop_state`
This shouldn't be a big deal.
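For reference, here is the rough shape of those fields, sketched as plain dataclasses rather than the actual Django models; the types and defaults are assumptions, apart from the field names themselves:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class CaptureJobFields:
    """Sketch of the new CaptureJob columns; not the real Django model."""
    engine: str = "perma"                        # assumed default
    scoop_start_time: Optional[datetime] = None  # nullable
    scoop_end_time: Optional[datetime] = None    # nullable
    scoop_logs: Optional[dict] = None            # nullable JSON blob
    scoop_state: Optional[str] = None            # e.g. "complete" / "partial"

@dataclass
class LinkFields:
    """Sketch of the new Link columns; not the real Django model."""
    captured_by_software: str = "perma"          # backfilled for existing links
    captured_by_browser: Optional[str] = None
```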
I set `captured_by_software` to either `perma` or `upload` for all existing links during the migration, and make sure that happens going forward as well (via field defaults and https://github.com/harvard-lil/perma/pull/3386/files#diff-0950eb17b0561e9ae6b5a6cad00edcff38c9db5d41112e45427d90b73fa7fbf7R1883). The four `scoop_*` fields are all nullable, so they don't have to be set. And, I did not change the API to expose any of those fields (to be discussed).
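The backfill logic might look roughly like this. This is a plain-Python sketch, not the actual migration, and the criterion for telling uploads apart from live captures (a `was_uploaded` flag) is a guess for illustration:

```python
def backfill_captured_by_software(links):
    """Set captured_by_software on existing link records.

    `links` is any iterable of dicts standing in for Link rows.
    The 'was_uploaded' key is a hypothetical stand-in for however
    Perma actually distinguishes uploads from live captures.
    """
    for link in links:
        link["captured_by_software"] = (
            "upload" if link.get("was_uploaded") else "perma"
        )
    return links
```

In a real Django data migration this would run inside a `RunPython` operation, which is the part that added time to the local migration test above.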
I have a hard time imagining this causing any problems, especially since the test suite is passing as is. But! I thought it was worth mentioning.
About those new fields
I found it somewhat awkward, trying to figure out where to put all this new metadata. What belongs on `Link`? What belongs on `Capture`? What belongs on `CaptureJob`? I also wasn't crazy about the idea of adding Scoop-specific fields; in the abstract, I prefer fields like `engine`, which are open-ended. But... I tried not to make it too messy. Very open to suggestions on how to improve, or names to change, etc.
The basic idea is:
This PR extracts anything needed for our current API responses and business logic (page title, page description, content type), plus a few other things I think may prove interesting for us to track going forward: Scoop version, browser user agent, and Scoop "state" (complete or partial, for successful captures).
And, I decided to save Scoop-specific timings, in addition to the soup-to-nuts capture timings we currently track. That may prove unnecessary, since our logic should all be fast... but when I added it, I had a mistake that made our logic slow 🤣. So, it's probably good to have, just for us to keep an eye on.
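One nice consequence of keeping both sets of timestamps: the overhead of our own logic falls out as a simple difference. A tiny sketch (the helper function is hypothetical; the timestamp pairs correspond to the soup-to-nuts and `scoop_*` fields described above):

```python
from datetime import datetime, timedelta

def perma_overhead(capture_start: datetime, capture_end: datetime,
                   scoop_start: datetime, scoop_end: datetime) -> timedelta:
    """Time spent in Perma's own logic, outside the Scoop capture itself."""
    total = capture_end - capture_start   # soup-to-nuts duration
    scoop = scoop_end - scoop_start       # Scoop-only duration
    return total - scoop
```

Watching that difference is exactly how a mistake like the one mentioned above (our logic accidentally being slow) would show up.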
For admins
The new fields are all displayed in the Django admin; you can filter on some of them if desired.
The full Scoop REST API logs are displayed in a nice JSON widget, which lets you expand/collapse sections and search. I couldn't get that working in the inline section of the `Link` admin; you have to go straight to the `CaptureJob` admin... if that proves frustrating, I'll give it another try.

The link "tray" now scrolls when long, and has two new admin-only features: "captured by", and a button to expose the replay controls.
More to do
We've been recording some known next steps in Linear; I won't rehash here. A couple that come to mind, obvious from this PR, are: look into target URL reformatting, figure out something for the progress bar, and see about screenshot quality.
But, I'll point out one other that comes to mind. While we have extensive tests showing that Scoop is overall FASTER than Perma... man, cnn.com, nytimes.com, etc. take a long, long time. I am a little concerned that Perma's users will disproportionately capture slow sites, and would opt for less fidelity in exchange for a faster capture. I think we should keep an eye on it, and potentially think about letting people toggle on a "low-fidelity" mode for particular captures, or something. To be discussed.