harvard-lil / perma

Indelible links

Capture with Scoop #3386

Closed by rebeccacremona 1 year ago

rebeccacremona commented 1 year ago

This PR uses the instance of the Scoop REST API added in https://github.com/harvard-lil/perma/pull/3381 to make Perma Links using Scoop, behind the global feature flag added in https://github.com/harvard-lil/perma/pull/3382.
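As a rough illustration of that dispatch, flag-gated engine selection might look like the sketch below. This is not Perma's actual code: the flag name and both helper functions are assumptions.

```python
# A minimal sketch of flag-gated engine selection; all names are hypothetical.
from django.conf import settings

def capture_with_scoop(link_guid: str) -> None:
    """Hypothetical: submit the capture to the Scoop REST API instance (#3381)."""

def capture_with_legacy_engine(link_guid: str) -> None:
    """Hypothetical: run the existing Perma capture pipeline."""

def run_capture(link_guid: str) -> None:
    # The global feature flag (added in #3382) picks the engine.
    if getattr(settings, "CAPTURE_WITH_SCOOP", False):
        capture_with_scoop(link_guid)
    else:
        capture_with_legacy_engine(link_guid)
```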

Tests? Nope

It is untested, but feels relatively solid: I tried to code with an eye towards error-handling and edge cases, rather than focusing on the happy path.

I decided to fail hard if the Scoop REST API doesn't return data in exactly the format we're expecting, rather than being flexible: once we have tests, that strictness should help make sure we adapt appropriately when the format changes.
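For flavor, here's a minimal sketch of that fail-hard style. The exception class, key names, and response shape below are all illustrative, not the PR's actual code:

```python
class ScoopAPIFormatError(Exception):
    """Raised when the Scoop REST API response isn't shaped as expected."""

def parse_scoop_summary(data: dict) -> dict:
    try:
        # Pull required keys eagerly; no .get() fallbacks, no defaults.
        return {
            "scoop_version": data["scoop_version"],
            "state": data["state"],  # e.g. "complete" or "partial"
            "page_title": data["page_info"]["title"],
        }
    except (KeyError, TypeError) as exc:
        # A missing or reshaped field fails loudly, so we notice the drift
        # (and, once tests exist, so the suite fails too).
        raise ScoopAPIFormatError(f"unexpected response shape: {exc!r}") from exc
```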

Does it change the status quo?

It changes the status quo a little bit, even with the feature flag off, because it adds new fields:

This shouldn't be a big deal.

I set captured_by_software to either perma or upload for all existing links during the migration, and make sure that happens going forward as well (via field defaults and https://github.com/harvard-lil/perma/pull/3386/files#diff-0950eb17b0561e9ae6b5a6cad00edcff38c9db5d41112e45427d90b73fa7fbf7R1883); a rough sketch of the backfill follows this list.

The four scoop_* fields are all nullable, so they don't have to be set.

And, I did not change the API to expose any of those fields (to be discussed).
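The backfill mentioned above might look roughly like the following Django data migration. This is a sketch, not the PR's migration: the discriminating condition, app label, and dependency are placeholders (the real logic is in the linked diff).

```python
from django.db import migrations

def backfill_captured_by_software(apps, schema_editor):
    Link = apps.get_model("perma", "Link")
    # Hypothetical discriminator: assume an existing boolean marking uploads.
    Link.objects.filter(is_user_upload=True).update(captured_by_software="upload")
    Link.objects.filter(is_user_upload=False).update(captured_by_software="perma")

class Migration(migrations.Migration):
    dependencies = [("perma", "0001_placeholder")]  # placeholder dependency
    operations = [
        # Reverse is a no-op: the columns are simply dropped on rollback.
        migrations.RunPython(backfill_captured_by_software, migrations.RunPython.noop),
    ]
```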

I have a hard time imagining this causing any problems, especially since the test suite is passing as is. But! I thought it was worth mentioning.

About those new fields

I found it somewhat awkward, trying to figure out where to put all this new metadata. What belongs on Link? What belongs on Capture? What belongs on CaptureJob? I also wasn't crazy about the idea of adding Scoop-specific fields; in the abstract, I prefer fields like engine, which are open-ended.

But... I tried not to make it too messy. Very open to suggestions on how to improve, or names to change, or etc.

The basic idea is:

This PR extracts anything needed for our current API responses and business logic (page title, page description, content type), plus a few other things that I think may prove interesting to track going forward: Scoop version, browser user agent, and Scoop "state" (complete or partial, for successful captures).
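Concretely, the new fields might look something like this sketch. Only captured_by_software, the scoop_ prefix, and nullability come from this PR; the field names, types, and host model below are guesses:

```python
from django.db import models

class CaptureJob(models.Model):  # host model is a guess; the PR weighs Link vs Capture vs CaptureJob
    captured_by_software = models.CharField(max_length=255, default="perma")
    scoop_state = models.CharField(max_length=255, blank=True, null=True)       # "complete" / "partial"
    scoop_version = models.CharField(max_length=255, blank=True, null=True)     # hypothetical name
    scoop_user_agent = models.CharField(max_length=500, blank=True, null=True)  # hypothetical name
    scoop_logs = models.TextField(blank=True, null=True)                        # hypothetical name
```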

And, I decided to save Scoop-specific timings, in addition to the soup-to-nuts capture timings we currently track. That may prove unnecessary, since our logic should all be fast... but when I added it, I had a mistake that made our logic slow 🤣. So, it's probably good to have, just for us to keep an eye on.
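The value of keeping both timings is that their difference isolates our overhead from Scoop's. Something like the following (names illustrative):

```python
from datetime import datetime

def perma_overhead_seconds(job_start: datetime, job_end: datetime, scoop_elapsed: float) -> float:
    """Seconds spent outside Scoop itself; a spike here points at our code, not the capture."""
    return (job_end - job_start).total_seconds() - scoop_elapsed
```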

For admins

More to do

We've been recording some known next steps in Linear; I won't rehash here. A couple that come to mind, obvious from this PR, are: look into target URL reformatting, figure out something for the progress bar, and see about screenshot quality.

But, I'll point out one other that comes to mind. While we have extensive tests showing that Scoop is overall FASTER than Perma... man, cnn.com, nytimes.com, etc. take a long, long time. I am a little concerned that Perma's users disproportionately capture slow sites, and would opt for lower fidelity in exchange for a faster capture. I think we should keep an eye on it, and potentially think about letting people toggle on a "low-fidelity" mode for particular captures. To be discussed.

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 30.00% and project coverage change: -0.98% :warning:

Comparison is base (e5a2943) 69.72% compared to head (307f5e0) 68.75%. Report is 5 commits behind head on develop.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           develop    #3386      +/-   ##
===========================================
- Coverage    69.72%   68.75%   -0.98%     
===========================================
  Files           53       53              
  Lines         7106     7263     +157     
===========================================
+ Hits          4955     4994      +39     
- Misses        2151     2269     +118     
```

| [Files Changed](https://app.codecov.io/gh/harvard-lil/perma/pull/3386?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=harvard-lil) | Coverage Δ | |
|---|---|---|
| [perma_web/perma/celery_tasks.py](https://app.codecov.io/gh/harvard-lil/perma/pull/3386?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=harvard-lil#diff-cGVybWFfd2ViL3Blcm1hL2NlbGVyeV90YXNrcy5weQ==) | `49.33% <5.66%> (-3.78%)` | :arrow_down: |
| [perma_web/perma/utils.py](https://app.codecov.io/gh/harvard-lil/perma/pull/3386?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=harvard-lil#diff-cGVybWFfd2ViL3Blcm1hL3V0aWxzLnB5) | `61.41% <15.00%> (-2.31%)` | :arrow_down: |
| [perma_web/perma/models.py](https://app.codecov.io/gh/harvard-lil/perma/pull/3386?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=harvard-lil#diff-cGVybWFfd2ViL3Blcm1hL21vZGVscy5weQ==) | `86.34% <93.33%> (+0.05%)` | :arrow_up: |
| [perma_web/perma/admin.py](https://app.codecov.io/gh/harvard-lil/perma/pull/3386?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=harvard-lil#diff-cGVybWFfd2ViL3Blcm1hL2FkbWluLnB5) | `83.58% <95.65%> (+0.43%)` | :arrow_up: |
| [perma_web/api/views.py](https://app.codecov.io/gh/harvard-lil/perma/pull/3386?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=harvard-lil#diff-cGVybWFfd2ViL2FwaS92aWV3cy5weQ==) | `84.59% <100.00%> (+0.03%)` | :arrow_up: |
| [perma_web/perma/exceptions.py](https://app.codecov.io/gh/harvard-lil/perma/pull/3386?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=harvard-lil#diff-cGVybWFfd2ViL3Blcm1hL2V4Y2VwdGlvbnMucHk=) | `100.00% <100.00%> (ø)` | |

:umbrella: View full report in Codecov by Sentry.

rebeccacremona commented 1 year ago

(Followup: I am at least a little full of beans about the CNN capture timing.)

[screenshot: CNN capture timing]

rebeccacremona commented 1 year ago

I tested these database migrations locally against a prod-like database: 3:28.18 with the data migration, 2:56.67 without it. It will be slower in prod... but I think we go with it.