harvard-lil / perma

Indelible links

Capture with Scoop #3386

Closed by rebeccacremona 1 year ago

rebeccacremona commented 1 year ago

This PR uses the instance of the Scoop REST API added in https://github.com/harvard-lil/perma/pull/3381 to make Perma Links using Scoop, behind the global feature flag added in https://github.com/harvard-lil/perma/pull/3382.
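As a rough illustration of that dispatch, flag-gated engine selection might look like the sketch below. This is not Perma's actual code: the flag name and both helper functions are assumptions.

```python
# A minimal sketch of flag-gated engine selection; all names are hypothetical.
from django.conf import settings

def capture_with_scoop(link_guid: str) -> None:
    """Hypothetical: submit the capture to the Scoop REST API instance (#3381)."""

def capture_with_legacy_engine(link_guid: str) -> None:
    """Hypothetical: run the existing Perma capture pipeline."""

def run_capture(link_guid: str) -> None:
    # The global feature flag (added in #3382) picks the engine.
    if getattr(settings, "CAPTURE_WITH_SCOOP", False):
        capture_with_scoop(link_guid)
    else:
        capture_with_legacy_engine(link_guid)
```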

Tests? Nope

It is untested, but feels relatively solid: I tried to code with an eye towards error-handling and edge cases, rather than focusing on the happy path.

I decided to fail hard if the Scoop REST API doesn't return data in exactly the format we're expecting, rather than being flexible: once we have tests, that strictness should help make sure we adapt appropriately when the format changes.
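For flavor, here's a minimal sketch of that fail-hard style. The exception class, key names, and response shape below are all illustrative, not the PR's actual code:

```python
class ScoopAPIFormatError(Exception):
    """Raised when the Scoop REST API response isn't shaped as expected."""

def parse_scoop_summary(data: dict) -> dict:
    try:
        # Pull required keys eagerly; no .get() fallbacks, no defaults.
        return {
            "scoop_version": data["scoop_version"],
            "state": data["state"],  # e.g. "complete" or "partial"
            "page_title": data["page_info"]["title"],
        }
    except (KeyError, TypeError) as exc:
        # A missing or reshaped field fails loudly, so we notice the drift
        # (and, once tests exist, so the suite fails too).
        raise ScoopAPIFormatError(f"unexpected response shape: {exc!r}") from exc
```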

Does it change the status quo?

It changes the status quo a little bit, even with the feature flag off, because it adds new fields:

This shouldn't be a big deal.

I set captured_by_software to either perma or upload for all existing links during the migration, and make sure that happens going forward as well (via field defaults and https://github.com/harvard-lil/perma/pull/3386/files#diff-0950eb17b0561e9ae6b5a6cad00edcff38c9db5d41112e45427d90b73fa7fbf7R1883); a rough sketch of the backfill follows this list.

The four scoop_* fields are all nullable, so they don't have to be set.

And, I did not change the API to expose any of those fields (to be discussed).
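The backfill mentioned above might look roughly like the following Django data migration. This is a sketch, not the PR's migration: the discriminating condition, app label, and dependency are placeholders (the real logic is in the linked diff).

```python
from django.db import migrations

def backfill_captured_by_software(apps, schema_editor):
    Link = apps.get_model("perma", "Link")
    # Hypothetical discriminator: assume an existing boolean marking uploads.
    Link.objects.filter(is_user_upload=True).update(captured_by_software="upload")
    Link.objects.filter(is_user_upload=False).update(captured_by_software="perma")

class Migration(migrations.Migration):
    dependencies = [("perma", "0001_placeholder")]  # placeholder dependency
    operations = [
        # Reverse is a no-op: the columns are simply dropped on rollback.
        migrations.RunPython(backfill_captured_by_software, migrations.RunPython.noop),
    ]
```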

I have a hard time imagining this causing any problems, especially since the test suite is passing as is. But! I thought it was worth mentioning.

About those new fields

I found it somewhat awkward, trying to figure out where to put all this new metadata. What belongs on Link? What belongs on Capture? What belongs on CaptureJob? I also wasn't crazy about the idea of adding Scoop-specific fields; in the abstract, I prefer fields like engine, which are open-ended.

But... I tried not to make it too messy. Very open to suggestions on how to improve, or names to change, or etc.

The basic idea is:

This PR extracts anything needed for our current API responses and business logic (page title, page description, content type), plus a few other things that I think may prove interesting to track going forward: Scoop version, browser user agent, and Scoop "state" (complete or partial, for successful captures).
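Concretely, the new fields might look something like this sketch. Only captured_by_software, the scoop_ prefix, and nullability come from this PR; the field names, types, and host model below are guesses:

```python
from django.db import models

class CaptureJob(models.Model):  # host model is a guess; the PR weighs Link vs Capture vs CaptureJob
    captured_by_software = models.CharField(max_length=255, default="perma")
    scoop_state = models.CharField(max_length=255, blank=True, null=True)       # "complete" / "partial"
    scoop_version = models.CharField(max_length=255, blank=True, null=True)     # hypothetical name
    scoop_user_agent = models.CharField(max_length=500, blank=True, null=True)  # hypothetical name
    scoop_logs = models.TextField(blank=True, null=True)                        # hypothetical name
```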

And, I decided to save Scoop-specific timings, in addition to the soup-to-nuts capture timings we currently track. That may prove unnecessary, since our logic should all be fast... but when I added it, I had a mistake that made our logic slow 🤣. So, it's probably good to have, just for us to keep an eye on.
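The value of keeping both timings is that their difference isolates our overhead from Scoop's. Something like the following (names illustrative):

```python
from datetime import datetime

def perma_overhead_seconds(job_start: datetime, job_end: datetime, scoop_elapsed: float) -> float:
    """Seconds spent outside Scoop itself; a spike here points at our code, not the capture."""
    return (job_end - job_start).total_seconds() - scoop_elapsed
```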

For admins

More to do

We've been recording some known next steps in Linear; I won't rehash here. A couple that come to mind, obvious from this PR, are: look into target URL reformatting, figure out something for the progress bar, and see about screenshot quality.

But, I'll point out one other that comes to mind. While we have extensive tests showing that Scoop is overall FASTER than Perma... man, cnn.com, nytimes.com, etc. take a long, long time. I am a little concerned that Perma's users disproportionately capture slow sites, and would opt for lower fidelity in exchange for a faster capture. I think we should keep an eye on it, and potentially think about letting people toggle on a "low-fidelity" mode for particular captures. To be discussed.

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 30.00% and project coverage change: -0.98% :warning:

Comparison is base (e5a2943) 69.72% compared to head (307f5e0) 68.75%. Report is 5 commits behind head on develop.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           develop    #3386      +/-   ##
===========================================
- Coverage    69.72%   68.75%   -0.98%     
===========================================
  Files           53       53              
  Lines         7106     7263     +157     
===========================================
+ Hits          4955     4994      +39     
- Misses        2151     2269     +118     
```

| [Files Changed](https://app.codecov.io/gh/harvard-lil/perma/pull/3386?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=harvard-lil) | Coverage Δ | |
|---|---|---|
| [perma_web/perma/celery_tasks.py](https://app.codecov.io/gh/harvard-lil/perma/pull/3386?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=harvard-lil#diff-cGVybWFfd2ViL3Blcm1hL2NlbGVyeV90YXNrcy5weQ==) | `49.33% <5.66%> (-3.78%)` | :arrow_down: |
| [perma_web/perma/utils.py](https://app.codecov.io/gh/harvard-lil/perma/pull/3386?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=harvard-lil#diff-cGVybWFfd2ViL3Blcm1hL3V0aWxzLnB5) | `61.41% <15.00%> (-2.31%)` | :arrow_down: |
| [perma_web/perma/models.py](https://app.codecov.io/gh/harvard-lil/perma/pull/3386?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=harvard-lil#diff-cGVybWFfd2ViL3Blcm1hL21vZGVscy5weQ==) | `86.34% <93.33%> (+0.05%)` | :arrow_up: |
| [perma_web/perma/admin.py](https://app.codecov.io/gh/harvard-lil/perma/pull/3386?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=harvard-lil#diff-cGVybWFfd2ViL3Blcm1hL2FkbWluLnB5) | `83.58% <95.65%> (+0.43%)` | :arrow_up: |
| [perma_web/api/views.py](https://app.codecov.io/gh/harvard-lil/perma/pull/3386?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=harvard-lil#diff-cGVybWFfd2ViL2FwaS92aWV3cy5weQ==) | `84.59% <100.00%> (+0.03%)` | :arrow_up: |
| [perma_web/perma/exceptions.py](https://app.codecov.io/gh/harvard-lil/perma/pull/3386?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=harvard-lil#diff-cGVybWFfd2ViL3Blcm1hL2V4Y2VwdGlvbnMucHk=) | `100.00% <100.00%> (ø)` | |

:umbrella: View full report in Codecov by Sentry.

rebeccacremona commented 1 year ago

(Followup: I am at least a little full of beans about the CNN capture timing.)

[screenshot: CNN capture timing]

rebeccacremona commented 1 year ago

I tested these database migrations locally against a prod-like database: 3:28.18 with the data migration, 2:56.67 without it. It will be slower in prod... but I think we go with it.