harvard-lil / perma

Indelible links

Tweak pages.jsonl produced when converting WARC -> WACZ #3535

Closed by rebeccacremona 4 weeks ago

rebeccacremona commented 1 month ago

WACZ files include, zipped inside them, a file called pages.jsonl, which helps replay software know what the "entrypoint" URLs are for a given archive.
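
For readers unfamiliar with the layout, here is a minimal sketch of how the entrypoints could be read back out of a WACZ. It assumes the file is stored at `pages/pages.jsonl` inside the zip (the location the WACZ specification uses); `read_pages` is a hypothetical helper, not something in Perma's codebase.

```python
import json
import zipfile


def read_pages(wacz_file, member="pages/pages.jsonl"):
    """Return (header, pages) parsed from the pages.jsonl inside a WACZ.

    `wacz_file` may be a path or a file-like object. The default member
    name is an assumption based on the WACZ directory layout.
    """
    with zipfile.ZipFile(wacz_file) as zf:
        raw = zf.read(member).decode("utf-8")

    # One JSON object per non-empty line; the first line is the header record.
    records = [json.loads(line) for line in raw.splitlines() if line.strip()]
    header, pages = records[0], records[1:]
    return header, pages
```

Each entry in `pages` is a dict with at least `url` and `ts`, which is what replay software would use to offer a starting point.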

This PR makes the pages.jsonl produced during the conversion process more closely match the pages.jsonl produced during a Scoop WACZ capture of a target URL.

Before:

{"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}
{"title": "primary capture url", "url": "http://example.com/", "ts": "2024-05-22 21:49:28.701508+00:00"}
{"title": "screenshot url", "url": "file:///screenshot.png", "ts": "2024-05-22 21:49:28.701508+00:00"}
{"title": "provenance summary url", "url": "file:///provenance-summary.html", "ts": "2024-05-22 21:49:28.701508+00:00"}

After:

{"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}
{"url": "file:///provenance-summary.html", "title": "Provenance Summary", "ts": "2024-05-22 21:49:28.701508+00:00"}
{"url": "file:///screenshot.png", "title": "Capture Time Screenshot of http://example.com/", "ts": "2024-05-22 21:49:28.701508+00:00"}
{"url": "http://example.com/", "title": "High-Fidelity Web Capture of http://example.com/", "ts": "2024-05-22 21:49:28.701508+00:00"}

(We decided NOT to include the optional "id" field since it IS optional, and since it is primarily there to optimize performance when you have thousands or millions of pages... as opposed to 1-3, like us.)
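
As an illustration only, the "After" output could be produced by something along these lines. This is a sketch, not the code in this PR: `build_pages_jsonl` is a hypothetical name, and the assumption that `ts` is simply `str()` of a timezone-aware `datetime` is inferred from the sample output above.

```python
import json
from datetime import datetime, timezone


def build_pages_jsonl(target_url, ts):
    """Build pages.jsonl text in the "After" shape shown above.

    `ts` is a timezone-aware datetime; str(ts) yields the
    "YYYY-MM-DD HH:MM:SS.ffffff+00:00" form seen in the samples.
    """
    ts_str = str(ts)
    header = {"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}
    # Order and titles mirror the "After" sample; no per-page "id" is emitted.
    pages = [
        {"url": "file:///provenance-summary.html", "title": "Provenance Summary", "ts": ts_str},
        {"url": "file:///screenshot.png", "title": f"Capture Time Screenshot of {target_url}", "ts": ts_str},
        {"url": target_url, "title": f"High-Fidelity Web Capture of {target_url}", "ts": ts_str},
    ]
    return "\n".join(json.dumps(record) for record in [header, *pages]) + "\n"


if __name__ == "__main__":
    capture_time = datetime(2024, 5, 22, 21, 49, 28, 701508, tzinfo=timezone.utc)
    print(build_pages_jsonl("http://example.com/", capture_time), end="")
```

Python dicts preserve insertion order, so putting `url` first in each record is enough to match the key order in the sample.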

See ENG-922.

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 5.88235% with 16 lines in your changes missing coverage. Please review.

Project coverage is 69.50%. Comparing base (7d556e4) to head (ab146f3). Report is 63 commits behind head on develop.

| Files | Patch % | Lines |
|---|---|---|
| perma_web/perma/models.py | 9.09% | 10 Missing :warning: |
| perma_web/perma/celery_tasks.py | 0.00% | 6 Missing :warning: |
Additional details and impacted files

```diff
@@             Coverage Diff             @@
##           develop    #3535      +/-   ##
===========================================
- Coverage    69.53%   69.50%   -0.04%
===========================================
  Files           48       48
  Lines         6785     6788       +3
===========================================
  Hits          4718     4718
- Misses        2067     2070       +3
```
