sul-dlss / was-pywb

Configuration for Stanford's pywb instance
https://swap.stanford.edu

Archived sul-embed resources not replaying #192

Open edsu opened 1 year ago

edsu commented 1 year ago

While archiving library.stanford.edu with browsertrix-crawler, we discovered that embedded SDR viewers on blog posts do not display during replay and show an error instead. See:

https://swap.stanford.edu/was/20230505152659/https://library.stanford.edu/blogs/special-collections-unbound/2022/11/born-digital-collections-opened-research-2022

[Image: Screenshot 2023-05-24 at 6 01 52 PM]

A crawl of the same page appears to work in Archive-It, but it is actually showing the embeds from the live web rather than from the capture, which is missing those resources. Here’s what the browser dev tools Network panel looks like when viewing the Archive-It capture:

[Image: Screenshot 2023-05-24 at 12 28 26 PM (1)]

Some of the embedded iframe has been captured, but it looks like some resources that are critical for rendering were not. For example, https://purl.stanford.edu/pq546tq4448/iiif/manifest is not going through swap. The browser does not request these resources on page load, only when the embed scrolls into view:

https://github.com/sul-dlss/was-pywb/assets/33829/b7a145e0-62dc-4fa6-9214-bf939a8abd0f
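
One way to confirm whether a resource such as the IIIF manifest made it into the capture is to ask pywb's CDX API for it. The following is a minimal sketch, not something taken from this deployment: it assumes the `was` collection exposes the standard pywb CDX endpoint at `https://swap.stanford.edu/was/cdx`, which may not be publicly reachable, and the helper names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: pywb's CDX server API is exposed for the "was" collection here.
CDX_ENDPOINT = "https://swap.stanford.edu/was/cdx"


def cdx_query_url(target_url, endpoint=CDX_ENDPOINT, limit=5):
    """Build a CDX lookup URL for target_url (hypothetical helper)."""
    params = {"url": target_url, "output": "json", "limit": limit}
    return f"{endpoint}?{urlencode(params)}"


def parse_cdx_lines(body):
    """Parse pywb's output=json response: one JSON object per line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def is_captured(target_url, endpoint=CDX_ENDPOINT):
    """Return True if the index lists at least one capture of target_url.

    Performs a network request; HTTP errors are not handled in this sketch.
    """
    with urlopen(cdx_query_url(target_url, endpoint)) as resp:
        return len(parse_cdx_lines(resp.read().decode("utf-8"))) > 0
```

Calling `is_captured("https://purl.stanford.edu/pq546tq4448/iiif/manifest")` would then show whether the manifest is in the index at all, which helps separate a crawl-time gap from a replay-time rewriting problem.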

So it appears that browsertrix-crawler was not configured to scroll the page?
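
For reference, browsertrix-crawler selects its in-page behaviors with the `--behaviors` option, and `autoscroll` is one of them. Below is a rough sketch of how a single-page test crawl with autoscroll explicitly requested could be assembled. The collection name, the local `crawls` directory, and the `crawl_command` helper are placeholders invented for this example, not the settings used for the library.stanford.edu crawl, and scoping options are left out.

```python
import os
import shlex


def crawl_command(url, collection="sul-embed-test", behaviors=("autoscroll",)):
    """Build a docker command line for a test crawl (hypothetical helper)."""
    # Assumption: crawl output is written to ./crawls on the host.
    crawls_dir = os.path.join(os.getcwd(), "crawls")
    return [
        "docker", "run",
        "-v", f"{crawls_dir}:/crawls/",
        "webrecorder/browsertrix-crawler", "crawl",
        "--url", url,
        "--collection", collection,
        # Comma-separated list of behaviors to run on each page.
        "--behaviors", ",".join(behaviors),
    ]


cmd = crawl_command(
    "https://library.stanford.edu/blogs/special-collections-unbound/"
    "2022/11/born-digital-collections-opened-research-2022"
)
# Print a shell-safe version of the command for copy and paste.
print(shlex.join(cmd))
```

Replaying the resulting capture and looking for the purl.stanford.edu requests would show whether the scroll actually happened during the crawl.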

edsu commented 1 year ago

I've opened https://github.com/webrecorder/browsertrix-crawler/issues/321 since it looks like browsertrix-crawler is not autoscrolling this page even when it is explicitly told to.