oduwsdl / archivenow

A Tool To Push Web Resources Into Web Archives
MIT License

Archive Web Site #29

Closed · sirinath closed this 5 years ago

sirinath commented 5 years ago

Can you add the ability to archive a complete web site?

Some of the files may be documents, like DOC or PDF files, that contain links.

maturban commented 5 years ago

Hi Sirinath,

Pushing a web site is kind of tricky because it requires ArchiveNow to:

(1) Download the site locally using some sort of crawler
(2) Extract all URIs of web pages in the site
(3) Push those URIs into archives

Although it is doable, it might result in sending too many requests to the archive.

What you can do initially is download the site into local WARC file(s) using Wget or, even better, Squidwarc, which can discover more resources by executing JS. Then extract all URIs of web pages from the WARC file(s), and finally submit those URIs one by one to archives using ArchiveNow.
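For illustration, here is a minimal sketch of that pipeline, assuming a WARC was already produced with Wget (e.g., `wget --recursive --level=2 --warc-file=site --delete-after https://example.com/`). It reads the WARC with the warcio library and pushes each HTML response URI through ArchiveNow's Python interface; the HTML-only filter and the choice of the Internet Archive (`ia`) are assumptions, not requirements:

```python
from warcio.archiveiterator import ArchiveIterator
from archivenow import archivenow

def extract_uris(warc_path):
    """Collect the target URIs of all HTML response records in a WARC file."""
    uris = set()
    with open(warc_path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response' or record.http_headers is None:
                continue
            content_type = record.http_headers.get_header('Content-Type') or ''
            if 'text/html' in content_type:
                uris.add(record.rec_headers.get_header('WARC-Target-URI'))
    return uris

if __name__ == '__main__':
    for uri in sorted(extract_uris('site.warc.gz')):
        # push() returns a list of archived-copy URIs (or error strings)
        print(archivenow.push(uri, 'ia'))
```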

This idea is mainly suggested by @machawk1

Best,

Mohamed

sirinath commented 5 years ago

I believe this could be done if there were an example integration handler with Scrapy that users could customise. It could even be hosted on Scrapinghub, where a simple job could do the pushing without the user having to run anything locally.
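A rough sketch of what such an integration might look like, assuming ArchiveNow is installed alongside Scrapy; the spider name, target domain, and the choice of the Internet Archive (`ia`) are illustrative only:

```python
import scrapy
from archivenow import archivenow

class ArchiveSpider(scrapy.Spider):
    """Crawl a site and push every page it visits into a web archive."""
    name = 'archive_site'
    allowed_domains = ['example.com']     # hypothetical target site
    start_urls = ['https://example.com/']

    def parse(self, response):
        # Submit the page we just fetched to the Internet Archive.
        archivenow.push(response.url, 'ia')
        # Follow in-domain links so the rest of the site gets submitted too.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)
```

This could be run standalone with `scrapy runspider archive_spider.py`, or dropped into an existing Scrapy project.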

sirinath commented 5 years ago

For PDF and Doc processing I found:

machawk1 commented 5 years ago

As we discussed @maturban, it may come down to:

  1. Discoverability of URIs that constitute a "complete website"
  2. The ability to surface additional URIs of embedded resources
  3. Mitigating the inevitable throttling that will occur when attempting to submit many URIs to archives at a reasonable pace (see the pacing sketch at the end of this comment).

Point 2 would benefit from a browser-based system as you referenced with Squidwarc, but the overhead of generating WARCs from this content seems like an unnecessary burden for someone wanting to submit URIs.

I have not used Scrapy in a while, as suggested by @sirinath, but its limited capability of rendering pages (with regard to JS) will likely hinder the completeness of the set of URIs, individual pages, and thus complete web sites.

I also recall there being policies from some archives as to what sort of content-types they retain, e.g., does IA allow submission of URIs of DOCs and PDFs?
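On point 3, a minimal sketch of pacing submissions, assuming a fixed delay between pushes; the `push_paced` helper and the 15-second delay are illustrative, not a documented archive limit:

```python
import time
from archivenow import archivenow

def push_paced(uris, archive_id='ia', delay=15):
    """Push URIs one at a time, sleeping between requests."""
    results = {}
    for uri in uris:
        results[uri] = archivenow.push(uri, archive_id)
        time.sleep(delay)  # crude rate limiting; tune per archive's policy
    return results
```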