oduwsdl / archivenow

A Tool To Push Web Resources Into Web Archives
MIT License

Archive sites in addition to submitting URIs #20

Closed machawk1 closed 5 years ago

machawk1 commented 6 years ago

One of the use cases in https://github.com/webrecorder/warcit is to grab a site's contents using wget, then run the tool to create a WARC file from the local files. It would be useful for a tool called "archivenow" to do more than submit URIs and to perform some form of archiving itself.

I would like to propose replicating this model within the archivenow tool, but in a single command. For example, running archivenow --warc=news.warc --agent=wget --ia http://cnn.com would use wget to create a WARC of cnn.com, store it locally at news.warc, and also submit the URI to IA.
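A rough sketch of what that single command could do under the hood, in Python. The function names here are hypothetical and not part of archivenow; the only things assumed to exist are wget's own WARC flags (--warc-file, --no-warc-compression, --page-requisites) and the Internet Archive's https://web.archive.org/save/ endpoint.

```python
import subprocess
import urllib.request

# Save Page Now endpoint; requesting this prefix plus a URI asks IA to capture it.
IA_SAVE_PREFIX = "https://web.archive.org/save/"


def build_wget_command(uri, warc_path):
    """Build the wget invocation that writes a WARC for `uri` (hypothetical helper)."""
    # wget appends ".warc" itself, so strip the extension from the requested path.
    prefix = warc_path[:-len(".warc")] if warc_path.endswith(".warc") else warc_path
    return [
        "wget",
        "--page-requisites",        # also fetch images, CSS, and scripts
        "--no-warc-compression",    # write news.warc rather than news.warc.gz
        f"--warc-file={prefix}",
        uri,
    ]


def ia_save_url(uri):
    """Return the URL that triggers a capture of `uri` at the Internet Archive."""
    return IA_SAVE_PREFIX + uri


def archive(uri, warc_path):
    """Create a local WARC with wget, then push the URI to IA."""
    # wget often exits non-zero when a single embedded resource fails,
    # so the return code is not treated as fatal here.
    subprocess.run(build_wget_command(uri, warc_path), check=False)
    with urllib.request.urlopen(ia_save_url(uri)) as response:
        return response.status


if __name__ == "__main__":
    print(archive("http://cnn.com", "news.warc"))
```

This keeps the local capture and the remote submission independent, so either could fail without blocking the other if error handling were added.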

maturban commented 6 years ago

It would be really nice to have "archivenow" create WARCs locally, not just push URLs to other archives. It is like pushing URLs into a local archive in addition to the remote ones. I will definitely implement this as soon as I can. Because this is written in Python, I would suggest using the "requests" module or another Python module instead of "wget". What do you think?

machawk1 commented 6 years ago

You will need to chase down all of the embedded resources w/ requests. wget does this for you and has native support for WARC output. If there was a Python equivalent of @N0taN3rd's https://github.com/n0tan3rd/node-warc, that would work well, too.
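To illustrate the extra work, here is a minimal standard-library sketch of finding the embedded resources in a page so they could each be fetched separately. The class and function names are made up for this example. It only looks at a few tag/attribute pairs and would still miss CSS url() references, srcset, and anything loaded by JavaScript, which is the part wget or a browser-based crawler handles for you.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class EmbeddedResourceParser(HTMLParser):
    """Collect absolute URLs of resources a page embeds (illustrative only)."""

    # Tag -> attribute that holds the resource URL. Note that <link href>
    # also matches non-resource links such as rel="canonical".
    ATTRS = {"img": "src", "script": "src", "link": "href",
             "iframe": "src", "source": "src"}

    def __init__(self, base_uri):
        super().__init__()
        self.base_uri = base_uri
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = self.ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references against the page URI.
                self.resources.append(urljoin(self.base_uri, value))


def embedded_resources(html, base_uri):
    """Return the embedded resource URLs found in `html`, in document order."""
    parser = EmbeddedResourceParser(base_uri)
    parser.feed(html)
    return parser.resources


if __name__ == "__main__":
    page = '<link rel="stylesheet" href="/s.css"><img src="logo.png">'
    for uri in embedded_resources(page, "http://example.com/news/"):
        print(uri)
```

Each URL returned would then need its own request, and every request/response pair would have to be written out as WARC records.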

N0taN3rd commented 6 years ago

For the Python side of controlling Chrome without handling the raw websockets

maturban commented 5 years ago

I am closing this, as we have already added support for creating WARCs with Wget and Squidwarc.