internetarchive / warcprox

WARC writing MITM HTTP/S proxy

Thoughts on a custom browser solution for local research #111

Open hanoii opened 5 years ago

hanoii commented 5 years ago

I am researching archiving tools for local research and archiving. The tool should be easy for students to install and use, so I am trying to find the best one out there.

My first question is whether you think warcprox could work for such a use case.

The more I get into this, the more I feel that the browser itself should be in full control of the archiving process. If the browser can browse a site, it should be able to archive it (and even replay it) properly. I wonder whether you have ever considered this, whether you know of anyone who has, and what you think of the approach in general.

Specifically, I mean taking the open-source Chromium or Mozilla codebase and patching it or building on top of it.

This is just an attempt to gather thoughts and opinions from people much more involved in archiving than I currently am, so any input is appreciated.

Thanks.

anjackson commented 5 years ago

Maybe you’d like https://warcreate.com?

ldko commented 5 years ago

There is also the https://webrecorder.io/ project if you haven't looked at that.

nlevitt commented 5 years ago

It is totally feasible to do web archiving by browsing through warcprox, as discussed in #110. WARCreate and Webrecorder are also good options. The best choice depends on the details of your use case.
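To illustrate what "browsing through warcprox" means for a scripted client, here is a minimal sketch that routes requests through a locally running proxy. It assumes warcprox is listening on `localhost:8000`, so adjust the host and port if you start it differently. The function name `build_proxy_opener` and the `ca_cert_path` parameter are made up for this example. HTTPS only verifies if you point `ca_cert_path` at the CA certificate the proxy uses for its man-in-the-middle connections.

```python
import ssl
import urllib.request


def build_proxy_opener(proxy_host="localhost", proxy_port=8000, ca_cert_path=None):
    """Return a urllib opener that sends HTTP and HTTPS traffic via a proxy.

    proxy_host / proxy_port: where the archiving proxy is assumed to listen.
    ca_cert_path: optional path to the proxy's CA certificate (hypothetical
    parameter for this sketch), needed so HTTPS responses re-signed by the
    proxy pass certificate verification.
    """
    proxy_url = f"http://{proxy_host}:{proxy_port}"
    handlers = [
        urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    ]
    if ca_cert_path is not None:
        # Trust the proxy's CA in addition to the system defaults.
        context = ssl.create_default_context(cafile=ca_cert_path)
        handlers.append(urllib.request.HTTPSHandler(context=context))
    return urllib.request.build_opener(*handlers)
```

With the proxy running, something like `build_proxy_opener().open("http://example.com/").read()` should fetch the page through it, and the proxy is then what writes the exchange to a WARC. A browser works the same way once its HTTP/HTTPS proxy settings point at the same host and port.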

hanoii commented 5 years ago

I tried both Webrecorder and WARCreate and neither worked out of the box, at least with Facebook. I am following up on some issues in both projects.

anjackson commented 5 years ago

@hanoii What are you using for playback? Even if the WARCs you make are fine, FB is a pain to play back properly.

hanoii commented 5 years ago

I tried https://github.com/webrecorder/webrecorder-player. Once I actually got warcprox working, it worked fairly well. I still need to try a few things out.

Do you know, by any chance, whether WARCs store streamed video as well?

nlevitt commented 5 years ago

Yes, streamed videos will usually be stored in the WARC. Segmented videos are common these days, though, so playback is another question. It's possible pywb already handles that; I don't know.
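If you want to check whether video segments actually made it into a capture, a rough sketch like the following can list them. It is not part of warcprox. It walks the WARC record headers using only the standard library and reports `response` records whose URL path ends in a typical manifest or segment extension. The suffix list is an assumption (HLS/DASH-style names), and sites that serve segments from extensionless URLs will be missed.

```python
from urllib.parse import urlsplit

# Assumed suffixes for segmented video (HLS/DASH manifests and segments).
SEGMENT_SUFFIXES = (".m3u8", ".mpd", ".ts", ".m4s")


def iter_warc_headers(stream):
    """Yield one dict of lower-cased WARC header fields per record.

    `stream` is a binary file-like object over uncompressed WARC data.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separating records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line where a record should start: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # end of the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Skip the record body; only the headers are needed here.
        stream.read(int(headers.get("content-length", 0)))
        yield headers


def find_segmented_video_uris(stream):
    """Yield target URIs of response records that look like video segments."""
    for headers in iter_warc_headers(stream):
        if headers.get("warc-type") != "response":
            continue
        uri = headers.get("warc-target-uri", "").strip("<>")
        if urlsplit(uri).path.lower().endswith(SEGMENT_SUFFIXES):
            yield uri
```

For a gzipped file, wrapping it as `gzip.open("capture.warc.gz", "rb")` (the file name is just a placeholder) and passing that in should work, since Python's `gzip` module reads the concatenated per-record members as one stream. Seeing segments listed only tells you they were captured; whether a player can stitch them back together on replay is the separate question mentioned above.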