machawk1 / warcreate

Chrome extension to "Create WARC files from any webpage"
https://warcreate.com
MIT License
205 stars 13 forks

[discussion/thought] Would a custom browser solution work better in terms of capabilities/UI than most current tools/proxies? #112

Open hanoii opened 5 years ago

hanoii commented 5 years ago

The more I get into this, the more I feel that the browser itself should be in full control of the archiving process. As long as the browser can browse a site, it should be able to archive it (and even replay it) properly. I wonder whether you have ever considered this or know of someone who has, and what you think of the approach in general.

Specifically, I mean taking the Chromium or Mozilla open source project and patching or building on top of it.

Just a general thought, but I value opinions from people who are much more involved in archiving than I currently am.

machawk1 commented 5 years ago

@hanoii I agree. The tool one uses to view the Web ought to be the same tool that is used to archive it. Daily-driver browsers (e.g., Chrome and Firefox) do not natively support writing or reading WARCs.

Tools exist that leverage a separate browser/tool to generate a WARC, but needing to switch tools to archive is not ideal and a reason WARCreate exists -- to allow the same tools one uses daily (here, Chrome) to also archive the Web. Having an ad hoc fork of Chromium with WARC support, while interesting, suffers from the same ad hoc problem.

The good news is that since WARCreate's inception (~2011 😅), some tools are moving toward leveraging your regular browser experience for preservation. For example, Squidwarc is working toward using your browser's own cookies to provide an additional rich result of captures behind authentication.

Thanks for the feedback and interest. I welcome further discussion.

hanoii commented 5 years ago

@machawk1 thanks for the reply.

For me, having a separate tool is not much of a problem. It is actually a benefit: it lets me isolate archiving in its own process, it helps with privacy because you can log in there with different credentials, and it allows add-on tooling that can keep up with new technologies and sites' anti-crawling techniques independently. But I agree that if everything can live in the same tool, that's interesting.

Forking Chrome is still something I am just toying with, although I can see it not being easy work.

The good thing about having your own tool is that you can do a lot more than create the WARC file: screenshots, annotations, and even access to external tools (youtube-dl) become possible, whereas an extension is always limited to what the host allows.

I still think this is a great approach, and I am looking forward to trying the new version, since you mentioned in #111 that the Chrome store has a stale one.

Also, I need installing and using it not to be too hard, as this is for an audience of students doing research/investigation on a specific topic.

Squidwarc and warcprox are also interesting approaches - I haven't played a lot with squidwarc yet.

Right now I am also exploring Electron as a mid-term alternative. It's Chromium, you don't need to build it, you can run native applications, and you can access Chromium APIs like chrome.webRequest. Still, it has its issues that you have to handle manually.

I would appreciate your thoughts as well. I need to recommend and estimate different options for something that is going to be funded and later open-sourced, so understanding the different concepts and the problems you had to overcome is super useful.

machawk1 commented 5 years ago

@hanoii There has been some discussion relatively recently on approaches toward preserving the Web. I think WARCreate has some merit in ease of installation (click a button in the Chrome store) and usage (a single button to generate a WARC), at the expense of having been novel at a time when no software libraries existed on which to build WARC files and the browser APIs to do so were inadequate.

That said, it is one approach (reusing the user's browser); other WARC-generation tools take their own. For example:

As another data point, WAIL (Electron) is an Electron app that attempts to bundle browser-based crawling into a native interface. For disclosure, I was the creator of the original WAIL application, which I developed to fill a shortcoming of browser APIs at the time for WARCreate (more info), but the re-imagining is the handiwork of @N0taN3rd, who is now an employee of @webrecorder and creator of Squidwarc.

hanoii commented 5 years ago

@machawk1 I saw all of them. Webrecorder didn't play that nicely with Facebook, unfortunately, which is the site I am trying most tools with, as it's the one we are mostly interested in archiving and likely very complex.

I saw WAIL. It's based on an older stack of both Electron and pywb, but I might certainly get to look at it. I didn't expect the tools to work right out of the box, so I am also trying to choose the tool I could contribute more to. The one thing I like about WARCreate is that its codebase is manageable, and all JavaScript. Still, I would like a bit more flexibility around not just being able to get a WARC archive but maybe other stuff.

How would WARCreate behave with streaming media?

machawk1 commented 5 years ago

not just being able to get a WARC archive but maybe other stuff.

What sort of other stuff? Some browsers also natively support the HAR format and @ikreymer even created a library to convert from har2warc, so there may be potential there with regard to preservation.
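To make the HAR-to-WARC idea concrete, here is a minimal Python sketch that serializes one HAR entry as a WARC/1.0 `response` record. It is not how har2warc is implemented; the function name `har_entry_to_warc_response` is hypothetical, and the sketch assumes the HAR entry carries the response body in `content.text`. It also copies the HAR timestamp as-is, whereas a real converter would normalize it to the UTC format WARC expects and reconcile headers such as `Content-Encoding` with the already-decoded body.

```python
import base64
import uuid


def har_entry_to_warc_response(entry: dict) -> bytes:
    """Serialize one HAR entry as a single WARC/1.0 'response' record (sketch)."""
    response = entry["response"]
    content = response.get("content", {})
    text = content.get("text", "")
    # HAR stores binary bodies base64-encoded and flags them via "encoding".
    if content.get("encoding") == "base64":
        body = base64.b64decode(text)
    else:
        body = text.encode("utf-8")

    # Rebuild the HTTP response message that forms the WARC record block.
    status_line = (
        f'{response.get("httpVersion", "HTTP/1.1")} '
        f'{response["status"]} {response.get("statusText", "")}'
    )
    header_lines = [f'{h["name"]}: {h["value"]}' for h in response.get("headers", [])]
    http_head = "\r\n".join([status_line, *header_lines]) + "\r\n\r\n"
    block = http_head.encode("utf-8") + body

    warc_headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f'WARC-Date: {entry["startedDateTime"]}',  # assumption: already UTC
        f'WARC-Target-URI: {entry["request"]["url"]}',
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(block)}",
    ]
    # A record is: headers, blank line, block, then two CRLFs.
    return "\r\n".join(warc_headers).encode("utf-8") + b"\r\n\r\n" + block + b"\r\n\r\n"
```

Concatenating the records produced for every entry in a HAR file's `log.entries` list would give a rough, uncompressed WARC file.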

@N0taN3rd is more than aware of WAIL-Electron using older versions of Electron and pywb. I continually encourage him to keep developing it despite his new affiliation. I am hoping the pings in this thread will serve as reassurance to the continued need of an app like his. ;-)

I have not done extensive validation of WARCreate with regard to streaming media in a while, so am unsure. Some more testing is in order and while I have appreciated user feedback in the past, have been unable to attract development cycles from others despite the codebase being all-JS.

I am unsure if this is because of the nature of the audience or the quality of the project being a detractor. For example, many users who want a simple non-technical solution may not have coding experience. On the flip side, those who can code may not contribute because they are looking for more technical solutions.

Any suggestions you have on making the tool more useful and functional from a technical perspective would be appreciated. A lot of the feedback has been high level, which is useful for the conveyed use case, but generally does not improve the software overall.

hanoii commented 5 years ago

What sort of other stuff? Some browsers also natively support the HAR format and @ikreymer even created a library to convert from har2warc, so there may be potential there with regard to preservation.

I need to keep, if possible, better consumable media. The WARC part is probably the default archiving to the replaying side of things, but it might make sense to have screenshots, annotations, tagging, and then maybe storing it somewhere. Attempting to fetch youtube videos and/or facebook through youtube-dl could come handy, so exploring the option to do that from, at least for now, an electron app at least as an initial PoC tool.

For the storage side of things I think https://www.archivematica.org could potentially work.

And if I go (or suggest) the Electron route, I think some of what you did could either be reused or serve as inspiration, and if I do use it or look at it more closely, I am sure I will have feedback on it.

hanoii commented 5 years ago

@machawk1 Is there anything you can share about the path of building this tool, in terms of big pitfalls that you found or any unorthodox thing you might have done to work around them?

It seems to me that https://developer.chrome.com/extensions/webRequest is a lot of what you need, but a quick question: did you have to rely heavily on other APIs? I will definitely go through the code in more depth, but it's always helpful to have the overall approach and difficulties in mind while going about things.

I am making good progress on the electron side of things. It's great how it has progressed.

machawk1 commented 5 years ago

@hanoii The capabilities and scope of Chrome extensions have come a long way since I originally created WARCreate. There was no webRequest API initially, writing files outside of the browser file sandbox was impossible, and the WebExtensions standard did not exist (Firefox was still using XUL-based add-ons).

Then webRequest was introduced as an experimental API and eventually became accessible outside of Canary. One big issue that webRequest mitigated (and there may be a more elegant way to accomplish it now) was reading the raw stream/bytes as they came "over the wire". This would have made caching these bytes for writing a lot easier but was not possible at the time. I believe something within the debugging/console API may make the process even easier than using webRequest.

The other issue was breaking out of the sandbox for writing. Per the blog post I linked before, accomplishing this initially required a "local server", which was unacceptable for a solution. There was no HTML File API then, but eventually some libraries made this process possible as the standards made their way to the browsers.

I look forward to seeing what you come up with using Electron.

hanoii commented 5 years ago

@machawk1 also just saw https://github.com/N0taN3rd/node-warc from @N0taN3rd which is likely to help.

N0taN3rd commented 5 years ago

To put my 2 cents in, the way WARCreate does things is the way to do it. There is an alternative way to do things, but it would be painful due to the limitations of the browser, and thus the contribution opportunity is still open. Similar contribution opportunities are open for the other projects mentioned here: all of them welcome improvements that fit your needs with open arms, and/or detailed issues regarding their shortcomings.

@hanoii If you have any questions etc., feel free to contact me or @ikreymer.

machawk1 commented 5 years ago

@N0taN3rd Thanks for chiming in here. :)

Can you provide some insight into the other (alternative) ways to do it from the browser? That could help guide other potential solutions, and you are knowledgeable enough about all things JS that your pointers could help ensure the browser's capabilities (re: extensions) are fully utilized.

N0taN3rd commented 5 years ago

The alternative I was alluding to is going the route of how Squidwarc does preservation, using the Chrome DevTools Protocol (CDP). Chrome extensions can use the CDP via the chrome.debugger API. The "painful" part of it is the limitations enforced on blobs, and that preservation could not operate per tab: it would have to use a separate tab for "crawling" the pages to be preserved.
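As a rough illustration of what a CDP-based capture loop involves, the Python sketch below builds the protocol messages only. `Network.enable`, `Network.responseReceived`, `Network.loadingFinished`, and `Network.getResponseBody` are real CDP names; the `CdpCaptureSession` class is hypothetical, the transport (a DevTools WebSocket, or `chrome.debugger.sendCommand` inside an extension) is deliberately left out, and this is not Squidwarc's actual implementation.

```python
import itertools
import json


class CdpCaptureSession:
    """Tracks CDP Network events and builds the commands that fetch bodies."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.urls = {}     # requestId -> URL seen in Network.responseReceived
        self.pending = {}  # command id -> requestId whose body was requested

    def command(self, method, params=None):
        # Every CDP command carries a unique integer id, a method, and params.
        return {"id": next(self._ids), "method": method, "params": params or {}}

    def start(self):
        # Network events are only emitted after the Network domain is enabled.
        return json.dumps(self.command("Network.enable"))

    def on_event(self, raw):
        """Handle one incoming message; return a command to send, or None."""
        event = json.loads(raw)
        method, params = event.get("method"), event.get("params", {})
        if method == "Network.responseReceived":
            self.urls[params["requestId"]] = params["response"]["url"]
        elif method == "Network.loadingFinished" and params["requestId"] in self.urls:
            # The body is complete only once loading has finished.
            cmd = self.command(
                "Network.getResponseBody", {"requestId": params["requestId"]}
            )
            self.pending[cmd["id"]] = params["requestId"]
            return json.dumps(cmd)
        return None
```

The reply to each `Network.getResponseBody` command arrives with the same `id`, so `pending` is what lets a caller pair a returned body with the URL it belongs to.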

ikreymer commented 5 years ago

There are a few options from the perspective of a browser. I like the HAR approach, as the browser just gives that to you directly; the only downside is that you have to have DevTools open.

@hanoii As mentioned in the other issue, Facebook is particularly complicated and requires custom tweaking. We will take a look at tweaking it for the time being, but no guarantee that it won't break again in the future.

I need to keep, if possible, better consumable media. The WARC part is probably the default archiving to the replaying side of things, but it might make sense to have screenshots, annotations, tagging, and then maybe storing it somewhere. Attempting to fetch youtube videos and/or facebook through youtube-dl could come handy, so exploring the option to do that from, at least for now, an electron app at least as an initial PoC tool.

A lot of this is what Webrecorder is also trying to support. The issue with Facebook is not the capture process, but usually the replay/reproducibility of dynamic content that changes on each load. We have been working on this for 5+ years, and Facebook remains difficult, as mentioned in the other issue webrecorder/webrecorder#664

We are also considering options for a desktop/Electron WR that is not just a player but can also do capture, but our resources are limited. If this is something you'd be interested in helping out with, let's chat :)

ikreymer commented 5 years ago

More specifically, the issue with FB is that we need to 'fuzzy match' requests to responses, and the rules for how that works keep changing (because Facebook changes their API).

A custom browser solution for capture will not really help with any of that; it needs to be done at request/response lookup time. Here are some (slightly old) docs on how this system works: https://github.com/webrecorder/pywb/wiki/Fuzzy-Match-Rules Mostly, this system hasn't changed much, and we definitely need to add it to the latest docs!
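To illustrate the general idea of fuzzy matching at lookup time, here is a minimal Python sketch. It does not reproduce pywb's rule format or its actual rules; the parameter names in `VOLATILE_PARAMS` and the helper names `fuzzy_key`, `build_index`, and `lookup` are assumptions made up for this example. The sketch tries an exact URL match first and only then falls back to a key with per-load parameters stripped.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical query parameters that change on every page load.
VOLATILE_PARAMS = {"_", "cb", "timestamp"}


def fuzzy_key(url: str) -> str:
    """Drop volatile parameters and sort the rest to get a stable lookup key."""
    parts = urlsplit(url)
    kept = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in VOLATILE_PARAMS
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def build_index(captured_urls):
    """Map each fuzzy key to the first captured URL that produced it."""
    index = {}
    for url in captured_urls:
        index.setdefault(fuzzy_key(url), url)
    return index


def lookup(captured_urls, index, requested_url):
    """Return the captured URL to replay for a request, or None if no match."""
    if requested_url in captured_urls:
        return requested_url  # exact match wins
    return index.get(fuzzy_key(requested_url))


captured = {"https://example.com/api?id=7&_=1111"}
index = build_index(captured)
match = lookup(captured, index, "https://example.com/api?_=2222&id=7")
```

Here `match` is the originally captured URL even though the cache-busting `_` value differs at replay time. When a site renames or adds such parameters, a rule set like `VOLATILE_PARAMS` has to be updated, which is the maintenance burden described above.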