machawk1 / warcreate

Chrome extension to "Create WARC files from any webpage"
https://warcreate.com
MIT License
205 stars 13 forks

Working status, how does it work? #111

Open hanoii opened 5 years ago

hanoii commented 5 years ago

I am in the process of researching archiving tools/techniques for an investigation tool. The number of different tools, and how scattered they are, is amazing.

Plain static archiving is out of the question. I need "deep content", as you put it: being able to store and replay content reached while browsing, such as private Facebook posts.

In trying your tool, I am not sure how it works. I click the button, wait a few seconds and I get a warc file. I couldn't open it though with https://github.com/webrecorder/webrecorder-player or https://github.com/webrecorder/pywb, it gives a record.

It is also unclear to me how I should use it. Is it recording everything all the time? How would storing stuff from Facebook work?

Is this truly an agnostic extension, or does it need to understand the sites you are crawling?

Thanks!!!

hanoii commented 5 years ago

EDIT: also related to #112.

machawk1 commented 5 years ago

Hi @hanoii, I just pushed a new version of WARCreate due to the Chrome Web Store stating a compliance issue. This has happened before, seems automated, and prone to false positives. Given the latest version prior was from 2017, this version should have some improvements in capture quality.

To answer your questions (I hope): WARCreate is currently site-agnostic. It does not capture sites but captures pages. These pages may be content behind authentication, whose payloads are stored similarly to "surface web" content.

What do you mean that pywb/webrecorder-player gives you a "record"?

The primary use case is that while browsing the Web, you should be able to click the icon, click the "generate WARC" button and, potentially after a short delay to amalgamate the resource representations, have a WARC downloaded to your local file system.

WARCreate uses an anticipatory model, collecting the representations in a cache in your browser until you browse to another page, at which point the cache is cleared and regenerated for the current page. If you choose to generate a WARC of the page, this cache is partially used as the basis for WARC creation. Representations like the current HTML page are instead captured at the time the button is pressed, as a cached version would likely be stale if the DOM was manipulated between page load and button push.
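A minimal sketch of that anticipatory model, written in Python purely for illustration (the extension itself is JavaScript). The class and method names are hypothetical and are not taken from WARCreate's code:

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PageCache:
    """Hypothetical per-page cache of resource representations."""

    page_url: str = ""
    resources: Dict[str, bytes] = field(default_factory=dict)

    def on_navigate(self, new_url: str) -> None:
        # Browsing to another page clears the cache; it is then rebuilt
        # as the new page's resources arrive.
        self.page_url = new_url
        self.resources = {}

    def on_response(self, url: str, body: bytes) -> None:
        # Collect each representation in anticipation of a later capture.
        self.resources[url] = body

    def snapshot(self, current_dom: str) -> Dict[str, bytes]:
        # The HTML is taken at button-press time rather than from the cache,
        # because scripts may have changed the DOM after page load.
        captured = dict(self.resources)
        captured[self.page_url] = current_dom.encode("utf-8")
        return captured


cache = PageCache()
cache.on_navigate("https://example.com/")
cache.on_response("https://example.com/", b"<html>stale</html>")
cache.on_response("https://example.com/style.css", b"body{}")
capture = cache.snapshot("<html>live</html>")
```

Here `capture` holds the cached stylesheet unchanged, while the entry for the page itself comes from the live DOM string passed to `snapshot`.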

hanoii commented 5 years ago

@machawk1 me again. I have been exploring a different approach with Electron and I was happy with where it was heading, but it seems they are worried about their end users not wanting to use a separate browser, so I am now more focused on a Chrome extension approach, or maybe some kind of communication between a Chrome extension and an Electron app.

Anyway, I gave this another spin, but I am still failing to get a simple WARC out of it (I mean one that actually renders in webrecorder player, for instance).

I tried drupal.org unsuccessfully. I have the following version installed (tried removing and re-adding).

[Screenshot: screen shot 2019-02-05 at 16 10 14]

Am I missing something?

machawk1 commented 5 years ago

@hanoii I was able to generate a WARC from drupal.org but it is somewhat problematic with respect to replay in pywb and webrecorder player. It is likely an issue with strict validation of the file being produced from WARCreate. I will need to look into it.
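For context on what strict validation tends to check: a WARC/1.0 record is a version line, named header fields, a blank line, the content block, and two trailing CRLFs, with `Content-Length` giving the block size in bytes. The following is a minimal sketch of that framing, not WARCreate's code and not a diagnosis of the actual problem; the function name is hypothetical.

```python
import uuid
from datetime import datetime, timezone


def build_response_record(target_uri: str, http_response: bytes) -> bytes:
    """Serialize one uncompressed WARC/1.0 'response' record."""
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        # Content-Length counts bytes of the block, not characters.
        ("Content-Length", str(len(http_response))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers)
    # A blank line ends the header section; two CRLFs end the record.
    return head.encode("utf-8") + b"\r\n" + http_response + b"\r\n\r\n"


# A non-ASCII body makes the byte/character distinction visible.
body = "<p>café</p>".encode("utf-8")
http = (
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=utf-8\r\n"
    + f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    + body
)
record = build_response_record("https://example.com/", http)
```

Computing the length from a JavaScript string's `length` instead of its encoded byte count is one common way such a file ends up rejected, though whether that applies here is only a guess.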

The approach of using a browser extension and the user's own browser is novel to WARCreate and rightfully so -- it's a tough task, especially when no WARC libraries existed (when WARCreate was written), to cache and save everything accessible from the browser API (and not over the wire) to the local system.

I mentioned WAIL (Electron) in #112. This is the port to Electron of a native app written in Python that I originally wrote to mitigate some barriers in WARCreate. Namely, WARCreate would communicate with WAIL directly. Your desire to have an Electron program be in the loop is somewhat reminiscent of this and feasible, but both parties (the extension and the Electron app) should be receptive to the process. Instead of an ad hoc approach (which is far, far easier), I was hoping to eventually utilize the WASAPI API for WARC "transfer". This would make tools a bit more interoperable.

hanoii commented 5 years ago

Are those limitations still there as far as you can tell, or, if you were to rewrite some parts of it, do you think there are better alternatives? I have yet to look in depth at your code, but I see you cannot access raw data unless you use the network devtools extension API, but that is a devtools-only extension, which could also be an option to consider.

I spoke with the webrecorder guys. I successfully used https://github.com/N0taN3rd/node-warc with a custom, simple browser on Electron to render drupal.org and it worked quite nicely, but now using a Chrome extension is almost mandatory.

I was able to communicate easily between a Chrome extension and an Electron app, and now this other app could be just a Node app rather than Electron, making it easier, but I would still like to do WARC generation from the browser for the exact same reason you state on your project page.

If we go this route I will probably dive much deeper on this extension.

I also tried Facebook (knowing it's a hard site) and I got no WARC at all from it. I am not sure if I have to wait a lot longer, but I waited quite a bit.

hanoii commented 5 years ago

@machawk1 on top of the questions above, I was looking a bit more into the code this morning. I might need to go over it in more depth, but I see you are re-fetching CSS/JS/images so that you can get their data, correct? Is that the cache you mentioned? So you are storing request information but mostly refetching everything for which you don't have the data?

How would XHR or AJAX requests be handled? Would you also refetch those?

And I've been looking and it seems there's still no other way of getting the raw data of the request unless you do it externally through CDP or within a chrome dev extension. Right?

machawk1 commented 5 years ago

@hanoii webRequest allows WARCreate to read some headers and payload when they come over the wire. I believe due to some synchronicity issues, there was a need (as implemented) to refetch some resources based on the payloads "missing" as analyzed when the WARC is being created.
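The refetch decision described here can be sketched as a comparison between the requests that were observed and the payloads actually held when the WARC is being created. This is an illustration in Python under assumed data shapes, not the extension's implementation; the function name and the use of `None` for a payload that never arrived are hypothetical.

```python
from typing import Dict, List, Optional


def urls_needing_refetch(
    observed: List[str], payloads: Dict[str, Optional[bytes]]
) -> List[str]:
    """Return observed URLs with no stored payload, in order, without duplicates."""
    missing: List[str] = []
    for url in observed:
        # An empty body (b"") is a valid payload; only None or absent counts as missing.
        if payloads.get(url) is None and url not in missing:
            missing.append(url)
    return missing


observed = [
    "https://example.com/",
    "https://example.com/app.js",
    "https://example.com/logo.png",
    "https://example.com/app.js",
]
payloads = {
    "https://example.com/": b"<html></html>",
    "https://example.com/app.js": None,  # headers seen, body never stored
}
to_refetch = urls_needing_refetch(observed, payloads)
```

In this example `to_refetch` contains `app.js` and `logo.png` once each, while the page itself is served from what was already stored.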

I should note that WARCreate should maintain a privileged trait with regard to AJAX and CORS. Normally, fetching resources in this way would cause the request to be rejected from the server hosting the resource. Also note that at one point we tried moving from XHR to Fetch but the latter was limited in what headers could be read from the response, so that information would be unavailable to be included in the WARC. Hence, you will see a bit of XHR in the code instead of the modern alternative.

Using devtools would make the job easier. The API did not exist when I initially created WARCreate but after it was introduced, with a cursory analysis and the advice of @N0taN3rd, we found that it needed to be "open" to be accessible. I am unsure if this is still the case but if not, would be open to explore using devtools if it gives a more comprehensive ability to capture what's coming over the wire. Having the raw data would be ideal but WARCreate currently attempts to account for the inability to do so at the time.

Another good thing to have would be a means of evaluating the extension. I have had reports of "it does not work on site X" but the reason is rarely distilled to be debuggable. Some sample, hosted Web pages that isolate a problematic feature would help to be able to isolate shortcomings. Having these hosted in a predictable environment (e.g., a test suite consisting of fundamental features on GitHub pages) would be helpful.

hanoii commented 5 years ago

> I should note that WARCreate should maintain a privileged trait with regard to AJAX and CORS. Normally, fetching resources in this way would cause the request to be rejected from the server hosting the resource. Also note that at one point we tried moving from XHR to Fetch but the latter was limited in what headers could be read from the response, so that information would be unavailable to be included in the WARC. Hence, you will see a bit of XHR in the code instead of the modern alternative.

I am not sure I understood this. I was wondering what you do with XHR or AJAX requests to store that data in the WARC: are those also re-fetched, or is that payload actually available through the webRequest API?

I recently worked on this site: www.moogmusic.com. It's an Angular site with a REST interface, so although probably complex to capture, it's predictable in the sense that there are no random query strings appended or anything. A quick try with the extension also fails to replay afterwards in webrecorder player, and looking at the player's devtools I see missing requests for the REST resource.
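One way to check whether a request such as the REST call was captured at all, as opposed to captured but not replayed, is to list the `WARC-Target-URI` values in the file. The sketch below assumes an uncompressed `.warc` with well-formed record framing; the function names and URLs are hypothetical, and the sample records are abbreviated (they omit mandatory fields such as `WARC-Record-ID`).

```python
from typing import List


def target_uris(warc: bytes) -> List[str]:
    """List WARC-Target-URI values from an uncompressed WARC byte string."""
    uris: List[str] = []
    pos = 0
    while pos < len(warc):
        header_end = warc.find(b"\r\n\r\n", pos)
        if header_end == -1:
            break  # truncated or malformed record; stop rather than guess
        length = None
        for line in warc[pos:header_end].split(b"\r\n")[1:]:  # skip version line
            name, _, value = line.partition(b":")
            key = name.strip().lower()
            if key == b"warc-target-uri":
                uris.append(value.strip().decode("utf-8"))
            elif key == b"content-length":
                length = int(value.strip())
        if length is None:
            break  # cannot locate the next record without Content-Length
        # Skip the blank line, the block, and the two trailing CRLFs.
        pos = header_end + 4 + length + 4
    return uris


def _record(uri: str, block: bytes) -> bytes:
    # Abbreviated record, just enough framing to exercise the parser.
    head = (
        "WARC/1.0\r\nWARC-Type: resource\r\n"
        f"WARC-Target-URI: {uri}\r\nContent-Length: {len(block)}\r\n"
    )
    return head.encode("utf-8") + b"\r\n" + block + b"\r\n\r\n"


sample = _record("https://example.com/", b"<html></html>") + _record(
    "https://example.com/api/items", b'{"a":\r\n\r\n1}'
)
found = target_uris(sample)
expected = ["https://example.com/api/items", "https://example.com/api/other"]
not_captured = [u for u in expected if u not in found]
```

Because the parser advances by `Content-Length` rather than searching for the next version line, a blank line inside a payload does not confuse it. Any URL left in `not_captured` was never written to the file, which points at capture rather than replay.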

> Using devtools would make the job easier. The API did not exist when I initially created WARCreate but after it was introduced, with a cursory analysis and the advice of @N0taN3rd, we found that it needed to be "open" to be accessible. I am unsure if this is still the case but if not, would be open to explore using devtools if it gives a more comprehensive ability to capture what's coming over the wire. Having the raw data would be ideal but WARCreate currently attempts to account for the inability to do so at the time.

I believe it still has to be open, but it could potentially be a good compromise if it really helps.

> Another good thing to have would be a means of evaluating the extension. I have had reports of "it does not work on site X" but the reason is rarely distilled to be debuggable. Some sample, hosted Web pages that isolate a problematic feature would help to be able to isolate shortcomings. Having these hosted in a predictable environment (e.g., a test suite consisting of fundamental features on GitHub pages) would be helpful.

Have you had any success in capturing twitter/facebook? I know now that replaying what's captured on those sites is complex on its own.

I don't know yet how to distill an issue from a site unless I debug the extension extensively. But as mentioned above I also tried lanacion.com.ar, and that one didn't replay properly either.

But if I find something concrete I will be sure to let you know.

The one thing I am mostly worried about is the real feasibility of creating a WARC from a regular extension. You seem to have done a great job overcoming some of the issues, but I wonder if others are simply not possible to solve within the extension. Is your current feeling that there should be a way to create a fully valid WARC out of every interaction between the site and the server from a regular extension?

payingattention commented 4 years ago

> I couldn't open it though with https://github.com/webrecorder/webrecorder-player or https://github.com/webrecorder/pywb, it gives a record.

I can confirm that when saving via "Uploading [WARC] to Collection" with https://conifer.rhizome.org/037 (https://WebRecorder.io/037 changed name), your WARC file only gets to 50% and then says "Error Encountered".

[Screenshot: Rhizome-Conifer _ #Accounts (Web archive collection for user @037)]

The same https://twitter.com/prosodyContext page saved with the https://WebRecorder.net https://github.com/webrecorder/webrecorder-desktop ".AppImage" binary works (I do not mean to make you compete; having multiple tools is important). I can provide the WARCreate WARC archive(s) in question if you ask.

Should I open a separate issue, or is this thread the appropriate place?