cleymour opened 9 years ago
This point has not been solved in the WARC 1.1 revision.
I think it would be possible to adopt a deterministic approach that saves exactly what the user gets: crawl the websites with a headless browser interface like Selenium, and capture and store all requests of the supported protocols (HTTP, WebSockets, etc.) using an SSL-capable sniffer like mitmproxy or a low-level browser hook.
The logged traffic can then be replayed by replacing the browser's low-level communication functions with a parser that reads the captured data. I don't know to what extent this could be implemented in the WARC format, as reading the specification is still on my to-do list, but I want to propose this method as a trustworthy way to archive complex dynamic sites.
Update: I've been able to set up a proof of concept based on the default options of mitmproxy's server-side replay feature.
1. Run

   ```
   mitmproxy --mode socks5 --listen-port 1080 -w captured_data
   ```

   and perform the crawl through the local SOCKS proxy (you may need to install the unique wildcard certificate for HTTPS sites).
2. Disconnect from the Internet, run

   ```
   mitmproxy --mode socks5 --server-replay captured_data --server-replay-nopop --no-server-replay-kill-extra
   ```

   then visit the saved site.
Obviously this procedure is not mature enough, and resorting to such a low-level mechanism means losing the request-response correlation. This is why I suggested a browser-based approach instead, hooking the low-level functions inside a modern open-source browser.
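For what it's worth, the WARC format already has a way to express that correlation: a record can point at the record for the other half of the exchange via the `WARC-Concurrent-To` header. A minimal stdlib sketch of how a captured XHR exchange could be linked (the payloads and URI are made up for illustration; a real writer would handle digests, content types, and gzip):

```python
import uuid
from datetime import datetime, timezone

def warc_record(record_type, target_uri, payload, concurrent_to=None):
    """Build a minimal WARC/1.1 record as bytes (illustrative, not a full writer)."""
    record_id = f"<urn:uuid:{uuid.uuid4()}>"
    headers = [
        "WARC/1.1",
        f"WARC-Type: {record_type}",
        f"WARC-Record-ID: {record_id}",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
    ]
    if concurrent_to:
        # Links this record to the one capturing the other half of the exchange.
        headers.append(f"WARC-Concurrent-To: {concurrent_to}")
    headers.append(f"Content-Length: {len(payload)}")
    head = "\r\n".join(headers) + "\r\n\r\n"
    return record_id, head.encode() + payload + b"\r\n\r\n"

# A captured XHR request and its response, linked together:
req_id, req_rec = warc_record("request", "https://example.com/api",
                              b"GET /api HTTP/1.1\r\nHost: example.com\r\n\r\n")
resp_id, resp_rec = warc_record("response", "https://example.com/api",
                                b"HTTP/1.1 200 OK\r\n\r\n{}",
                                concurrent_to=req_id)
```

So a proxy-based capture wouldn't have to give up the correlation, as long as the capture tool records which response belonged to which request at write time.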
After taking a peek at the specification, I've spotted some references to capturing raw requests. Accurately capturing a dynamic website involves not only following links while crawling it, but also programmatically triggering all the scripted user interactions and capturing the resulting requests.
The growing interest in technologies like WebAssembly makes it harder to implement my idea of getting exact replicas of some advanced websites, but this kind of implementation allows a more complete processing of dynamic pages and web applications, whether through an automated crawl based on DOM event triggering with a headless web engine or, in some specific cases, through custom scripts or human interaction.
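As a rough illustration of what "triggering scripted interactions" entails, a crawler first has to enumerate the event handlers a page registers. A toy stdlib sketch that only finds inline handlers in static HTML (the sample markup is invented; real pages mostly register handlers via `addEventListener`, which is why a live headless engine driven by something like Selenium is needed in practice):

```python
from html.parser import HTMLParser

class InlineHandlerFinder(HTMLParser):
    """Collect elements that declare inline event handlers (onclick, onmouseover, ...)."""
    def __init__(self):
        super().__init__()
        self.handlers = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Inline handler attributes all start with "on".
            if name.startswith("on"):
                self.handlers.append((tag, name, value))

page = '<button onclick="loadMore()">More</button><a href="#" onmouseover="peek()">hi</a>'
finder = InlineHandlerFinder()
finder.feed(page)
# finder.handlers → [('button', 'onclick', 'loadMore()'), ('a', 'onmouseover', 'peek()')]
```

Each discovered handler is a candidate interaction for the crawler to fire, after which the resulting requests would be captured like any others.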
All the above is a compilation of verbose but uninformed thoughts; if there is some interest in it, I'll try to reword it and make some parts concrete after reading the specification.
@0x2b3bfa0, if you haven't seen them, you may be interested in https://github.com/internetarchive/warcprox and https://github.com/internetarchive/brozzler, which take an approach similar to what you describe.
It's unclear to me what specifically this issue means by 'rendering AJAX interactions'. AJAX interactions themselves are standard HTTP messages, so they can seemingly already be recorded.
Another fully-working open-source example of complicated capture and playback is https://github.com/webrecorder/webrecorder
Definition: it is necessary to have a common way of rendering AJAX interactions in WARC.
Decision: Propose a way to record rendering files either in the standard or as an appendix (NB from Clément: I would probably vote for the second solution).
Action: someone from the LOCKSS team to propose something.