machawk1 / warcreate

Chrome extension to "Create WARC files from any webpage"
https://warcreate.com
MIT License

Archiving plain text files on the web includes Chrome's display wrapper #62

Closed machawk1 closed 10 years ago

machawk1 commented 10 years ago

For example, the WARC at http://matkelly.com/temp/20140811154106173.warc, created from http://www.cs.odu.edu/~mln/pubs/bio.txt, includes:

<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">

...which is not part of the original source.
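A minimal sketch of one way to detect and strip this wrapper before writing the response record. It is not necessarily the fix used in WARCreate: the function name `unwrapPlainText` and its parameters are hypothetical, and it assumes the wrapper always has the shape quoted above, which may differ between Chrome versions.

```javascript
// Assumption: Chrome wraps text/plain responses in exactly this kind of markup.
const WRAPPER_PREFIX = /^<html><head><\/head><body><pre style="[^"]*">/;
const WRAPPER_SUFFIX = /<\/pre><\/body><\/html>\s*$/;

// Returns the original plain text if the serialized DOM looks like Chrome's
// display wrapper around a text/plain response; otherwise returns the input unchanged.
function unwrapPlainText(serializedDom, contentType) {
  const isPlainText = /^text\/plain\b/i.test(contentType || "");
  if (!isPlainText) return serializedDom;

  const prefix = serializedDom.match(WRAPPER_PREFIX);
  if (!prefix || !WRAPPER_SUFFIX.test(serializedDom)) return serializedDom;

  const inner = serializedDom.slice(prefix[0].length).replace(WRAPPER_SUFFIX, "");

  // DOM serialization escapes these characters in text nodes; undo that,
  // handling "&amp;" last so already-escaped text is not unescaped twice.
  return inner
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&nbsp;/g, "\u00a0")
    .replace(/&amp;/g, "&");
}
```

Checking the Content-Type first keeps real HTML pages that happen to start with a `<pre>` element from being altered.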

machawk1 commented 10 years ago

The webRequest API ( https://developer.chrome.com/extensions/webRequest ) does not appear to offer a way to intercept or read the content of a response (HTML, plain text, or otherwise), much less the gzipped/raw version of it. Issuing a subsequent Ajax request to fetch the content is unacceptable: the user may have manipulated the state of the page, and that state might not be represented once the page is refetched or reloaded.
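To illustrate the limitation described above, here is a hedged sketch of what a webRequest listener can capture: the status line and response headers, but no body. The helper name `captureResponseHeaders` and the `store` object are hypothetical; the `webRequest` argument is passed in so the function can be exercised outside an extension.

```javascript
// Registers a listener that records the status line and response headers
// for each request. The details object carries no response body.
function captureResponseHeaders(webRequest, store) {
  webRequest.onHeadersReceived.addListener(
    (details) => {
      store[details.url] = {
        statusLine: details.statusLine,
        // responseHeaders is only populated when "responseHeaders" is requested below.
        headers: (details.responseHeaders || []).map((h) => h.name + ": " + h.value),
      };
    },
    { urls: ["<all_urls>"] },
    ["responseHeaders"]
  );
}

// In an extension with the webRequest permission, this would be called as:
// captureResponseHeaders(chrome.webRequest, {});
```

The body therefore has to come from somewhere else, such as the DOM, which is where the display wrapper and the decoded (non-gzipped) content enter the picture.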

machawk1 commented 10 years ago

Still need to account for the discrepancy between the recorded Content-Length header and the stored content: the header reflects the gzipped response, but content accessed via the DOM is not gzipped.
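One possible way to reconcile the two, sketched under stated assumptions: recompute Content-Length from the bytes actually stored and drop Content-Encoding, since the stored body is no longer compressed. The function name `rewriteEntityHeaders` is hypothetical, and the sketch assumes the body is written to the WARC as UTF-8. Whether rewriting headers is acceptable for a given archive is a separate policy question.

```javascript
// headerLines: array of header strings such as "Content-Type: text/plain"
// (the status line may be included; it is passed through untouched).
// body: the decoded content as a string, as obtained from the DOM.
function rewriteEntityHeaders(headerLines, body) {
  // Count bytes, not characters: non-ASCII text takes more than one byte in UTF-8.
  const byteLength = new TextEncoder().encode(body).length;

  // Drop the headers that described the compressed, on-the-wire payload.
  const kept = headerLines.filter(
    (line) => !/^(content-length|content-encoding)\s*:/i.test(line)
  );

  kept.push("Content-Length: " + byteLength);
  return kept;
}
```

Using `body.length` instead would undercount any content containing multi-byte characters.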

machawk1 commented 10 years ago

Example WARC now produced for above URI: http://matkelly.com/temp/20140813180505625.warc

It replays well in webrecorder. Related to https://github.com/ikreymer/pywb/issues/42