machawk1 / warcreate

Chrome extension to "Create WARC files from any webpage"
https://warcreate.com
MIT License

Store screenshot of page in WARC, too #109

Open machawk1 opened 5 years ago

machawk1 commented 5 years ago

In https://kris-sigur.blogspot.com/2018/11/on-screenshots-in-warcs.html, @kris-sigur describes the storage of a screenshot in a WARC file. This would be useful for others (e.g., @CamtheWicked on Twitter, for whom I could not find a GitHub handle) and might be easy(-er) to accomplish by leveraging the native Chrome APIs where available.

I have not worked with the devtools(?) API programmatically from an extension, but this seems like it would be a suitable use case for preservation using a browser extension.

/cc @N0taN3rd because I think he may have worked with this part of the Chrome/Web-extension API.

N0taN3rd commented 5 years ago

I believe there are two options:

  1. using the tabCapture extension api (never played with this)
  2. using the debugger permission and CDP command Page.captureScreenshot

machawk1 commented 5 years ago

@N0taN3rd Thanks for the input!

tabCapture seems to be limited to the current viewport, excluding anything that is not currently visible. This would be useful, but I think the "screenshot" a user expects is of the whole page, regardless of what is currently visible.

The second option might be more feasible but a little more complex. I think it will require calling chrome.debugger.getTargets(), identifying the current tab among the results (I am not yet sure what else qualifies as a target), and then calling chrome.debugger.sendCommand() with that target and Page.captureScreenshot as the method, without any commandParams, per https://developer.chrome.com/extensions/debugger#method-sendCommand (the defaults appear to be suitable).
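A minimal sketch of that flow, under a few assumptions: the extension declares the debugger permission, the tab ID is already known (so a `{ tabId }` debuggee stands in for the chrome.debugger.getTargets() lookup), and the callback-style API is wrapped in Promises. The helper names `callChrome` and `captureTabScreenshot` are hypothetical. Note that chrome.debugger.attach() has to succeed before sendCommand() can be used, and whether the default Page.captureScreenshot parameters cover the whole page rather than only the viewport has not been verified here.

```javascript
// Hypothetical helper: turns a callback-style chrome.* call into a Promise.
// `invoke` receives the callback and is expected to pass it to the API.
function callChrome(invoke) {
  return new Promise((resolve, reject) => {
    invoke((result) => {
      const err = chrome.runtime.lastError;
      if (err) {
        reject(new Error(err.message));
      } else {
        resolve(result);
      }
    });
  });
}

// Hypothetical helper: resolves to the base64-encoded image data that
// Page.captureScreenshot returns for the given tab.
async function captureTabScreenshot(tabId) {
  const target = { tabId };

  // The debugger must be attached before any command can be sent.
  await callChrome((cb) => chrome.debugger.attach(target, '1.3', cb));
  try {
    // Empty params, i.e. the defaults discussed above.
    const result = await callChrome((cb) =>
      chrome.debugger.sendCommand(target, 'Page.captureScreenshot', {}, cb)
    );
    return result.data;
  } finally {
    // Always detach, even if the capture failed.
    await callChrome((cb) => chrome.debugger.detach(target, cb));
  }
}
```

From a background script this might be called as `captureTabScreenshot(tab.id).then((b64) => ...)`, with `tab` obtained from something like chrome.tabs.query.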

EDIT: ...and of course, converting the base64-encoded image data to something more suitable for WARC record storage. It might be easiest to keep it as base64 in the WARC, but I am unsure whether there will be issues with interpretation given that it is not a resource of web origin.
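For comparison, here is a rough sketch of the alternative: decoding the base64 data to raw bytes and wrapping them in a WARC record. Which record type is appropriate for a screenshot is an open question in this thread; the `resource` type, the `image/png` content type, and the caller-supplied target URI, record ID, and date are all assumptions for illustration, and `base64ToBytes` and `buildScreenshotRecord` are hypothetical names.

```javascript
// Decode a base64 string into raw bytes.
function base64ToBytes(b64) {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Build one WARC record (header, blank line, payload, two CRLFs) as bytes.
// Assumption: a `resource` record with an image/png payload.
function buildScreenshotRecord({ base64Png, targetUri, recordId, date }) {
  const payload = base64ToBytes(base64Png);
  const header = [
    'WARC/1.0',
    'WARC-Type: resource',
    `WARC-Target-URI: ${targetUri}`,
    `WARC-Date: ${date}`,
    `WARC-Record-ID: <${recordId}>`,
    'Content-Type: image/png',
    // Content-Length counts the decoded bytes, not the base64 characters.
    `Content-Length: ${payload.length}`,
  ].join('\r\n') + '\r\n\r\n';

  const encoder = new TextEncoder();
  const headerBytes = encoder.encode(header);
  const trailer = encoder.encode('\r\n\r\n');

  const record = new Uint8Array(headerBytes.length + payload.length + trailer.length);
  record.set(headerBytes, 0);
  record.set(payload, headerBytes.length);
  record.set(trailer, headerBytes.length + payload.length);
  return record;
}
```

Storing the decoded bytes keeps Content-Length and Content-Type consistent with an ordinary image payload; keeping the data as base64 would instead need a content type or encoding that readers of the WARC know how to interpret.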