gildas-lormeau / SingleFile

Web Extension for saving a faithful copy of a complete web page in a single HTML file
GNU Affero General Public License v3.0

Saving webpages with images replaced by substitutes (eg. white boxes)? #279

Closed nikcheerla closed 4 years ago

nikcheerla commented 4 years ago

I'd like to save very compressed versions of certain large websites (so without any image data), but still preserve the layout of the website for visual inspection purposes. It seems like one way to do this would be to keep the image size, shape and positioning, but replace the content with white or black boxes of the same size. Is this a feature you would consider supporting?

gildas-lormeau commented 4 years ago

Actually, the purpose of SingleFile is to save the page as it is displayed in the browser, and to strip from the saved page most of the resources (images, CSS, frames, fonts, etc.) that are not displayed. It's already optimized to produce the smallest files it can without degrading saved pages.

To be honest, I'm not very enthusiastic about implementing features that remove displayed resources. I would prefer this task to be done separately, for example in a user script or an extension that runs before SingleFile saves the page.

Out of curiosity, did you consider using SingleFileZ instead of SingleFile? It produces files that are at least 33% smaller. Since it produces zip files, it's also very easy to remove resources like the images after saving the page.

nikcheerla commented 4 years ago

Thanks so much for your response! Completely understand if this feature is not on your radar at the moment, especially since it changes the displayed page.

For context, we're trying to construct a research dataset for performing machine learning on a large set of web pages. We would like to remove images to keep the size small, of course, but also to get rid of the base64 image encodings that can't easily be interpreted as HTML content. That means that SingleFileZ doesn't really work for our use case (we want the HTML pages to still be interpretable and semantically meaningful). We're probably going to end up forking this to add this option -- would you mind if we ask you a couple of questions if we get stuck anywhere along the way? You've gotten us 90% of the way there anyway, so thank you for that :)

gildas-lormeau commented 4 years ago

@nikcheerla I implemented the integration with user scripts I was referring to. I think this should solve your issue. To test whether it fulfills your needs, update SingleFile and install a user script manager, for example TamperMonkey (https://www.tampermonkey.net/), to run the user script below, which removes all the images just before the page is captured.

// ==UserScript==
// @name         Remove images
// @namespace    http://tampermonkey.net/
// @version      0.1
// @description  Remove all images before saving the page
// @author       Gildas
// @match        *://*/*
// @grant        none
// ==/UserScript==

(() => {

    "use strict";

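    // Signal to SingleFile that a user script wants to receive its events.
    dispatchEvent(new CustomEvent("single-file-user-script-init"));
    // Runs just before SingleFile captures the page.
    addEventListener("single-file-on-before-capture-request", () => {
        Array.from(document.images).forEach(image => {
            image.remove();
        });
    });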

})();
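
For the layout-preserving variant asked about at the top of the thread (boxes of the same size instead of images), a minimal sketch might look like the following. It reuses the two event names from the script above; `placeholderStyle` and `replaceImagesWithBoxes` are hypothetical helpers, not part of SingleFile, and the script would still need a `==UserScript==` header like the one above. It only handles `<img>` elements (not CSS background images), and an inline-block box only approximates the original layout.

```javascript
// Hypothetical helper (not part of SingleFile): builds the inline style for a
// blank box with the given size in CSS pixels.
function placeholderStyle(width, height, color = "white") {
    return `display:inline-block;width:${width}px;height:${height}px;background:${color};`;
}

// Hypothetical helper: swaps every image in `doc` for an empty box of the
// same rendered size, so the surrounding layout stays roughly in place.
function replaceImagesWithBoxes(doc, color = "white") {
    Array.from(doc.images).forEach(image => {
        const { width, height } = image.getBoundingClientRect();
        const box = doc.createElement("span");
        box.setAttribute("style", placeholderStyle(Math.round(width), Math.round(height), color));
        image.replaceWith(box);
    });
}

// Only wire up the SingleFile events when running inside a page.
if (typeof document !== "undefined") {
    dispatchEvent(new CustomEvent("single-file-user-script-init"));
    addEventListener("single-file-on-before-capture-request", () => {
        replaceImagesWithBoxes(document);
    });
}
```

Passing `"black"` as the second argument of `replaceImagesWithBoxes` would give black boxes instead of white ones.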

gildas-lormeau commented 4 years ago

You also have to export your settings, replace userScriptEnabled: false with userScriptEnabled: true, and import the modified settings file.
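
Editing the exported file by hand works; if many settings files need the same change, the flag could also be flipped with a small script. The sketch below assumes only that the export is JSON containing one or more `userScriptEnabled` properties; the nesting shown in the example (`profiles`, `exampleProfile`) is made up for illustration, and `enableUserScripts` is a hypothetical helper.

```javascript
// Hypothetical helper: returns a copy of `value` in which every
// `userScriptEnabled` property is set to true, wherever it is nested.
function enableUserScripts(value) {
    if (Array.isArray(value)) {
        return value.map(enableUserScripts);
    }
    if (value !== null && typeof value === "object") {
        const copy = {};
        for (const [key, child] of Object.entries(value)) {
            copy[key] = key === "userScriptEnabled" ? true : enableUserScripts(child);
        }
        return copy;
    }
    return value;
}

// Made-up settings content; a real export will contain many more fields.
const exported = JSON.parse('{"profiles":{"exampleProfile":{"userScriptEnabled":false}}}');
const updated = JSON.stringify(enableUserScripts(exported), null, 2);
```

The `updated` string is what would be written back to the settings file before importing it.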

gildas-lormeau commented 4 years ago

I added a page to the wiki that documents this: https://github.com/gildas-lormeau/SingleFile/wiki/How-to-execute-a-user-script-before-a-page-is-saved.

gildas-lormeau commented 4 years ago

@nikcheerla any feedback?

gildas-lormeau commented 4 years ago

I'm closing this issue because I consider it fixed by the user script integration I implemented. Feel free to comment on this issue or re-open it if necessary. You can also contact me if you want to fork SingleFile and need help.

nikcheerla commented 4 years ago

@gildas-lormeau

Yeah it works perfectly -- sorry for not replying sooner. Thanks so much for all the help!

Best, Nikhil

gildas-lormeau commented 4 years ago

No problem, you're welcome!