Closed (nikcheerla closed this issue 4 years ago)
Actually, the purpose of SingleFile is to save the page as it is displayed in the browser, removing most of the resources (images, CSS, frames, fonts, etc.) that are not displayed. It's already optimized to produce the smallest files it can without degrading the saved pages.
To be honest, I'm not very enthusiastic about implementing features that remove displayed resources. I would prefer this task to be done separately, for example in a user script or an extension that runs before SingleFile saves the page.
Out of curiosity, did you consider using SingleFileZ instead of SingleFile? It produces files that are at least 33% smaller. Since it produces zip files, it's also very easy to remove resources like the images after saving the page.
Thanks so much for your response! Completely understand if this feature is not on your radar at the moment, especially since it changes the displayed page.
For context, we're trying to construct a research dataset for performing machine learning on a large set of web pages. We would like to remove images, of course, to keep the file size small, but also to get rid of the base64 image encodings that can't easily be interpreted as HTML content. That means SingleFileZ doesn't really work for our use case (we want the HTML pages to still be interpretable and semantically meaningful). We're probably going to end up forking this to add this option. Would you mind if we asked you a couple of questions if we get stuck along the way? You've gotten us 90% of the way there anyway, so thank you for that :)
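For illustration, the base64 clean-up described above could also be done as a post-processing step on an already saved page. This is only a minimal sketch, not part of SingleFile: `stripBase64Images` is a hypothetical helper, and it assumes the images are embedded as `data:image/...;base64,...` URIs. Data URIs with extra parameters or non-base64 payloads are left untouched.

```javascript
// Hypothetical post-processing helper (not a SingleFile API).
// Assumption: embedded images look like data:image/<type>;base64,<payload>.
function stripBase64Images(html) {
  // Match the whole data URI and replace it with an empty string,
  // so that src="data:image/png;base64,..." becomes src="".
  const dataUriPattern = /data:image\/[a-z0-9.+-]+;base64,[A-Za-z0-9+\/=]+/gi;
  return html.replace(dataUriPattern, "");
}
```

The surrounding markup (tags, attributes, alt text) is kept, so the HTML stays readable as text.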
@nikcheerla I implemented the integration with userscripts I was referring to. I think this should solve your issue. To test whether it fulfills your needs, you have to update SingleFile and install, for example, Tampermonkey (https://www.tampermonkey.net/) to run the userscript below, which removes all the images just before the page is captured.
// ==UserScript==
// @name Remove images
// @namespace http://tampermonkey.net/
// @version 0.1
// @description Remove all images before saving the page
// @author Gildas
// @match *://*/*
// @grant none
// ==/UserScript==
(() => {
  "use strict";
  dispatchEvent(new CustomEvent("single-file-user-script-init"));
  addEventListener("single-file-on-before-capture-request", () => {
    Array.from(document.images).forEach(image => {
      image.remove();
    });
  });
})();
You also have to export your settings, replace userScriptEnabled: false with userScriptEnabled: true, and import the modified settings file.
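If editing the exported file by hand is inconvenient, the same change could be scripted. This is a rough sketch under assumptions: the exported settings are taken to be JSON, their nesting is not described in this thread, so the hypothetical helper `enableUserScripts` simply walks the whole object and sets every `userScriptEnabled` key it finds to `true`.

```javascript
// Hypothetical helper (not part of SingleFile). Assumption: the exported
// settings file is JSON; its exact structure is unknown here, so every
// "userScriptEnabled" key found at any depth is switched on.
function enableUserScripts(node) {
  if (Array.isArray(node)) {
    node.forEach(enableUserScripts);
  } else if (node !== null && typeof node === "object") {
    for (const key of Object.keys(node)) {
      if (key === "userScriptEnabled") {
        node[key] = true;
      } else {
        enableUserScripts(node[key]);
      }
    }
  }
  return node;
}

// Possible usage with Node.js (the file names are placeholders):
// const fs = require("fs");
// const settings = JSON.parse(fs.readFileSync("settings-export.json", "utf8"));
// fs.writeFileSync("settings-modified.json", JSON.stringify(enableUserScripts(settings), null, 2));
```

The modified file can then be imported back as described above.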
@nikcheerla any feedback?
I'm closing this issue because I consider it fixed by the user script integration I implemented. Feel free to comment on this issue or re-open it if necessary. You can also contact me if you want to fork SingleFile and need help.
@gildas-lormeau
Yeah it works perfectly -- sorry for not replying sooner. Thanks so much for all the help!
Best, Nikhil
No problem, you're welcome!
I'd like to save very compressed versions of certain large websites (so without any image data), but still preserve the layout of the website for visual inspection purposes. It seems like one way to do this would be to keep the image size, shape and positioning, but replace the content with white or black boxes of the same size. Is this a feature you would consider supporting?
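The placeholder idea described here could be sketched as a variant of the userscript from this thread, reusing its two events (`single-file-user-script-init` and `single-file-on-before-capture-request`). The helper names `placeholderStyle` and `replaceImagesWithBoxes` are hypothetical, and the approach is an assumption rather than a SingleFile feature: each image is swapped for an inline-block box with the image's rendered width and height. Layout that depends on the image's own CSS rules (floats, positioning, classes) may not be reproduced exactly, and CSS background images are not handled. The Tampermonkey metadata block would still be needed on top.

```javascript
// Hypothetical helper: inline CSS for a box occupying the same space as an image.
function placeholderStyle(width, height, color = "#000") {
  return `display:inline-block;width:${width}px;height:${height}px;background:${color};`;
}

// Hypothetical helper: replace every image in `doc` with a plain colored box
// of the same rendered size.
function replaceImagesWithBoxes(doc, color) {
  // Array.from takes a snapshot, since document.images is a live collection.
  Array.from(doc.images).forEach(image => {
    const { width, height } = image.getBoundingClientRect();
    const box = doc.createElement("span");
    box.setAttribute("style", placeholderStyle(Math.round(width), Math.round(height), color));
    image.replaceWith(box);
  });
}

// Only hook into the page when running in a browser.
if (typeof document !== "undefined") {
  dispatchEvent(new CustomEvent("single-file-user-script-init"));
  addEventListener("single-file-on-before-capture-request", () => {
    replaceImagesWithBoxes(document, "#000");
  });
}
```

Passing `"#fff"` instead of `"#000"` would give white boxes rather than black ones.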