gildas-lormeau / single-file-cli

CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)
GNU Affero General Public License v3.0

CLI option --filename-conflict-action=skip should not attempt to download page if file already exists #42

Open andrewdbate opened 2 years ago

andrewdbate commented 2 years ago

The current behavior of --filename-conflict-action=skip is as follows:

  1. download the page as usual
  2. if the file to be created already exists, do not overwrite the file.

It would be more efficient to first check if the file already exists, and then only download the page if the file does not already exist.
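A minimal sketch of the proposed ordering, in Python rather than the tool's own code. The function name `save_page_skipping_existing` and the `download` callback are hypothetical, introduced only for illustration. The sketch assumes the output path is already known before the page is fetched.

```python
from pathlib import Path
from typing import Callable


def save_page_skipping_existing(
    url: str,
    target: Path,
    download: Callable[[str], str],
) -> bool:
    """Save `url` to `target` unless `target` already exists.

    Returns True if the page was downloaded and written,
    False if the download was skipped.
    `download` is a hypothetical stand-in for whatever fetches the page.
    """
    # Check first: an existing file means no network request at all.
    if target.exists():
        return False
    html = download(url)
    target.write_text(html, encoding="utf-8")
    return True
```

With this ordering, a second run over the same list only pays the download cost for pages that have no file yet.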

This would support the following use case:

Suppose we use --urls-file to download a list of URLs. Some of those pages may fail to download (e.g., due to a network failure). In my experience, for a large list of URLs, it is likely that at least one page will fail to download. If there is an error downloading the page then no file will be created (at least this seems to be the behavior).

I was hoping to be able to use the options --filename-template="{url-pathname-flat}.html" and --filename-conflict-action=skip combined with the --urls-file option to be able to resume after an error. I was hoping that SingleFile would only attempt to download the URLs that did not already have files.

However, with the current implementation, because SingleFile attempts to download the page again, this is too slow to be practical.

gildas-lormeau commented 2 years ago

An optimization could indeed be done when the template does not contain variables depending on the content of the page. Note that by default the template contains a variable to get the title of the page.
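As a rough illustration of that condition, the check below decides whether a template can be resolved without fetching the page. It is a sketch in Python, not SingleFile's implementation. It only recognizes simple `{name}` placeholders like the one used in this issue, and the set of content-dependent variable names (here just `page-title`) is an assumption that would need to match the real variable list.

```python
import re

# Assumption: variables whose value can only be known after the page
# has been downloaded. The real list may be longer.
CONTENT_DEPENDENT_VARIABLES = frozenset({"page-title"})


def template_variables(template: str) -> set[str]:
    """Return the names of all simple {name} placeholders in the template."""
    return set(re.findall(r"\{([^{}]+)\}", template))


def needs_page_content(
    template: str,
    content_dependent: frozenset[str] = CONTENT_DEPENDENT_VARIABLES,
) -> bool:
    """True if at least one placeholder depends on the page content."""
    return not template_variables(template).isdisjoint(content_dependent)
```

Under these assumptions, `needs_page_content("{url-pathname-flat}.html")` is false, so the filename could be computed and checked for existence before any download, while a template containing `{page-title}` would still require fetching the page first.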