gildas-lormeau / single-file-cli

CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)

Option `--crawl-replace-urls` does not replace the crawled URLs #131

Open VAdri opened 5 days ago

VAdri commented 5 days ago

The description of the option --crawl-replace-urls says:

Replace URLs of saved pages with relative paths of saved pages on the filesystem

So if I understand correctly, in the HTML extracted by single-file, all the URLs crawled with the option --crawl-links should be replaced by the relative paths of the files they were saved to on the filesystem.
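
For instance (a hypothetical illustration; the actual relative path depends on how single-file names the saved files), I would expect a link in the saved page such as:

<a href="https://example.com/about">About</a>

to be rewritten to something like:

<a href="about.html">About</a>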

However, when I try this command, I get only the original URLs:

./single-file-x86_64-linux https://example.com --crawl-links=true --crawl-max-depth=1 --crawl-inner-links-only=false --crawl-replace-urls=true

I also tried this command from the README, which uses the option --crawl-rewrite-rule, but it did not work either:

./single-file-x86_64-linux https://www.wikipedia.org --crawl-links=true --crawl-inner-links-only=true --crawl-max-depth=1 --crawl-rewrite-rule="^(.*)\\?.*$ $1"
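
As far as I understand, the rule is a regular expression followed by a replacement, so ^(.*)\?.*$ $1 should strip the query string from each crawled URL, e.g. (hypothetical URL):

https://www.wikipedia.org/somepage?x=1 -> https://www.wikipedia.org/somepage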

I was able to make it work on v2.0.0 but not since v2.0.2.

gildas-lormeau commented 1 day ago

In the first example, there are no inner links. The second example does not work anymore (I'm pretty sure it used to work in the past) because there are no links with a resolved URL starting with "https://www.wikipedia.org/" in the page; the links there resolve to other subdomains such as https://en.wikipedia.org/.

VAdri commented 13 hours ago

Is it supposed to work only for inner links? I did set the option --crawl-inner-links-only=false in my first example.

But even with inner links only, it apparently doesn't do the trick:

./single-file https://matklad.github.io/2024/09/23/what-is-io-uring.html --crawl-links=true --crawl-max-depth=1 --crawl-inner-links-only=true --crawl-replace-urls=true

gildas-lormeau commented 11 hours ago

That was not working because the option --crawl-replace-urls had no effect when written in lowercase. I fixed this issue in the latest version, which I've just published.