gildas-lormeau / single-file-cli

CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)
GNU Affero General Public License v3.0
540 stars 58 forks

CLI option --crawl-replace-urls does not do anything #40

Open andrewdbate opened 2 years ago

andrewdbate commented 2 years ago

When I run this command:

single-file --output-directory=outdir --dump-content=false --filename-template="{url-pathname-flat}.html" --crawl-links --crawl-save-session=session.json --crawl-replace-urls=true https://en.wikipedia.org/wiki/Thomas_Lipton

none of the files in the outdir directory have had URLs of saved pages replaced with the relative paths of the other saved pages in outdir.

When I run this command, _wiki_Thomas_Lipton.html is downloaded to outdir. This is the file for the URL from which the crawl started.

The Wikipedia page https://en.wikipedia.org/wiki/Thomas_Lipton has a link to https://en.wikipedia.org/wiki/Self-made_man in the first sentence. This page was also downloaded by SingleFile as _wiki_Self-made_man.html.

I was expecting the href to https://en.wikipedia.org/wiki/Self-made_man in _wiki_Thomas_Lipton.html to be rewritten to _wiki_Self-made_man.html but it was not. Am I using the CLI options incorrectly?
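To illustrate the expected behavior, here is a minimal sketch (not SingleFile's actual implementation) assuming the {url-pathname-flat} template simply replaces "/" in the URL pathname with "_", and that --crawl-replace-urls rewrites any href whose target was also crawled to the corresponding local filename:

```python
from urllib.parse import urlparse

def flat_filename(url):
    # Assumed {url-pathname-flat} behavior: '/' in the pathname becomes '_'
    return urlparse(url).path.replace("/", "_") + ".html"

crawled = {
    "https://en.wikipedia.org/wiki/Thomas_Lipton",
    "https://en.wikipedia.org/wiki/Self-made_man",
}
saved = {url: flat_filename(url) for url in crawled}

def rewrite_href(href):
    # If the link target was also crawled, point at the local copy instead
    return saved.get(href, href)

print(rewrite_href("https://en.wikipedia.org/wiki/Self-made_man"))
# expected: _wiki_Self-made_man.html
```

Under these assumptions, the href in _wiki_Thomas_Lipton.html would become _wiki_Self-made_man.html, which is what the issue reports does not happen.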

gildas-lormeau commented 2 years ago

Did you interrupt the command? URLs are replaced when all the pages have been crawled.

andrewdbate commented 2 years ago

No, I didn't interrupt the command.

amirrh6 commented 2 years ago

Hi @gildas-lormeau! First, I'd like to express my appreciation for this amazing extension.

I faced the very same issue @andrewdbate discussed.

I tested https://xmrig.com because of its simple hierarchy.

The following internal links on https://xmrig.com should be considered:

(screenshot: internal links of https://xmrig.com)

This is the command I ran:

./single-file --output-directory=saved --filename-template="{url-pathname-flat}.html" --crawl-links=true --crawl-replace-urls=true --filename-conflict-action=skip https://xmrig.com

As a result, the following files were created inside the saved directory (as expected):

Everything has been fine so far, but the links inside these files are not rewritten to relative file-system links.

You may find these files useful:

saved.zip

Thanks

NanoBaker commented 5 months ago

I'm having the same issue: each web page is downloaded successfully, but the links between the pages still point back to the original website. This is the command I ran:

docker run -v $(pwd):/usr/src/app/out singlefile "https://fiction.live/stories/Fiction-live-Software-Update/S46jksooFQWqqMAsY/home" --dump-content=false --crawl-links=true --crawl-inner-links-only=true --crawl-no-parent=true --crawl-max-depth=1 --crawl-replace-urls=true
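For anyone wanting to confirm the reported behavior, here is a small hypothetical helper (not part of SingleFile) that scans saved HTML for hrefs still pointing at the crawled site's origin; after a successful --crawl-replace-urls run these should have become local relative paths:

```python
import re

def unrewritten_links(html, origin):
    # Return href values that still point at the crawled site's origin
    return re.findall(rf'href="({re.escape(origin)}[^"]*)"', html)

# Example using the xmrig.com case reported above
html = '<a href="https://xmrig.com/download">Download</a> <a href="_download.html">local</a>'
print(unrewritten_links(html, "https://xmrig.com"))
# ['https://xmrig.com/download']
```

A non-empty result for a crawled page indicates its links were not replaced.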