gildas-lormeau / single-file-cli

CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)
GNU Affero General Public License v3.0

How does --crawl-rewrite-rule work? #2

Open skanga opened 2 years ago

skanga commented 2 years ago

Can you please provide an example of passing --crawl-rewrite-rule to single-file-cli?

gildas-lormeau commented 2 years ago

The README has this example, which should "Save https://www.wikipedia.org and crawl its internal links with the query parameters removed from the URL":

single-file https://www.wikipedia.org --crawl-links=true --crawl-inner-links-only=true --crawl-max-depth=1 --crawl-rewrite-rule="^(.*)\\?.*$ $1"

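In this example, the rewrite rule is a regular expression followed by a space and a replacement string: everything from the "?" onward is matched and dropped, which removes the query parameters.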

skanga commented 2 years ago

So the rule is like a search/replace on a URL to be crawled before it is fetched?

gildas-lormeau commented 2 years ago

Yes, that is how it works, similar to rewrite rules in Apache, for example.
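
To make the search/replace idea concrete, here is a minimal sketch in JavaScript. It is not single-file-cli's actual code: the helper name applyRewriteRule is hypothetical, and splitting the rule at the first space is an assumption based on the shape of the README example.

```javascript
// Hypothetical helper: applies one "<regex> <replacement>" rule to a URL.
// This only illustrates the search/replace idea; it is not the tool's implementation.
function applyRewriteRule(url, rule) {
  // Assumption: the pattern and the replacement are separated by the first space.
  const separator = rule.indexOf(" ");
  const pattern = rule.slice(0, separator);
  const replacement = rule.slice(separator + 1);
  // "$1" in the replacement refers to the first capture group of the pattern.
  return url.replace(new RegExp(pattern), replacement);
}

// The rule text from the README example, written as a JavaScript string literal.
const exampleRule = "^(.*)\\?.*$ $1";

const rewrittenUrl = applyRewriteRule(
  "https://www.wikipedia.org/wiki/Main_Page?action=history",
  exampleRule
);
console.log(rewrittenUrl);
```

A URL without a "?" does not match the pattern, so it passes through unchanged.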

skanga commented 2 years ago

That's excellent.

Perhaps you could add that to the docs for those unfamiliar with how it works.

skanga commented 2 years ago

Is there also a rewrite option for the "--filename-template" parameter?

Let's say my --urls-file contains links of the form https://www.website.com/toplevel/781-Some-Street-Some-City-ST-12345/987654_xyz/ and I want to save each page to a file named like 781-Some-Street-Some-City-ST-12345-987654_xyz.html

So I might need some mechanism to strip some prefix and do some search/replace. Does such a thing exist?

The --help output says:

--filename-template  Template used to generate the output filename (see help page of the extension for more info) [string] [default: "{page-title} ({date-iso} {time-locale}).html"]

Is there a link to "help page of the extension"? Where can I find it?

gildas-lormeau commented 2 years ago

This feature does not exist for filenames. However, I agree that would be a nice feature to implement.
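
In the meantime, the filename could be computed outside the tool. The following is only a sketch of the strip-prefix-and-replace step described in the question: filenameFromUrl is a hypothetical helper, and "/toplevel/" is taken from the example URL above as the assumed prefix.

```javascript
// Hypothetical helper: derives an output filename from a page URL.
// Not part of single-file-cli; a sketch of the transformation asked about.
function filenameFromUrl(pageUrl, prefix) {
  const { pathname } = new URL(pageUrl);
  // Strip the leading prefix (assumed here to be "/toplevel/") when present.
  const rest = pathname.startsWith(prefix) ? pathname.slice(prefix.length) : pathname;
  // Drop empty segments left by leading or trailing slashes, then join with "-".
  const segments = rest.split("/").filter((segment) => segment.length > 0);
  return segments.join("-") + ".html";
}

const outputName = filenameFromUrl(
  "https://www.website.com/toplevel/781-Some-Street-Some-City-ST-12345/987654_xyz/",
  "/toplevel/"
);
console.log(outputName); // 781-Some-Street-Some-City-ST-12345-987654_xyz.html
```

A wrapper script could then save each URL individually and supply the computed name as the output file, assuming one page is saved per invocation.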