Open skanga opened 2 years ago
There is this example in the README page which should "Save https://www.wikipedia.org and crawl its internal links with the query parameters removed from the URL":
single-file https://www.wikipedia.org --crawl-links=true --crawl-inner-links-only=true --crawl-max-depth=1 --crawl-rewrite-rule="^(.*)\\?.*$ $1"
The rewrite rule removes the query parameters in this example.
So the rule is like a search/replace on a URL to be crawled before it is fetched?
Yes, this is how it works, similarly to rewrite rules in Apache for example.
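For readers unfamiliar with this kind of rule, here is a rough Python sketch of the search/replace idea described above. It is not single-file's actual implementation: the function name `apply_rewrite_rule` is made up for illustration, and it assumes a rule is a regular expression and a replacement separated by a single space, as the README example suggests.

```python
import re

# The rule from the README example, as the program would see it once the
# shell has removed the extra backslash: "<pattern> <replacement>".
RULE = r"^(.*)\?.*$ $1"


def apply_rewrite_rule(url: str, rule: str) -> str:
    """Apply a 'PATTERN REPLACEMENT' rule to a URL (illustrative only)."""
    # Assumption: pattern and replacement are separated by the first space.
    pattern, _, replacement = rule.partition(" ")
    # Translate JavaScript-style "$1" group references into Python's "\g<1>".
    py_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return re.sub(pattern, py_replacement, url)


# A URL with a query string loses everything from the "?" onward.
print(apply_rewrite_rule("https://www.wikipedia.org/wiki/Main_Page?action=edit", RULE))
# → https://www.wikipedia.org/wiki/Main_Page

# A URL without a query string does not match the pattern and is unchanged.
print(apply_rewrite_rule("https://www.wikipedia.org/", RULE))
# → https://www.wikipedia.org/
```

In this sketch the rewritten URL is what would be fetched, so two links that differ only in their query parameters end up pointing at the same page.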
That's excellent.
Perhaps you could add that to the docs for those unfamiliar with how it works.
Is there also a rewrite option for the "--filename-template" parameter?
Let's say that my --urls-file has links in the format https://www.website.com/toplevel/781-Some-Street-Some-City-ST-12345/987654_xyz/ and I want to save each page to a file named like this: 781-Some-Street-Some-City-ST-12345-987654_xyz.html
So I would need some mechanism to strip a prefix and do a search/replace on the rest. Does such a thing exist?
The --help output says: --filename-template Template used to generate the output filename (see help page of the extension for more info) [string] [default: "{page-title} ({date-iso} {time-locale}).html"]
Is there a link to "help page of the extension"? Where can I find it?
This feature does not exist for filenames. However, I agree that it would be a nice feature to implement.
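Until something like that exists, the name could be computed outside single-file, for example to rename the saved files afterwards. The following is only a sketch of the prefix stripping and search/replace described above: the helper `filename_from_url` and its `prefix` parameter are hypothetical and not part of single-file.

```python
from urllib.parse import urlparse


def filename_from_url(url: str, prefix: str = "/toplevel/") -> str:
    """Derive an output filename from a URL path (hypothetical helper)."""
    path = urlparse(url).path
    # Strip the leading prefix if the path starts with it.
    if path.startswith(prefix):
        path = path[len(prefix):]
    # Drop empty segments (e.g. from a trailing slash) and join with "-".
    segments = [segment for segment in path.split("/") if segment]
    return "-".join(segments) + ".html"


url = "https://www.website.com/toplevel/781-Some-Street-Some-City-ST-12345/987654_xyz/"
print(filename_from_url(url))
# → 781-Some-Street-Some-City-ST-12345-987654_xyz.html
```

Running this over each line of the URLs file would give one target filename per link.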
Can you please provide an example of passing --crawl-rewrite-rule to single-file-cli?