janreges / siteone-crawler

SiteOne Crawler is a website analyzer and exporter you'll ♥ as a Dev/DevOps, QA engineer, website owner or consultant. Works on all popular platforms - Windows, macOS and Linux (x64 and arm64 too).
https://crawler.siteone.io/
MIT License

How to include ALL urls in a Crawl output? #11

Open wilhere opened 3 months ago

wilhere commented 3 months ago

Hi, I've been testing out the CLI version of the tool and am absolutely loving it so far. It's a very solidly put together tool with great performance and informative output.

I have been trying to tweak the option flags to see whether I can get a particular behavior for a crawl and its generated results.

What I would like to do is have the tool observe and report ANY other URL it encounters on the target website, regardless of whether it belongs to the same site/TLD. In other words, I would like all external links to other sites, and even IP addresses, to at least be observed and included in one of the output files.

Is this possible with any regex or argument at the moment? I'm not necessarily talking about visiting and crawling all those individual URLs. In this scenario, would they perhaps end up in the sitemap?

Thanks.

janreges commented 2 months ago

Hi @wilhere,

you can specify which domains external files may be crawled from; see the --allowed-domain-for-external-files parameter.

Often, especially when using the offline website generation feature, it is convenient to set --allowed-domain-for-external-files=*, which ensures that all external JS/CSS/fonts, images or documents are also crawled/downloaded.
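For example, an invocation might look like this (just a sketch; I'm assuming the ./crawler entry point and the --url parameter here, so adjust it to however you normally start the crawler):

```bash
# Crawl example.com and also download external JS/CSS/fonts/images/documents
# referenced from its pages, from any other domain ('*' matches all domains).
./crawler --url=https://example.com/ \
  --allowed-domain-for-external-files='*'
```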

It is also possible to use the --allowed-domain-for-crawling parameter to specify which additional domains should be crawled when a link to them is found. You can use the * wildcard here as well (--allowed-domain-for-crawling=*), but the crawler will then crawl your website, gradually move on to other linked domains, and so on ad infinitum.
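A safer variant is to list only the specific extra domain(s) you actually want to crawl, for example (again just a sketch with the same assumed ./crawler and --url invocation; please check --help for the exact syntax when specifying multiple domains):

```bash
# Crawl example.com and also follow links to HTML pages on blog.example.com,
# without opening the crawl up to every domain the site links to.
./crawler --url=https://example.com/ \
  --allowed-domain-for-crawling=blog.example.com
```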

Unfortunately, at the moment there is no option to say "if you find a link to an HTML page on another website, include it in the crawl, but do not crawl other pages found in the HTML code of this page" through some parameter.

In the next few days, I will try to think about how the crawler could be extended so that this behavior can be configured via parameters.