janreges / siteone-crawler

SiteOne Crawler is a cross-platform website crawler and analyzer for SEO, security, accessibility, and performance optimization—ideal for developers, DevOps, QA engineers, and consultants. Supports Windows, macOS, and Linux (x64 and arm64).
https://crawler.siteone.io/
MIT License

Output question - multiple reports per /lang-code/ #2

Closed fernstedt closed 8 months ago

fernstedt commented 1 year ago

Hello and thank you for a great tool.

I am crawling a website that has versions for 130 countries, e.g. www.URL.com/en-uk/, each with almost the same pages plus some local content.

I am trying to figure out a way, other than writing a bash script, to derive the output path from part of the URL, e.g.:

output=$country/result.html

I could not find a way to do this from the tool itself (from what I can see), so I am seeking guidance. Otherwise I need to run 130 separate crawls, instead of the tool saving different countries into different folders for me.

I can write a bash script that loops over a file of country codes and substitutes each one into the command.

But if this tool could handle variable output paths itself, that would be great.

janreges commented 1 year ago

Hi,

I'm glad my crawler is helping you :)

The crawler currently crawls the entire website on a given domain, or even on other domains, based on the --allowed-domain* options.

You can allow or deny crawling of URLs using --include-regex or --exclude-regex.
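As a quick illustration of how such a pattern filters URL paths (the paths below are hypothetical, and grep is used here only to stand in for the crawler's regex matching):

```shell
# The pattern ^/en\-US mirrors the kind of expression you would
# pass to --include-regex: anchored at the start of the URL path,
# with the hyphen escaped so it is matched literally.
printf '%s\n' "/en-US/products" "/en-UK/products" "/cs-CZ/home" \
    | grep -E '^/en\-US'
# prints only: /en-US/products
```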

If you want to generate a separate report for each group of subpages starting with a language code, I believe this bash script will do exactly what you want.

Btw, I just deployed a new and very nice version of the HTML report. I hope you will be excited ;)

```bash
#!/bin/bash

# Language/country codes to crawl (extend to your full list)
COUNTRIES=("en-US" "en-UK" "cs-CZ")

for COUNTRY in "${COUNTRIES[@]}"
do
    # Escape hyphens so the code is matched literally in the regex
    COUNTRY_ESCAPED=${COUNTRY//-/\\-}

    # Crawl one language section and write its own HTML report
    ./swoole-cli crawler.php \
        --url='https://your.domain/'"$COUNTRY" \
        --include-regex='/^\/'"$COUNTRY_ESCAPED"'/' \
        --output-html-file='tmp/report.'"$COUNTRY.html"
done
```
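For clarity on the escaping step above: ${COUNTRY//-/\\-} is standard bash parameter expansion that replaces every hyphen with a backslash-escaped hyphen, so the code is treated literally inside the regex rather than as a range operator. A minimal check:

```shell
COUNTRY="en-US"
# Replace each "-" with "\-" via bash parameter expansion
COUNTRY_ESCAPED=${COUNTRY//-/\\-}
echo "$COUNTRY_ESCAPED"
# prints: en\-US
```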