janreges / siteone-crawler

SiteOne Crawler is a website analyzer and exporter you'll ♥ as a Dev/DevOps, QA engineer, website owner or consultant. Works on all popular platforms - Windows, macOS and Linux (x64 and arm64 too).
https://crawler.siteone.io/
MIT License

SiteOne Crawler

SiteOne Crawler is a very useful and easy-to-use tool you'll ♥ as a Dev/DevOps, website owner or consultant. Works on all popular platforms - Windows, macOS and Linux (x64 and arm64 too).

It will crawl your entire website in depth, analyze and report problems, show useful statistics and reports, generate an offline version of the website, generate sitemaps or send reports via email. Watch a detailed video with a sample report for the astro.build website.

This crawler can be used as a command-line tool (see releases and video), or you can use a multi-platform desktop application with graphical interface (see video about app).

I also recommend looking at the project website crawler.siteone.io.

GIF animation of the crawler in action (also available as a video):

Features

In short, the main benefits can be summarized in these points:

The following features are summarized in greater detail:

Crawler

Dev/DevOps assistant

Analyzer

Reporter

Offline website generator

Sitemap generator

Don't hesitate to try it. You will love it as much as we do! ♥

For active contributors

Installation

Ready-to-use releases

You can download ready-to-use releases from GitHub releases for all major platforms (Linux, Windows, macOS, arm64).

Unpack the downloaded archive to find the crawler executable (crawler.bat on Windows), then run it with ./crawler --url=https://my.domain.tld.

Note for Windows users: use the Cygwin-based release *-win-x64.zip only if you can't use WSL (Ubuntu/Debian), which is the recommended option. If you really have to use the Cygwin version, set --workers=1 for higher stability.

Note for macOS users: if macOS refuses to start the crawler from your Downloads folder, move the entire crawler folder via the terminal to another location, for example your home folder ~.

Linux (x64)

Installation is easiest on most Linux (x64) distributions.

git clone https://github.com/janreges/siteone-crawler.git
cd siteone-crawler

# run crawler with basic options
./crawler --url=https://my.domain.tld

Windows (x64)

If using Windows, the best choice is to use Ubuntu or Debian in WSL.

Otherwise, you can download swoole-cli-v4.8.13-cygwin-x64.zip from Swoole releases and use precompiled Cygwin-based bin/swoole-cli.exe.

A tested, working Windows command looks like this (adjust the paths to your swoole-cli.exe and src\crawler.php):

c:\Tools\swoole-cli-v4.8.13-cygwin-x64\bin\swoole-cli.exe C:\Tools\siteone-crawler\src\crawler.php --url=https://www.siteone.io/

NOTICE: Cygwin does not support STDERR with rewritable lines in the console. Therefore, the output is not as beautiful as on Linux or macOS.

macOS (arm64, x64)

If using macOS with an arm64 M1/M2 CPU, download the arm64 version swoole-cli-v4.8.13-macos-arm64.tar.xz, unpack it and use its precompiled swoole-cli.

If using macOS with an Intel CPU (x64), download the x64 version swoole-cli-v4.8.13-macos-x64.tar.xz, unpack it and use its precompiled swoole-cli.

Linux (arm64)

If using arm64 Linux, you can download swoole-cli-v4.8.13-linux-arm64.tar.xz and use its precompiled swoole-cli.

Usage

To run the crawler, execute the crawler executable file from the command line and provide the required arguments:

Basic example

./crawler --url=https://mydomain.tld/ --device=mobile

Fully-featured example

./crawler --url=https://mydomain.tld/ \
  --output=text \
  --workers=2 \
  --memory-limit=1024M \
  --timeout=5 \
  --proxy=proxy.mydomain.tld:8080 \
  --http-auth=myuser:secretPassword123 \
  --user-agent="My User-Agent String" \
  --extra-columns="DOM,X-Cache(10),Title(40),Keywords(50),Description(50>)" \
  --accept-encoding="gzip, deflate" \
  --url-column-size=100 \
  --max-queue-length=3000 \
  --max-visited-urls=10000 \
  --max-url-length=5000 \
  --include-regex="/^.*\/technologies.*/" \
  --include-regex="/^.*\/fashion.*/" \
  --ignore-regex="/^.*\/downloads\/.*\.pdf$/i" \
  --analyzer-filter-regex="/^.*$/i" \
  --remove-query-params \
  --add-random-query-params \
  --show-scheme-and-host \
  --do-not-truncate-url \
  --output-html-report=tmp/myreport.html \
  --output-json-file=/dir/report.json \
  --output-text-file=/dir/report.txt \
  --add-timestamp-to-output-file \
  --add-host-to-output-file \
  --offline-export-dir=tmp/mydomain.tld \
  --replace-content='/<foo[^>]+>/ -> <bar>' \
  --ignore-store-file-error \
  --sitemap-xml-file=/dir/sitemap.xml \
  --sitemap-txt-file=/dir/sitemap.txt \
  --sitemap-base-priority=0.5 \
  --sitemap-priority-increase=0.1 \
  --mail-to=your.name@my-mail.tld \
  --mail-to=your.friend.name@my-mail.tld \
  --mail-from=crawler@my-mail.tld \
  --mail-from-name="SiteOne Crawler" \
  --mail-subject-template="Crawler Report for %domain% (%date%)" \
  --mail-smtp-host=smtp.my-mail.tld \
  --mail-smtp-port=25 \
  --mail-smtp-user=smtp.user \
  --mail-smtp-pass=secretPassword123
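
When the crawler is started from a script (for example a scheduled job), a long command line like the one above can be assembled from a configuration mapping. The following is only a minimal Python sketch: the helper name build_crawler_command and the mapping layout are illustrative assumptions and not part of the crawler. It uses only options that appear in the example above. Boolean values become bare switches, and lists become repeated options such as --include-regex.

```python
import shlex
import subprocess
from typing import Mapping, Sequence, Union

OptionValue = Union[str, int, float, bool, Sequence[str]]


def build_crawler_command(
    url: str,
    options: Mapping[str, OptionValue],
    executable: str = "./crawler",
) -> list[str]:
    """Build an argv list for the crawler from a mapping of option names to values.

    Hypothetical helper for illustration; it does not validate option names.
    """
    argv = [executable, f"--url={url}"]
    for name, value in options.items():
        flag = f"--{name}"
        if isinstance(value, bool):
            # Boolean options are bare switches: present when True, omitted when False.
            if value:
                argv.append(flag)
        elif isinstance(value, (list, tuple)):
            # Repeatable options are emitted once per value.
            argv.extend(f"{flag}={item}" for item in value)
        else:
            argv.append(f"{flag}={value}")
    return argv


if __name__ == "__main__":
    command = build_crawler_command(
        "https://mydomain.tld/",
        {
            "workers": 2,
            "timeout": 5,
            "include-regex": [r"/^.*\/technologies.*/", r"/^.*\/fashion.*/"],
            "remove-query-params": True,
            "output-html-report": "tmp/myreport.html",
        },
    )
    print(shlex.join(command))  # shell-quoted form, handy for logging
    # subprocess.run(command, check=True)  # uncomment to actually start the crawl
```

Passing the arguments as a list avoids shell quoting problems with regular expressions and values that contain spaces.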

Arguments

For a clearer list, I recommend going to the documentation: https://crawler.siteone.io/configuration/command-line-options/

Basic settings

Output settings

Resource filtering

By default, the crawler downloads all content it comes across: HTML pages, images, documents, JavaScript, stylesheets, fonts, simply everything it sees. These options let you disable individual asset types and remove them, along with all related content, from the HTML.

For example, disabling JavaScript is very useful on modern websites that use server-side rendering (SSR), e.g. React with Next.js, because they work fine without JavaScript for content browsing and navigation.

Disabling JavaScript is particularly useful when exporting websites built e.g. on React to an offline form (without an HTTP server), where it is almost impossible to get the website to work from an arbitrary location on disk using only the file:// protocol.
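
After an offline export made with JavaScript disabled, it can be handy to verify that no script tags are left in the exported pages. The following Python sketch is not part of the crawler; the function name find_pages_with_scripts is hypothetical, and the default directory tmp/mydomain.tld is only assumed to be the value given to --offline-export-dir. It also assumes exported pages use the .html extension.

```python
import re
import sys
from pathlib import Path

# Matches the start of any <script> tag, regardless of letter case.
SCRIPT_TAG = re.compile(rb"<script\b", re.IGNORECASE)


def find_pages_with_scripts(export_dir: str) -> list[Path]:
    """Return exported HTML files that still contain at least one <script> tag."""
    root = Path(export_dir)
    return sorted(
        page
        for page in root.rglob("*.html")
        if page.is_file() and SCRIPT_TAG.search(page.read_bytes())
    )


if __name__ == "__main__":
    # Assumed default: the directory passed to --offline-export-dir.
    target = sys.argv[1] if len(sys.argv) > 1 else "tmp/mydomain.tld"
    leftovers = find_pages_with_scripts(target)
    for page in leftovers:
        print(page)
    # A non-zero exit code signals that some pages still reference scripts.
    sys.exit(1 if leftovers else 0)
```

The check only looks for script tags; inline event handlers such as onclick attributes would need a separate pattern.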

Advanced crawler settings

File export settings

Mailer options

NOTICE: For now, only SMTP without encryption is supported, typically running on port 25. If you are interested in this tool, we can also implement secure SMTP support, or you can simply send me a pull request with a lightweight implementation.

Upload options

If necessary, you can also point --upload-to at your own endpoint for saving the HTML report.

How to implement your own endpoint: it needs to accept a POST request in which htmlBody is the gzipped HTML body of the report, retention is the retention value, and password is an optional password protecting access to the HTML. The response must be JSON with a url key containing the URL where the report is available.
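
The contract above can be sketched with the Python standard library. This is only an illustration under stated assumptions, not the reference implementation: the request body is assumed to be application/x-www-form-urlencoded (the encoding is not specified above, so check what the crawler actually sends), reports are kept in memory, and BASE_URL, REPORTS, parse_form and store_report are hypothetical names. Serving the stored report and enforcing the retention and password values are left out.

```python
import gzip
import json
import secrets
import zlib
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import unquote_to_bytes

REPORTS: dict[str, dict] = {}       # in-memory store; a real service would persist reports
BASE_URL = "http://localhost:8080"  # hypothetical public address of this endpoint


def _decode(part: bytes) -> bytes:
    """Undo form encoding ('+' and %XX) while keeping the result as raw bytes."""
    return unquote_to_bytes(part.replace(b"+", b" "))


def parse_form(body: bytes) -> dict[str, bytes]:
    """Parse a form-urlencoded body (assumed encoding) without corrupting binary values."""
    fields = {}
    for pair in body.split(b"&"):
        if pair:
            name, _, value = pair.partition(b"=")
            fields[_decode(name).decode("utf-8")] = _decode(value)
    return fields


def store_report(fields: dict[str, bytes]) -> dict[str, str]:
    """Store one uploaded report and return the JSON-serialisable response."""
    html = gzip.decompress(fields["htmlBody"])  # htmlBody carries the gzipped HTML report
    report_id = secrets.token_urlsafe(8)
    REPORTS[report_id] = {
        "html": html,
        "retention": fields.get("retention", b"").decode("utf-8"),
        "password": fields.get("password", b"").decode("utf-8") or None,  # optional
    }
    return {"url": f"{BASE_URL}/{report_id}"}


class UploadHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        try:
            response, status = store_report(parse_form(self.rfile.read(length))), 200
        except (KeyError, OSError, EOFError, ValueError, zlib.error):
            # Missing htmlBody, invalid gzip data or undecodable field values.
            response, status = {"error": "invalid upload"}, 400
        payload = json.dumps(response).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), UploadHandler).serve_forever()
```

Keeping the storage logic in store_report, separate from the HTTP handler, makes it easy to swap the in-memory dictionary for files or a database.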

Offline exporter options

Sitemap options

Expert options

Roadmap

If you have any suggestions or feature requests, please open an issue on GitHub. We'd love to hear from you!

Contributions with improvements, bug fixes, and new features are welcome. Please open a pull request :-)

Motivation to create this tool

If you are interested in the author's motivation for creating this tool, read it on the project website.

Disclaimer

Please use responsibly and ensure that you have the necessary permissions when crawling websites. Some sites may have rules against automated access detailed in their robots.txt.

The author is not responsible for any consequences caused by inappropriate use or deliberate misuse of this tool.

License

This work is licensed under the MIT License.