Closed devinat1 closed 1 month ago
What I ideally want is a directory for each, with CSS, HTML, and JS content for each site, and I wish to serve these sites myself.
Because of the way the clones were saved, opening one site opens a different one: if I open the HTML for worldbank.org, it actually shows the HTML content of HP.
Hi @devinat1,
First of all, I recommend adding logging to your script so that it prints the final commands with all the parameters it sets.
In particular, to help effectively, I need to know the final crawler --xyz command that your script is trying to run. From that, I can probably find the cause of any problem very quickly.
By the way, I recommend adding --allowed-domain-for-external-files=* so that external JS, styles, fonts, and images are also loaded from other domains. This is usually necessary because many sites load, for example, JS libraries from a CDN.
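The logging suggestion above could be sketched roughly as follows. This is only an illustration, assuming the wrapper script is written in Python and starts the crawler with an argument list; the helper names `build_crawler_command` and `format_command` are hypothetical, and only the flags already mentioned in this thread are used.

```python
import shlex


def build_crawler_command(url, export_dir, max_urls=500):
    """Assemble the crawler argv as a list (hypothetical helper, no shell involved)."""
    return [
        "crawler",
        f"--url={url}",
        f"--offline-export-dir={export_dir}",
        f"--max-visited-urls={max_urls}",
        "--allowed-domain-for-external-files=*",
    ]


def format_command(argv):
    """Render argv as one copy-pasteable shell line, quoting where needed."""
    return shlex.join(argv)


if __name__ == "__main__":
    argv = build_crawler_command("https://www.worldbank.org/", "tmp/worldbank.org")
    # Log the exact command before running it, so it can be pasted into an issue.
    print("Running:", format_command(argv))
    # subprocess.run(argv, check=True)  # uncomment (and import subprocess) to start the crawl
```

Printing the command through `shlex.join` keeps the `*` quoted, so the logged line can be pasted into a shell without the wildcard being expanded.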
For example, here is the command for worldbank.org (limited to 500 URLs):
./crawler \
--url=https://www.worldbank.org/ \
--max-visited-urls=500 \
--offline-export-dir=tmp/worldbank.org \
--allowed-domain-for-external-files=*
And here is the tmp/worldbank.org directory content ... the exported website works nicely. The directories starting with an underscore (_) are external domains from which external assets were downloaded, so that the site works as well as possible offline and contains all its JavaScript, images, fonts, etc.
Hi @janreges, thank you for your feedback. Here is one of the commands that my crawler script is running: crawler '--url=office365.com', '--offline-export-dir=/home/bond/Desktop/agent-collector/utils/website-scraper/../../data/synthetic/clones/', '--workers=10', '--max-visited-urls=500', '--allowed-domain-for-external-files=*', '--ignore-robots-txt'
The same issue occurs with your suggestion of allowing domains for external files.
This is the exact output I am getting upon running the scraper:
#### #### #####
#### #### #######
#### ### #### #########
#### ###### #### ###### ####
###################### ##### ####
####### ####### ##### ####
####### ####### # ####
###################### ####
#### ###### #### ####
#### ## #### ####
#### #### ##################
#### #### ##################
==================================================
# SiteOne Crawler, v1.0.8.20240824 #
# Author: jan.reges@siteone.cz #
==================================================
Progress report | URL | Status | Type | Time | Size | Access. | Best pr.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1/2 | 50% |>>>>> | / | 301 | Redirect | 423 ms | 133 B | |
2/2 | 100% |>>>>>>>>>>| https://azure.microsoft.com/en-us/ | 200 | HTML | 641 ms | 493 kB | | 2/5
Redirected URLs
---------------
Status | Redirected URL | Target URL | Found at URL
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
301 | / | https://azure.microsoft.com/en-us/ |
404 URLs
--------
No 404 URLs found.
SSL/TLS info
------------
Info | Text
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Issuer | C=US, O=Microsoft Corporation, CN=Microsoft Azure RSA TLS Issuing CA 08
Subject | C=US, ST=WA, L=Redmond, O=Microsoft Corporation, CN=gamedev.microsoft.com
Valid from | Sep 10 18:13:29 2024 GMT (VALID already 23.3 day(s))
Valid to | Sep 5 18:13:29 2025 GMT (VALID still for 336.7 day(s))
Supported protocols | TLSv1.2, TLSv1.3
RAW certificate output | Certificate:
Data:
Version: 3 (0x2)
Serial Number:
33:00:6c:7f:df:…6:a6:b2:28:28:
8a:f7:d1:23:5c:b9:bd:87
RAW protocols output | Connecting to 20.231.239.246
depth=2 C=US, O=DigiCert Inc, OU=www.digicert.com, CN=DigiCert Global…s not sent
Verify return code: 0 (ok)
---
DONE
TOP fastest URLs
----------------
No fast URLs fastest than 1 second(s) found.
TOP slowest URLs
----------------
No slow URLs slowest than 0.01 second(s) found.
SEO metadata
------------
No URLs.
OpenGraph metadata
------------------
No URLs with OpenGraph data (og:* or twitter:* meta tags).
Heading structure
-----------------
No URLs to analyze heading structure.
HTTP headers
------------
Header | Occurs | Unique | Values preview | Min value | Max value
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Connection | 1 | 1 | close | |
Content-Length | 1 | - | [ignored generic values] | 0 B | 0 B
Content-Type | 1 | 1 | text/html | |
Date | 1 | - | [ignored generic values] | 2024-09-29 | 2024-09-29
Location | 1 | 1 | https://azure.microsoft.com/en-us/ | |
Server | 1 | 1 | Kestrel | |
Strict-Transport-Security | 1 | 1 | max-age=31536000 | |
HTTP header values
------------------
Header | Occurs | Value
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Connection | 1 | close
Content-Type | 1 | text/html
Location | 1 | https://azure.microsoft.com/en-us/
Server | 1 | Kestrel
Strict-Transport-Security | 1 | max-age=31536000
Best practices
--------------
Analysis name | OK | Notice | Warning | Critical
--------------------------------------------------------------------------------
Large inline SVGs (> 5120 B) | 2 | 0 | 0 | 0
Invalid inline SVGs | 2 | 0 | 0 | 0
Duplicate inline SVGs (> 5 and > 1024 B) | 2 | 0 | 0 | 0
DOM depth (> 30) | 0 | 0 | 1 | 0
Heading structure | 1 | 0 | 1 | 0
Title uniqueness (> 10%) | 0 | 0 | 1 | 0
Description uniqueness (> 10%) | 0 | 0 | 1 | 0
Brotli support | 0 | 0 | 0 | 0
WebP support | 0 | 0 | 1 | 0
AVIF support | 0 | 0 | 1 | 0
Accessibility
-------------
Nothing to report.
Source domains
--------------
Domain | Totals | HTML | Redirect
--------------------------------------------------------------------
azure.com | 1/133B/423ms | | 1/133B/423ms
azure.microsoft.com | 1/493kB/641ms | 1/493kB/641ms |
Content types
-------------
Content type | URLs | Total size | Total time | Avg time | Status 20x | Status 30x
-------------------------------------------------------------------------------------
HTML | 1 | 493 kB | 641 ms | 641 ms | 1 | 0
Redirect | 1 | 133 B | 423 ms | 423 ms | 0 | 1
Content types (MIME types)
--------------------------
Content type | URLs | Total size | Total time | Avg time | Status 20x | Status 30x
---------------------------------------------------------------------------------------------------
text/html | 1 | 133 B | 423 ms | 423 ms | 0 | 1
text/html;charset=utf-8 | 1 | 493 kB | 641 ms | 641 ms | 1 | 0
DNS info
--------
DNS resolving tree
------------------------------------------------------------------------
azure.com
IPv4: 20.231.239.246
IPv4: 20.112.250.133
IPv4: 20.236.44.162
IPv4: 20.70.246.20
IPv4: 20.76.201.171
DNS server: 127.0.0.53
Security
--------
Nothing to report.
Analysis stats
--------------
Class::method | Exec time | Exec count
-------------------------------------------------------------------------------
SslTlsAnalyzer::getTLSandSSLCertificateInfo | 917 ms | 1
Manager::parseDOMDocument | 96 ms | 1
BestPracticeAnalyzer::checkMissingQuotesOnAttributes | 21 ms | 1
BestPracticeAnalyzer::checkNonClickablePhoneNumbers | 14 ms | 1
BestPracticeAnalyzer::checkMaxDOMDepth | 12 ms | 1
BestPracticeAnalyzer::checkHeadingStructure | 4 ms | 1
BestPracticeAnalyzer::checkInlineSvg | 1 ms | 1
SeoAndOpenGraphAnalyzer::analyzeSeo | 0 ms | 1
SeoAndOpenGraphAnalyzer::analyzeOpenGraph | 0 ms | 1
SeoAndOpenGraphAnalyzer::analyzeHeadings | 0 ms | 1
BestPracticeAnalyzer::checkTitleUniqueness | 0 ms | 1
BestPracticeAnalyzer::checkBrotliSupport | 0 ms | 1
BestPracticeAnalyzer::checkMetaDescriptionUniqueness | 0 ms | 1
BestPracticeAnalyzer::checkWebpSupport | 0 ms | 1
BestPracticeAnalyzer::checkAvifSupport | 0 ms | 1
Content processor stats
-----------------------
Class::method | Exec time | Exec count
-----------------------------------------------------------------------------------
HtmlProcessor::findUrls | 3 ms | 1
HtmlProcessor::applyContentChangesForOfflineVersion | 3 ms | 1
NextJsProcessor::applyContentChangesBeforeUrlParsing | 0 ms | 1
HtmlProcessor::applyContentChangesBeforeUrlParsing | 0 ms | 2
AstroProcessor::applyContentChangesBeforeUrlParsing | 0 ms | 1
JavaScriptProcessor::applyContentChangesBeforeUrlParsing | 0 ms | 1
SvelteProcessor::applyContentChangesBeforeUrlParsing | 0 ms | 1
CssProcessor::applyContentChangesBeforeUrlParsing | 0 ms | 1
================================================================================================================================================================================
Total execution time 3.2 s using 10 workers and 2048M memory limit (max used 8 MB)
Total of 2 visited URLs with a total size of 493 kB and power of 0 reqs/s with download speed 152 kB/s
Response times: AVG 532 ms MIN 423 ms MAX 641 ms TOTAL 1.1 s
================================================================================================================================================================================
Summary
-------
⚠️ No titles provided for uniqueness check.
⚠️ No meta descriptions provided for uniqueness check.
⚠️ No WebP image found on the website.
⚠️ No AVIF image found on the website.
⚠️ 1 page(s) with skipped heading levels.
⚠️ 1 page(s) with deep DOM (> 30 levels).
⏩ Redirects - 1 redirect(s) found.
⏩ DNS IPv6: domain azure.com does not support IPv6 (DNS server: 127.0.0.53).
✅ 404 OK - all pages exists, no non-existent pages found.
✅ SSL/TLS certificate is valid until Sep 5 18:13:29 2025 GMT. Issued by C=US, O=Microsoft Corporation, CN=Microsoft Azure RSA TLS Issuing CA 08. Subject is C=US, ST=WA, L=Redmond, O=Microsoft Corporation, CN=gamedev.microsoft.com.
✅ SSL/TLS certificate issued by 'C=US, O=Microsoft Corporation, CN=Microsoft Azure RSA TLS Issuing CA 08'.
✅ Performance OK - all non-media URLs are faster than 3 seconds.
✅ HTTP headers - found 7 unique headers.
✅ All pages support Brotli compression.
✅ All pages have quoted attributes.
✅ All pages have inline SVGs smaller than 5120 bytes.
✅ All pages have inline SVGs with less than 5 duplicates.
✅ All pages have valid or none inline SVGs.
✅ All pages without multiple <h1> headings.
✅ All pages have <h1> heading.
✅ All pages have clickable (interactive) phone numbers.
✅ All pages have valid HTML.
✅ All pages have image alt attributes.
✅ All pages have form labels.
✅ All pages have aria labels.
✅ All pages have role attributes.
✅ All pages have lang attribute.
✅ DNS IPv4 OK: domain azure.com resolved to 20.231.239.246, 20.112.250.133, 20.236.44.162, 20.70.246.20, 20.76.201.171 (DNS server: 127.0.0.53).
✅ Security - no findings.
📌 Text report saved to '/usr/local/siteone-crawler/tmp/azure.com.output.20241004-021643.txt' and took 0 ms.
📌 JSON report saved to '/usr/local/siteone-crawler/tmp/azure.com.output.20241004-021643.json' and took 1 ms.
📌 HTML report saved to '/usr/local/siteone-crawler/tmp/azure.com.report.20241004-021643.html' and took 37 ms.
📌 Offline website generated to '/home/bond/Desktop/agent-collector/utils/website-scraper/../../data/synthetic/clones/azure' and took 4 ms.
And the offline website generated just gives the following: <meta http-equiv="refresh" content="0; url=https://azure.microsoft.com/en-us/"> Redirecting to https://azure.microsoft.com/en-us/ ...
The bad directory issue was on my side, but I am unsure as to how to resolve the above issue.
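One way to spot this situation automatically after a crawl is to check whether an exported page is nothing but a meta-refresh redirect. The following is a minimal sketch, not part of the crawler; the function name `redirect_stub_target` and the regular expression are assumptions made for illustration and only cover the tag shape shown above.

```python
import re

# Matches <meta http-equiv="refresh" content="0; url=..."> as seen in the export above.
META_REFRESH = re.compile(
    r'<meta\s+http-equiv=["\']refresh["\']\s+content=["\']\s*\d+\s*;\s*url=([^"\']+)["\']',
    re.IGNORECASE,
)


def redirect_stub_target(html):
    """Return the meta-refresh target if the page contains one, else None."""
    match = META_REFRESH.search(html)
    return match.group(1).strip() if match else None
```

A wrapper script could read each exported index.html, call this function, and log a warning with the target URL when the result is not None.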
The problem is that you export all the sites to the same folder, "clones". Each domain you crawl needs its own dedicated folder. So instead of the "clones" folder, define for example "clones/worldbank.org".
I'll add this information to the documentation as well to make it clear.
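A wrapper script could derive such a per-site folder from the URL it is about to crawl. This is a rough sketch under the assumption that the script is written in Python; `export_dir_for` is a hypothetical helper, and stripping a leading "www." is a design choice of this example, not something the crawler requires.

```python
from pathlib import Path
from urllib.parse import urlsplit


def export_dir_for(url, clones_root):
    """Return a per-site export directory such as <clones_root>/worldbank.org."""
    # urlsplit only fills in the host when a scheme is present, so add one
    # for bare inputs like "office365.com".
    if "://" not in url:
        url = "https://" + url
    host = urlsplit(url).hostname or ""
    if not host:
        raise ValueError(f"cannot determine host from {url!r}")
    # Drop a leading "www." so both spellings of a site share one folder.
    if host.startswith("www."):
        host = host[4:]
    return Path(clones_root) / host
```

The result can then be passed as the value of --offline-export-dir, so that no two sites write into the same directory.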
As for azure.com: this domain redirects to a completely different domain (azure.microsoft.com).
The crawler has a mechanism that follows a redirect from the first defined URL and crawls the target domain as well, but only if the second-level domain has not changed.
So this works correctly if, for example, the --url domain "abc.com" redirects to "www.abc.com", or vice versa "www.abc.com" redirects to "abc.com", or the domain "abc.com" redirects to "subdomain.en.abc.com".
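The rule described above can be approximated in a few lines. This is an illustrative sketch only, not the crawler's actual implementation: it naively treats the last two labels of the host as the second-level domain, so it would misjudge suffixes such as "co.uk". Both function names are hypothetical.

```python
from urllib.parse import urlsplit


def registrable_guess(host):
    """Naive guess: the last two dot-separated labels ("www.abc.com" -> "abc.com")."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def same_second_level_domain(start_url, redirect_url):
    """True if both URLs appear to share the same second-level domain."""
    start_host = urlsplit(start_url).hostname or ""
    target_host = urlsplit(redirect_url).hostname or ""
    return bool(start_host) and registrable_guess(start_host) == registrable_guess(target_host)
```

With this check, a redirect from https://abc.com/ to https://subdomain.en.abc.com/ counts as the same site, while a redirect from https://azure.com/ to https://azure.microsoft.com/en-us/ does not, which matches the behaviour seen in the log.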
In that case, for Microsoft pages, this could help: --url=https://www.azure.com --allowed-domain-for-crawling='*.microsoft.com'
Upon running the crawler with this script: https://gist.github.com/devinat1/38a3261736e2a4cf5b54af3107b753e0, I am getting the following output for several of the sites:
<meta http-equiv="refresh" content="0; url=../index.html"> Redirecting to https://www.atlassian.com/ ...
I am also getting a strange directory structure as follows: