janreges / siteone-crawler

SiteOne Crawler is a cross-platform website crawler and analyzer for SEO, security, accessibility, and performance optimization—ideal for developers, DevOps, QA engineers, and consultants. Supports Windows, macOS, and Linux (x64 and arm64).
https://crawler.siteone.io/
MIT License

Bad directory structure and output upon running siteone crawler for clone generation #18

Closed: devinat1 closed this issue 1 month ago

devinat1 commented 1 month ago

I am running the crawler with this script: https://gist.github.com/devinat1/38a3261736e2a4cf5b54af3107b753e0

For several of the sites, I am getting the following output:

<meta http-equiv="refresh" content="0; url=../index.html"> Redirecting to https://www.atlassian.com/ ...
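A page like this is a small stub that sends the browser elsewhere through a meta refresh tag. To find out how many exported files are stubs of this kind, the export could be scanned for that tag. The following is only a minimal sketch in Python; `meta_refresh_target` is a hypothetical helper name and is not part of SiteOne Crawler.

```python
from html.parser import HTMLParser


class _MetaRefreshParser(HTMLParser):
    """Remembers the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attributes = {name.lower(): (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() != "refresh":
            return
        # The content attribute looks like "0; url=../index.html".
        _, _, rest = attributes.get("content", "").partition(";")
        key, _, value = rest.strip().partition("=")
        if key.strip().lower() == "url":
            self.target = value.strip()


def meta_refresh_target(html_text):
    """Return the meta refresh target URL, or None if the page has none."""
    parser = _MetaRefreshParser()
    parser.feed(html_text)
    return parser.target


stub = '<meta http-equiv="refresh" content="0; url=../index.html">'
print(meta_refresh_target(stub))
```

Calling this on every `.html` file in an export directory would show which files are redirect stubs and where they point.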

I am also getting a strange directory structure as follows:

sciafratideviant.html
scibids-ai
scibids-ai.html
science
science-instruments
science-instruments.html
science-tech
science-tech.html
science.247c7a1485.html
science.html
sciences
sciences.html
scientists-urgently-warn-stop-drinking-130000353.html
scienze
scienze-ambiente
scienze-ambiente.html
scienze.html
scornedwords
scornedwords.html
scottish-highlands-united-kingdom
scottmarshall
scottsdale-az
screener
screener.html
script
scripts
scripts.html
scrivi
scrivi.html
scuola
scuola.html
se
se-sv
se.html
sean-diddy-combs-hit-wave-190544482.html
search
search.1c1c7c493e.html
search.1e67f42081.html
search.28db2af6e4.html
search.2a141c7c88.html
search.2b1fb8d0f0.html
search.38e12fd961.html
search.3bbf2ed116.html
search.3e9c2b1105.html
search.41da4a681b.html
search.43aca65a58.html
search.47d8ecd605.html
search.47db041ad0.html
search.51596539c3.html
search.5d562d7e4d.html
search.659bffd15c.html
search.6b0a4daf7d.html
search.6dc3325e5b.html
search.729a11f72b.html
search.754005940c.html
search.75b2894bda.html
search.8b483e9e76.html
search.95d221ad78.html
search.9b96d793d2.html
search.9d075b5c46.html
search.a0ae0e1d37.html
search.a7862970c8.html
search.b0bd40aba6.html
search.c0ebf23336.html
search.cb14db3794.html
search.cb83ba1855.html
search.d484738c82.html
search.d8c58b7149.html
search.db8f4e2cde.html
search.e87cb56cc5.html
search.e9ac0e3b03.html
search.e9d2b9d62c.html
search.ef99949031.html
search.fe3892d8f4.html
search.html
seattle-mariners-oakland-athletics-9f1825d861004332a0d44436d096cf62.html
seaview-united-kingdom
secondary-dns
secondary-dns.html
secteur
section
secure-identity-commitment
secure-identity-commitment.html
securedrop.html
security.html
seijoishii
seijoishii.html
seiu8
seiu8.html
select
select.html
self-proclaimed-night-owl-transformed-120000355.html
sell-nvidia-buy-china-thats-223000129.html
sennheiser-hearing
sennheiser-hearing.html
seo
seo.html
seraphinaseow
seraphinaseow.html
series
serietv.html
service
services
services.html
servizi
servizi.html
serwisy.html
sesam-is-open
sesam-is-open.html
session-replay.html
sessions.html
sette
sette.html
settings.js
seville-spain
sex-and-love.html
sg
sg-en
shabnamwebsait
shabnamwebsait.html
shannonjade2
share-your-love
shelbyheinrich
shelbyheinrich.html
shimadougu
shimadougu.html
shinkinedo
shinkinedo.html
shoes-alcohol-products-impacted-dockworkers-174101648.html
shomeido
shomeido.html
shop.html
shopping
shopping.html
shortdocs.html
shortest
shortest.html
shorts
showbiz
showbiz.html
shows
shows.html
si-sl
sienaegiljum
sienaegiljum.html
sigma.html
sign-up
sign-up-for-cnbc-newsletters
sign-up-for-cnbc-newsletters.html
sign-up.html
sign_in.1f4d78de9e.html
sign_in.5bedf90760.html
sign_in.88c4ac91c1.html
sign_in.d306ab5922.html
sign_in.e8a2e09bbc.html
sign_in.html
sign_up.88c4ac91c1.html
sign_up.html
signin.html
signup.html
signup_login.html
site
site-help
site-map
site-map.html
sitearchive.html
sitemap
sitemap.html
sitemaps
sitemaps.html
sites
sk-sk
sl
small-business
small-business.html
smart-land-x
smart-land-x.html
smartwatch.html
smi-o-nas
smi-o-nas.html
smikalsgirl08.html
smoda
smoda.html
smtp-monitoring
smtp-monitoring.html
snowden2023
snowden2023.html
snthemes.js
socal
social-impact
social-media
social-media.html
social-network
social-network.html
social.html
sociedad
sociedad.html
societe
societe.html
societes
societes.html
societies.html
society
society.html
sofa555.html
software
software-development.html
software.html
soli.html
solutions
solutions.html
sonic-store
sonic-store.html
sonoma-ca
sonstiges
sophiaharrishp
sophiaharrishp.html
sophieolszowski
sophieolszowski.html
sorrento-australia
sortir-paris
sortir-paris.html
sounds
soundtracks.255f22cd9d.html
soundtracks.6fcda35058.html
sources
south-west-england-united-kingdom
south.html
southwold-united-kingdom
space
space-living-technology
space-living-technology.html
space-travel-technology
space-travel-technology.html
space.html
spacex.html
spain
sparks
sparks.html
spc
special
special-offers
special-offers.html
special_issues_guidelines.html
specials
specials.04190ffb7f.html
specials.37e80b1a80.html
specials.59245eebc9.html
specialsupplements
specialsupplements.html
spectacular_mockingbird3
spectacular_mockingbird3.html
speed
speed-test
speed-test.html
speed.html
spenceralthouse
spettacoli
spettacoli.html
spidelli
spidelli.html
spiele
spirit-halloweens-report-reveals-top-173300343.html
splash
splash.html
sport
sport.html
sports
sports.html
sportszyuen
sportszyuen-1
sportszyuen-1.html
sportszyuen.html
spotlight
spotlight-ctv-unlimited
spotlight-ctv-unlimited.html
spotlight-election-insights
spotlight-election-insights.html
spotlight.html
sq
squawk-box-europe
squawk-box-europe.html
squawk-box-us
squawk-box-us.html
squawk-on-the-street
squawk-on-the-street.html
srinagar-news.html
srv
ssl-certificates
ssl-certificates.html
ssl-monitoring
ssl-monitoring.html
st-andrews-united-kingdom
st-augustine-fl
st-george-ut
st-joseph-mi
st-petersburg-fl
stack-match-block-puzzle-game.html
stage
stage.html
staithes-united-kingdom
standalonesignup.a0472ab103.js
start
startup-programs
startup-programs.html
startups
startups.html
static
static-assets
statistics
statistics-glossary
statistics-glossary.html
stats
steak-au-poivre-classic-dish-170000422.html
steam_refunds
steamdeck
steamdeckdock
stec-tehnologii
stec-tehnologii.html
stephanie_hitchcock.html
stevie-nicks-releases-rousing-feminist-040028599.html
steward-health-ceo-refused-testify-195001803.html
stingsman21
stitek
stocks
stocks.html
stone-county-ar
storefront
stories
stories.html
stormyblueskys.html
story
story-charleston-told-oyster-okra-193817594.html
str
stranded-cancer-survivor-desperately-searches-190626199.html
stream
streaming
streaming-monitoring
streaming-monitoring.html
streaming.html
street-signs-asia
street-signs-asia.html
street-signs-europe
street-signs-europe.html
strompreisvergleich
strompreisvergleich.html
students
students.html
studies-and-reports
studies-and-reports.html
studio
studio.html
study
styl-zycia
style
style.html
styleguide
styles
sub
subject
subscribe
subscribe.38e4f979ba.html
subscribe.html
subscriber-terms-and-conditions.html
subscriber_agreement
subscription
subscription.191610357f.html
subscription.db3d6a3556.html
subscription.f059f21bc5.html
subscriptions
subscriptions.html
suche.bild.html
sunday-morning
sunday-morning.html
superdeal
superdeal.html
support
support-experience
support-experience.html
support-plans.html
support.html
supportticket.html
suscripciones
suscripciones.html
sv-fi
sv-fi.html
sv-se.html
svgs
sw_desktop.js
switch-to-android
switch-to-android.html
sydneywingfield
sydneywingfield.html
symbols
symptoms.html
szigi63.html
t
tab.html
tablets
tablets.html
tag
tagesgeld
tagesgeld.html
tags
taianwar.html
takano-eshop
takano-eshop.html
taliarebellious
taliarebellious.html
taliee-art
tampa-bay-rays-boston-red-sox-athlete-injuries-bc8b816ecd3347339d79b174cf343635.html
tangier-morocco
taos-nm
taos-ski-valley-nm
target-evad3rs.html
tarifa-spain
tariffs
tariffs.html
tarot-reading
tarot-reading.html
tastyhome.html
tax.html
taylor_steele
tcp-monitoring
tcp-monitoring.html
teacher-network.html
teal-pumpkins-blue-buckets-mean-213739385.html
team
team-playbook.html
team.html
teams
tech
tech-asia.html
tech-guide
tech-guide.html
tech-tonic.html
tech.html
technology
technology-transfer-spinoffs
technology-transfer-spinoffs.html
technology.html
techvalidate
tecnologia
tecnologia.html
telegram-svgrepo-com.svg
television
television-15c9041da8e74d42917966a13b863f92.html
television.html
templates
templates.html
tenby-united-kingdom
tennis-depot
tennis-depot.html
terminal.html
terms
terms-and-conditions
terms-and-conditions.html
terms-conditions.html
terms-of-service
terms-of-service.html
terms-of-use
terms-of-use-ru.pdf
terms-of-use.html
terms.html
termsandconditions
termsofservice.html
test.html
testimonials
testimonials.html
tests-procedures
tests-procedures.html
texas
texas-rangers-los-angeles-angels-8cf5ea40a9d24524b2b191a8159fc125.html
texas-united-states
texas.html
th
th-th
th.html
th_th.html
the-bonne-maman-advent-calendar-is-hereand-its-going-to-sell-out-soon-185002452.html
the-daily-report
the-daily-report.html
the-dish
the-dish.html
the-exchange
the-exchange.html
the-takeout
the-takeout.html
the-vergecast.html
the-villages-fl
theamityaffliction
theartfans.html
theatre.html
thebetterway.html
thecuteadopt.html
thefulkrum
theguardian
themaine.html
themanfromno.thing
themen
themes
there-are-early-prime-day-apple-deals-that-cant-be-ignored--including-a-record-low-ipad-over-100-off-182838788.html
think-know-altria-heres-1-222400053.html
thirstyrover
thiruvananthapuram-news.html
this-2-tier-under-sink-organizer-is-a-great-space-saver-and-today-its-30-off-194226920.html
this-descaler-makes-coffee-taste-way-better--and-its-just-13-for-a-3-pack-125927339.html
this-is-my-next.html
thomas-emmer-john-kirby-jeff-flake-marco-rubio-mark-kelly-625b77947cd61fac2752ce22bca58048.html
threads.html
tigeregern.html
tigles1artistry
tiktokers-touting-vibration-plates-health-120000675.html
tilt
tilt.html
timoong
timoong.html
tiny-florida-town-wiped-off-202931714.html
tips
tips-shopping-medicare-advantage-plans-130647037.html
tips-shopping-medicare-advantage-plans-130647454.html
tips.html
toca
toca.html
toconnect
toconnect.html
today-in-history
todos-os-sites
todos-os-sites.html
toggle_desktop_layout_cookie.html
token-unlocks
token-unlocks.html
tokens
tokens.html
tokilltheking
tom-huddleston-jr
tom-huddleston-jr.html
tone
tonya-parsons.html
toolkit
toolkit.html
tools
tools.html
top-rated
top-rated.html
top-sellers.html
top-youth-official-un-talks-125226915.html
topic
topics
topics-list.html
topics.53ca715bfa.html
topics.f0861376f0.html
topics.html
toronto-blue-jays-miami-marlins-7c7e25478b4e4fd5b0d2f8dad3d93df5.html
torremolinos-spain
touch-icon-ipad-retina.png
touch-icon-ipad.png
touch-icon-iphone-retina.png
touch-icon-iphone.png
tour-de-france
tour-de-france.html
tour.html
toyota.html
tr
tr-tr
tr-tr.html
tr_tr.html
trader-talk
trader-talk.html
trading
trading-platform
trading-platform.html
trading.html
trailers.html
transcripts.html
transfer-data-android-to-android
transfer-data-android-to-android.html
transparency-center
transparency-center.html
transportation
transportation.html
travel
travel.html
trd5ufrfu2q-123.html
treasury.html
tren-aragua-members-violent-venezuelan-221200159.html
trending
trending-cryptocurrencies
trending-cryptocurrencies.html
trends
trends.html
trials
trials.html
true-crime.html
truehealth
truehealth.html
trump-continues-warn-election-cheating-132241995.html
trump-escalates-dark-rhetoric-against-214142274.html
trump-savages-kamala-harris-fiery-223243014.html
trump-visits-wisconsin-town-shaken-232036718.html
trust
trust-center.html
trust.html
try
try.html
tsukiadoptshop
tsukiadoptshop.html
tucson-az
tupacshakurofficial
turystyka
tv
tv-and-radio
tv-and-radio.html
tv-schedule
tv-schedule.html
tv.html
tvandmovies.html
tw-zh
twilight_fanpire
two-wizards-bickering-nate-silver-205730144.html
twylasheridan
tychees
tychees.html
type
u
u-official-says-iran-preparing-145003338.html
ua-uk
udp-monitoring
udp-monitoring.html
ui
uk
uk-modern-slavery-act
uk-modern-slavery-act.html
uk-news
uk-news.html
uk-ua.html
ultimas-noticias
ultimas-noticias.html
ultimas.html
ultimate-electric-vehicle-ev-stock-223600009.html
unblock-eporner
unblock-eporner.html
uncharted
uncharted.html
undefined.html
une-information-transparente-franceinfo
une-information-transparente-franceinfo.html
united-kingdom
united-states
united-states-government-9568ae446766435a82222422bba42824.html
universa
universa.html
unrealjackalope
unterhaltung
upcoming
upcoming.html
updates
updates.html
uplift
uplift.html
upload
upload.html
us
us-elections.html
us-en
us-market-movers
us-market-movers.html
us-markets
us-markets-bundle
us-markets-bundle.html
us-markets.html
us-medicare-says-part-d-204943184.html
us-news
us-news.html
us-strengthens-lebanon-travel-advisory-203749906.html
us.34431ec9f7.html
us.html
use-case
use-cases
use-cases.html
user
user-research-community
user-research-community.html
users
usingthebbc
usr
uy-es
v
vagrantscout
vagrantscout.html
valeri-tafelvain-y
valerie-bertinelli-reveals-surprising-reason-201922696.html
vamosver.html
vance-faces-biggest-moment-political-210000279.html
vaticano
ve-es
venerdi
venerdi.html
veravernanda.html
vergleich
vergleich.html
versions
verwalten-sie-die-utiq-technologie-66573a2eaad9b31829419956.html
vi
vi-assets
vi_vn
vi_vn.html
video
video-019M7OeMGVT
video-0OvcWWth4tp
video-1mI6tL8Wx5Q
video-3HCNklXAHsZ
video-3fEIm2cHCjK
video-3wy6tkMV5HQ
video-4BBAoBAN7pz
video-5hzSitCCSnX
video-7VrMso1HNYs
video-7vuFC88WYGw
video-83gKDqwJ7kP
video-8mUJyP9gESP
video-9KYacC3PSpM
video-9MRE1gcRXB7
video-9kW267Sg5jt
video-9pzYqpKCmIc
video-9w60pR8wU22
video-AT6HzRzipHz
video-ATjt8PlTcax
video-C8ESoYBjq5m
video-DIag6Xfte7W
video-Dx3YZbwecBm
video-DyW6DERr7s0
video-E2hyffulSiW
video-EPbHFYIA3ob
video-ElQqGgbCOP0
video-F327o3senck
video-F5h61kS0yec
video-F8RWGAsrUAT
video-FUYcblpSqR8
video-FaWGE1JliI7
video-FhA4D4nsP8m
video-GYC9PBzjtoh
video-ImEJ1f5iMOS
video-IyR9GpavkPK
video-JQXWLakN1gH
video-JZDDyo0dSSN
video-KhcY65iIRyZ
video-L1ynUv6UF3R
video-L7SD1O9CQa2
video-LNpxXyoITzS
video-Lrnm8378xpW
video-M7r8nPI10bO
video-M8TB9XBPWbU
video-MKEEPv0D0Tm
video-MQdzWQ1RUVC
video-McNm2YNebBE
video-OK940uws5Sf
video-PS16B5v0JnW
video-Pl4Zqvo2x4M
video-QNNjFrGFIId
video-QUmihzderDR
video-Qcgq2aYLJO5
video-R8fOcApDXzN
video-RCSOoXI9jqG
video-RRyKrhsOvh9
video-RkLmeRGbNzg
video-RoMdI49pg06
video-SnusL9SlWK3
video-T1VReshfEWP
video-TFu5Whl4pLM
video-TGmSq9f60o7
video-TMGVpKtxV14
video-TPpelibgxGs
video-VCTdpO6uPkm
video-VkeozeLpk82
video-VwnmEWdipDp
video-X27Pvso0TG8
video-XJr6lU3R8wP
video-Y9hYss29x2r
video-YBf99vH75Zk
video-YTGo2mTvVhC
video-YkeC9Fqi5fX
video-YmDxtlRnwCd
video-Z6eSaOin8w2
video-ZK0ZZJmi3qB
video-advertising.html
video-anz3cuwhhWm
video-avcPgzlTJc2
video-bJtvp9tSDo2
video-bRWGq2SVdkg
video-bjUgE7JbMfB
video-cGJeAIbYbe2
video-ceo-interviews
video-ceo-interviews.html
video-dZS5qMKcBc6
video-dceCFFi6bMl
video-eiQjRw1KYif
video-fhC0HPM4Cge
video-gqFZ4rv4LYB
video-h8w96zMIuBH
video-hxygRVEQOdX
video-k6aTXNLYJ8J
video-kCNoHWRF8Gi
video-kPJWnCWKCEB
video-lYk0prSXztw
video-mGdp4732yrr
video-mgHLAkwyjsl
video-nOu7Q49LCfP
video-nciuqKslDcN
video-npBTDDCllRu
video-nyo4nOlw2Ib
video-p7RwzgeqE68
video-pIOEroRzdFW
video-qCPjxoHxCff
video-qtVHujNPAPm
video-sFTEwd1i3Hf
video-sPTJ0gESJnC
video-streaming-solutions
video-streaming-solutions.html
video-supposedly-showing-trump-diddy-205830388.html
video-tGVKiUPlyqe
video-tOFzC3zx9tS
video-tTKaEsdDVEy
video-taVP0V8uK6a
video-tb5kwvPx1BE
video-tr20vEAzaUN
video-u4XMquMbsMv
video-udejmhuXYhs
video-vHGDNBiOE96
video-vHstdwJrgN0
video-wQRgVbuciPN
video-wRREB91R94T
video-wYkG68TGj3q
video-wn2CBeQA14i
video-y5FCEW0jF7M
video-z3k6nX2ANr0
video-zSQXKoP5Ip0
video.html
videos
videos.html
vids
view
viewability
viewability.html
vinfo
violet02596
violet02596.html
virtual-events.html
vivabem
vivabem.html
vlp
vlp2
vn-vi
voices
voices.html
voidentir
voidentir.html
voodoochild4201.html
vouchercodes.html
vox
vox.html
voyages
voyages.html
vpn
vpn.html
vr
vrai-ou-fake
vrai-ou-fake.html
vrhardware
vueland
waco-tx
waffle-house-index-restaurant-chain-000425434.html
wakaflockaflame.html
wall-of-love
wall-of-love.html
wall-street-snack-success-charles-163552056.html
wallaroo-australia
walmart-holiday-deals-2024-here-is-everything-we-know-about-the-savings-event-plus-deals-to-shop-right-now-135448887.html
walz-claim-china-during-tiananmen-150500217.html
warehouse-native.html
warren-buffett-sold-11-stocks-075500251.html
washington-d-c-political-bar-230031618.html
watch
watch-baby-reaction-seeing-whole-200020273.html
watch-live-jd-vance-rallies-204500612.html
watch-live-news
watch-live-news.html
watchlist
watchlist.html
wayfairs-way-day-2024-sale-is-on-its-way-heres-what-we-know-so-far-plus-early-deals-to-shop-now-125502615.html
wbd
wcsstore
wdwparksgal
wealth
wealth.html
weather
weather.html
web-analytics.html
web-experimentation.html
web-monitoring
web-monitoring.html
web.html
webinar.html
webinars
webmanifest.json
webpack-runtime-a29f1f93979dc88929e3.js
webpack-runtime-e43a2d15ce522ec9d9ac.js
website
website-and-platform-user-privacy-policy.html
website-builder
website-builder.html
website-security.html
website-template
webstories
webstories.html
weddings.html
weekly.d3f9ba85aa.html
weightwatchers-ceo-oversaw-diet-companys-163536113.html
welcome-2023
welcome-2023.html
wellbeing
wellbeing.html
wellness
wellness.html
wells-next-the-sea-united-kingdom
what-is-a-debt-consolidation-loan-130235090.html
what-is-a-no-penalty-cd-165017820.html
what-is-a-reverse-mortgage-154616893.html
what-is-android
what-is-android.html
what-is-cloud-computing
what-is-cloud-computing.html
what-is-high-yield-checking-account-171337165.html
where-plug-power-3-years-220600170.html
whitby-united-kingdom
white-house
white-house.html
whitepapers
why-android
why-android.html
why-buying-home-could-easier-133008118.html
why-choose-hubspot.html
why-gene-therapy-sickle-cell-100242500.html
why-switch-to-android
why-switch-to-android.html
widget
widget-docs
widget.html
wiki
wiki.html
williamsburg-va
wirecutter
wirecutter.html
wix-capital.html
wmuc
won-knicks-wolves-blockbuster-does-003045696.html
word-search.html
wordpress
wordpress-hosting.html
wordpress.html
wordsmatter
work-management
workforce-identity
workforce-identity.html
working-at-statista
working-at-statista.html
worklife
workplace-apps.html
world
world-nation
world-nation.html
world-news
world-news.html
world.html
worthit.html
wp-apps
wp-content
wp-includes
wp-stat
wt17903fff1.html
wuwly
x-skeletta-x
y-setsubi
y-setsubi.html
yamada-denki
yamada-denki.html
yankees-anthony-rizzo-fractures-fingers-211509487.html
yckaden
yckaden.html
yearinreview.9227c65db1.html
yearinreview.9bc9eb392d.html
yellowcard.html
yield
yield.html
young-australia
your-privacy-choices
your-privacy-choices.html
your-taxes
your-taxes.html
youtube.html
z-sports
z-sports.html
za-en
zdroj
zelenskiy-says-trump-assured-him-185139747.html
zero-trust
zero-trust.html
zeus
zev-fima
zev-fima.html
zh
zh-hans
zh-hans.html
zh-hk
zh-hk.html
zh-my
zh-my.html
zh-sg
zh-sg.html
zh-us
zh-us.html
zh.html
zh_hk.html
zh_tw.html

devinat1 commented 1 month ago

This happens when scraping the top 100 sites from this list.

devinat1 commented 1 month ago

Ideally, I want one directory per site, containing that site's HTML, CSS, and JS, so that I can serve the sites myself.

devinat1 commented 1 month ago

Because of the way the clones were saved, opening one site shows a different one: if I open the HTML for worldbank.org, I actually get the HTML content of HP.

janreges commented 1 month ago

Hi @devinat1,

First of all, I recommend making your script log or print the final commands it runs, including all the parameters it sets.

In particular, to help effectively, I need to know the final crawler --xyz command that your script is trying to run. From that, I can probably find the cause of the problem very quickly.
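A minimal sketch of such logging, assuming the script is written in Python and builds the crawler arguments as a list (the thread does not show the script itself). The `format_command` name and the example argument list are illustrative; the options are only those that appear in this thread.

```python
import shlex


def format_command(args):
    """Return a shell-quoted, copy-pasteable form of an argument list."""
    return shlex.join(args)


# Illustrative argument list, built only from options mentioned in this thread.
args = [
    "./crawler",
    "--url=https://www.worldbank.org/",
    "--max-visited-urls=500",
    "--offline-export-dir=tmp/worldbank.org",
    "--allowed-domain-for-external-files=*",
]

# Print the exact command just before running it (for example with subprocess.run).
print(format_command(args))
```

`shlex.join` quotes any argument that the shell would otherwise interpret, such as the `*` above, so the printed line can be pasted into a terminal or an issue report unchanged.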

By the way, I recommend adding --allowed-domain-for-external-files=*. This ensures that external JS, styles, fonts, and images are also downloaded from other domains, which is usually necessary because many sites load assets such as JS libraries from a CDN.

For example, here is the command for worldbank.org (limited to 500 URLs):

./crawler \
  --url=https://www.worldbank.org/ \
  --max-visited-urls=500 \
  --offline-export-dir=tmp/worldbank.org \
  --allowed-domain-for-external-files=*

And here is the content of the tmp/worldbank.org directory; the exported website works nicely. The directories starting with an underscore (_) are external domains. External assets (JavaScript, images, fonts, etc.) were downloaded from them so that the site works as well as possible offline.

[image: contents of the tmp/worldbank.org directory]
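Given the underscore naming convention described above, the top-level directories of an export can be separated into the site's own directories and the external-domain ones. This is only a sketch in Python; `split_export_dirs` is a hypothetical helper, and it assumes nothing about the export beyond that naming convention.

```python
from pathlib import Path


def split_export_dirs(export_dir):
    """Split top-level directories into (site, external) by the '_' prefix."""
    site_dirs, external_dirs = [], []
    for entry in sorted(Path(export_dir).iterdir()):
        if not entry.is_dir():
            continue  # skip files such as index.html
        if entry.name.startswith("_"):
            external_dirs.append(entry.name)
        else:
            site_dirs.append(entry.name)
    return site_dirs, external_dirs
```

For the example above, it would be called as `split_export_dirs("tmp/worldbank.org")` once the export has finished.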

devinat1 commented 1 month ago

Hi @janreges, thank you for your feedback. Here is one of the commands that my crawler script is running:

crawler '--url=office365.com', '--offline-export-dir=/home/bond/Desktop/agent-collector/utils/website-scraper/../../data/synthetic/clones/', '--workers=10', '--max-visited-urls=500', '--allowed-domain-for-external-files=*', '--ignore-robots-txt'

The same issue occurs with your suggestion of allowing domains for external files.
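Judging only from the command above, --offline-export-dir points at the shared clones/ directory, whereas the earlier worldbank.org example exported into its own tmp/worldbank.org directory. If every site is exported into the same directory, that could explain both the merged listing and one site's pages replacing another's, though the thread does not confirm this. The sketch below assumes a Python script and derives one export directory per site from the hostname. `site_dir_name` and `build_command` are hypothetical helpers, and the options are copied from the command above.

```python
from pathlib import Path
from urllib.parse import urlsplit


def site_dir_name(url):
    """Derive a directory name from a URL's hostname (hypothetical helper)."""
    # Accept bare hostnames such as "office365.com" by assuming https.
    if "://" not in url:
        url = "https://" + url
    host = urlsplit(url).hostname
    if not host:
        raise ValueError(f"cannot determine hostname from {url!r}")
    return host[4:] if host.startswith("www.") else host


def build_command(url, clones_root):
    """Build the crawler argument list with a per-site export directory."""
    export_dir = Path(clones_root) / site_dir_name(url)
    return [
        "crawler",
        f"--url={url}",  # passed through unchanged
        f"--offline-export-dir={export_dir}",
        "--workers=10",
        "--max-visited-urls=500",
        "--allowed-domain-for-external-files=*",
        "--ignore-robots-txt",
    ]


for site in ["office365.com", "https://www.worldbank.org/"]:
    print(build_command(site, "data/synthetic/clones"))
```

With this layout each site ends up in its own subdirectory of the clones root, which matches the one-directory-per-site structure asked for earlier in the thread.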

devinat1 commented 1 month ago

This is the exact output I am getting upon running the scraper:

 ####                ####             #####        
 ####                ####           #######        
 ####      ###       ####         #########        
 ####     ######     ####       ###### ####        
  ######################       #####   ####        
    #######    #######       #####     ####        
    #######    #######         #       ####        
  ######################               ####        
 ####     ######     ####              ####        
 ####       ##       ####              ####        
 ####                ####       ################## 
 ####                ####       ################## 

==================================================
# SiteOne Crawler, v1.0.8.20240824               #
# Author: jan.reges@siteone.cz                   #
==================================================

Progress report           | URL                                                                                 | Status | Type     | Time   | Size    | Access.  | Best pr.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1/2     | 50%  |>>>>>     | /                                                                                   | 301    | Redirect | 423 ms | 133 B   |          |         
2/2     | 100% |>>>>>>>>>>| https://azure.microsoft.com/en-us/                                                  | 200    | HTML     | 641 ms | 493 kB  |          | 2/5     

Redirected URLs
---------------

Status | Redirected URL                                       | Target URL                                           | Found at URL                                        
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
301    | /                                                    | https://azure.microsoft.com/en-us/                   |                                                     

404 URLs
--------

No 404 URLs found.

SSL/TLS info
------------

Info                   | Text                                                                                                                                              
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Issuer                 | C=US, O=Microsoft Corporation, CN=Microsoft Azure RSA TLS Issuing CA 08                                                                           
Subject                | C=US, ST=WA, L=Redmond, O=Microsoft Corporation, CN=gamedev.microsoft.com                                                                         
Valid from             | Sep 10 18:13:29 2024 GMT (VALID already 23.3 day(s))                                                                                   
Valid to               | Sep  5 18:13:29 2025 GMT (VALID still for 336.7 day(s))                                                                                
Supported protocols    | TLSv1.2, TLSv1.3                                                                                                            
RAW certificate output | Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            33:00:6c:7f:df:…6:a6:b2:28:28:
        8a:f7:d1:23:5c:b9:bd:87
RAW protocols output   | Connecting to 20.231.239.246
depth=2 C=US, O=DigiCert Inc, OU=www.digicert.com, CN=DigiCert Global…s not sent
Verify return code: 0 (ok)
---
DONE

TOP fastest URLs
----------------

No fast URLs fastest than 1 second(s) found.

TOP slowest URLs
----------------

No slow URLs slowest than 0.01 second(s) found.

SEO metadata
------------

No URLs.

OpenGraph metadata
------------------

No URLs with OpenGraph data (og:* or twitter:* meta tags).

Heading structure
-----------------

No URLs to analyze heading structure.

HTTP headers
------------

Header                    | Occurs | Unique | Values preview                                                                         | Min value  | Max value 
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Connection                | 1      | 1      | close                                                                                  |            |           
Content-Length            | 1      | -      | [ignored generic values]                                                               | 0 B        | 0 B       
Content-Type              | 1      | 1      | text/html                                                                              |            |           
Date                      | 1      | -      | [ignored generic values]                                                               | 2024-09-29 | 2024-09-29
Location                  | 1      | 1      | https://azure.microsoft.com/en-us/                                                     |            |           
Server                    | 1      | 1      | Kestrel                                                                                |            |           
Strict-Transport-Security | 1      | 1      | max-age=31536000                                                                       |            |           

HTTP header values
------------------

Header                    | Occurs | Value                                                                                                                   
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Connection                | 1      | close                                                                                                                   
Content-Type              | 1      | text/html                                                                                                               
Location                  | 1      | https://azure.microsoft.com/en-us/                                                                                      
Server                    | 1      | Kestrel                                                                                                                 
Strict-Transport-Security | 1      | max-age=31536000                                                                                                        

Best practices
--------------

Analysis name                            | OK    | Notice | Warning | Critical
--------------------------------------------------------------------------------
Large inline SVGs (> 5120 B)             | 2     | 0      | 0       | 0       
Invalid inline SVGs                      | 2     | 0      | 0       | 0       
Duplicate inline SVGs (> 5 and > 1024 B) | 2     | 0      | 0       | 0       
DOM depth (> 30)                         | 0     | 0      | 1       | 0       
Heading structure                        | 1     | 0      | 1       | 0       
Title uniqueness (> 10%)                 | 0     | 0      | 1       | 0       
Description uniqueness (> 10%)           | 0     | 0      | 1       | 0       
Brotli support                           | 0     | 0      | 0       | 0       
WebP support                             | 0     | 0      | 1       | 0       
AVIF support                             | 0     | 0      | 1       | 0       

Accessibility
-------------

Nothing to report.

Source domains
--------------

Domain              | Totals        | HTML          | Redirect    
--------------------------------------------------------------------
azure.com           | 1/133B/423ms  |               | 1/133B/423ms
azure.microsoft.com | 1/493kB/641ms | 1/493kB/641ms |             

Content types
-------------

Content type | URLs  | Total size | Total time | Avg time | Status 20x | Status 30x
-------------------------------------------------------------------------------------
HTML         | 1     | 493 kB     | 641 ms     | 641 ms   | 1          | 0         
Redirect     | 1     | 133 B      | 423 ms     | 423 ms   | 0          | 1         

Content types (MIME types)
--------------------------

Content type               | URLs  | Total size | Total time | Avg time | Status 20x | Status 30x
---------------------------------------------------------------------------------------------------
text/html                  | 1     | 133 B      | 423 ms     | 423 ms   | 0          | 1         
text/html;charset=utf-8    | 1     | 493 kB     | 641 ms     | 641 ms   | 1          | 0         

DNS info
--------

DNS resolving tree                                                    
------------------------------------------------------------------------
azure.com                                                             
  IPv4: 20.231.239.246                                                
  IPv4: 20.112.250.133                                                
  IPv4: 20.236.44.162                                                 
  IPv4: 20.70.246.20                                                  
  IPv4: 20.76.201.171                                                 

DNS server: 127.0.0.53                                                

Security
--------

Nothing to report.

Analysis stats
--------------

Class::method                                        | Exec time | Exec count
-------------------------------------------------------------------------------
SslTlsAnalyzer::getTLSandSSLCertificateInfo          | 917 ms    | 1         
Manager::parseDOMDocument                            | 96 ms     | 1         
BestPracticeAnalyzer::checkMissingQuotesOnAttributes | 21 ms     | 1         
BestPracticeAnalyzer::checkNonClickablePhoneNumbers  | 14 ms     | 1         
BestPracticeAnalyzer::checkMaxDOMDepth               | 12 ms     | 1         
BestPracticeAnalyzer::checkHeadingStructure          | 4 ms      | 1         
BestPracticeAnalyzer::checkInlineSvg                 | 1 ms      | 1         
SeoAndOpenGraphAnalyzer::analyzeSeo                  | 0 ms      | 1         
SeoAndOpenGraphAnalyzer::analyzeOpenGraph            | 0 ms      | 1         
SeoAndOpenGraphAnalyzer::analyzeHeadings             | 0 ms      | 1         
BestPracticeAnalyzer::checkTitleUniqueness           | 0 ms      | 1         
BestPracticeAnalyzer::checkBrotliSupport             | 0 ms      | 1         
BestPracticeAnalyzer::checkMetaDescriptionUniqueness | 0 ms      | 1         
BestPracticeAnalyzer::checkWebpSupport               | 0 ms      | 1         
BestPracticeAnalyzer::checkAvifSupport               | 0 ms      | 1         

Content processor stats
-----------------------

Class::method                                            | Exec time | Exec count
-----------------------------------------------------------------------------------
HtmlProcessor::findUrls                                  | 3 ms      | 1         
HtmlProcessor::applyContentChangesForOfflineVersion      | 3 ms      | 1         
NextJsProcessor::applyContentChangesBeforeUrlParsing     | 0 ms      | 1         
HtmlProcessor::applyContentChangesBeforeUrlParsing       | 0 ms      | 2         
AstroProcessor::applyContentChangesBeforeUrlParsing      | 0 ms      | 1         
JavaScriptProcessor::applyContentChangesBeforeUrlParsing | 0 ms      | 1         
SvelteProcessor::applyContentChangesBeforeUrlParsing     | 0 ms      | 1         
CssProcessor::applyContentChangesBeforeUrlParsing        | 0 ms      | 1         

================================================================================================================================================================================
Total execution time 3.2 s using 10 workers and 2048M memory limit (max used 8 MB)
Total of 2 visited URLs with a total size of 493 kB and power of 0 reqs/s with download speed 152 kB/s
Response times: AVG 532 ms MIN 423 ms MAX 641 ms TOTAL 1.1 s
================================================================================================================================================================================

Summary
-------

⚠️ No titles provided for uniqueness check.
⚠️ No meta descriptions provided for uniqueness check.
⚠️ No WebP image found on the website.
⚠️ No AVIF image found on the website.
⚠️ 1 page(s) with skipped heading levels.
⚠️ 1 page(s) with deep DOM (> 30 levels).
⏩ Redirects - 1 redirect(s) found.
⏩ DNS IPv6: domain azure.com does not support IPv6 (DNS server: 127.0.0.53).
✅ 404 OK - all pages exists, no non-existent pages found.
✅ SSL/TLS certificate is valid until Sep  5 18:13:29 2025 GMT. Issued by C=US, O=Microsoft Corporation, CN=Microsoft Azure RSA TLS Issuing CA 08. Subject is C=US, ST=WA, L=Redmond, O=Microsoft Corporation, CN=gamedev.microsoft.com.
✅ SSL/TLS certificate issued by 'C=US, O=Microsoft Corporation, CN=Microsoft Azure RSA TLS Issuing CA 08'.
✅ Performance OK - all non-media URLs are faster than 3 seconds.
✅ HTTP headers - found 7 unique headers.
✅ All pages support Brotli compression.
✅ All pages have quoted attributes.
✅ All pages have inline SVGs smaller than 5120 bytes.
✅ All pages have inline SVGs with less than 5 duplicates.
✅ All pages have valid or none inline SVGs.
✅ All pages without multiple <h1> headings.
✅ All pages have <h1> heading.
✅ All pages have clickable (interactive) phone numbers.
✅ All pages have valid HTML.
✅ All pages have image alt attributes.
✅ All pages have form labels.
✅ All pages have aria labels.
✅ All pages have role attributes.
✅ All pages have lang attribute.
✅ DNS IPv4 OK: domain azure.com resolved to 20.231.239.246, 20.112.250.133, 20.236.44.162, 20.70.246.20, 20.76.201.171 (DNS server: 127.0.0.53).
✅ Security - no findings.
📌 Text report saved to '/usr/local/siteone-crawler/tmp/azure.com.output.20241004-021643.txt' and took 0 ms.
📌 JSON report saved to '/usr/local/siteone-crawler/tmp/azure.com.output.20241004-021643.json' and took 1 ms.
📌 HTML report saved to '/usr/local/siteone-crawler/tmp/azure.com.report.20241004-021643.html' and took 37 ms.
📌 Offline website generated to '/home/bond/Desktop/agent-collector/utils/website-scraper/../../data/synthetic/clones/azure' and took 4 ms.

The generated offline website contains only the following: <meta http-equiv="refresh" content="0; url=https://azure.microsoft.com/en-us/"> Redirecting to https://azure.microsoft.com/en-us/ ...

devinat1 commented 1 month ago

The bad directory issue was on my side, but I am unsure as to how to resolve the above issue.

janreges commented 1 month ago

The problem is that you export all the sites to the same folder, "clones". Each domain you crawl needs its own dedicated folder. So instead of the "clones" folder, define, for example, "clones/worldbank.org".

I'll add this information to the documentation as well to make it clear.
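
For a script that crawls many sites in a loop, one way to get a dedicated folder per domain is to derive the folder name from each URL's host. The following is a minimal sketch, not part of the crawler: `export_dir_for` and the `clones` base folder are hypothetical names used for illustration, and the resulting path would still need to be passed to the crawler's offline export option in your own script.

```python
from pathlib import Path
from urllib.parse import urlparse


def export_dir_for(url: str, base: str = "clones") -> Path:
    """Return a per-domain export folder such as clones/worldbank.org.

    Hypothetical helper: the folder layout is an assumption, not
    something the crawler itself enforces.
    """
    host = urlparse(url).hostname
    if host is None:
        raise ValueError(f"URL has no host: {url!r}")
    # One folder per host, so exports of different sites never mix.
    return Path(base) / host.lower()


if __name__ == "__main__":
    for url in ["https://worldbank.org/", "https://azure.com/"]:
        print(url, "->", export_dir_for(url))
```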

janreges commented 1 month ago

As for azure.com - this domain redirects to a completely different domain.

The crawler has a mechanism that follows a redirect from the first defined URL and crawls the redirect target's domain as well, but only if the second-level domain has not changed.

So this will work correctly if, for example, the --url domain "abc.com" redirects to "www.abc.com", or vice versa "www.abc.com" redirects to "abc.com", or the domain "abc.com" redirects to "subdomain.en.abc.com".
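
The rule described above can be illustrated with a small sketch. This is not the crawler's actual implementation; it is a naive approximation that assumes the second-level domain is simply the last two labels of the host name (which is wrong for suffixes such as "co.uk"). The function names are hypothetical.

```python
def naive_second_level_domain(host: str) -> str:
    """Return the last two labels of a host, e.g. 'www.abc.com' -> 'abc.com'.

    Naive assumption: ignores multi-part public suffixes like 'co.uk'.
    """
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def same_second_level_domain(start_host: str, redirect_host: str) -> bool:
    """True if a redirect stays within the same second-level domain."""
    return (naive_second_level_domain(start_host)
            == naive_second_level_domain(redirect_host))


if __name__ == "__main__":
    pairs = [
        ("abc.com", "www.abc.com"),
        ("www.abc.com", "abc.com"),
        ("abc.com", "subdomain.en.abc.com"),
        ("azure.com", "azure.microsoft.com"),
    ]
    for start, target in pairs:
        print(start, "->", target, same_second_level_domain(start, target))
```

Under this approximation, the azure.com case fails the check because the redirect target belongs to microsoft.com, which matches the behavior reported in this issue.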

In that case, for the Microsoft pages, this could help: --url=https://www.azure.com --allowed-domain-for-crawling='*.microsoft.com'
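
To get a feel for which hosts a wildcard pattern like '*.microsoft.com' might cover, here is a rough sketch using shell-style matching from Python's standard library. The crawler's real matching rules may differ, so treat this only as an assumption-based illustration; `host_matches` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def host_matches(pattern: str, host: str) -> bool:
    """Shell-style wildcard match on host names, case-insensitive.

    Assumption: this only approximates how an allowed-domain pattern
    could behave; it is not taken from the crawler's source.
    """
    return fnmatchcase(host.lower(), pattern.lower())


if __name__ == "__main__":
    for host in ["azure.microsoft.com", "microsoft.com", "azure.com"]:
        print(host, host_matches("*.microsoft.com", host))
```

Note that with this kind of matching the bare host "microsoft.com" is not covered by '*.microsoft.com', because the pattern requires a label before the dot.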