janreges / siteone-crawler

SiteOne Crawler is a cross-platform website crawler and analyzer for SEO, security, accessibility, and performance optimization—ideal for developers, DevOps, QA engineers, and consultants. Supports Windows, macOS, and Linux (x64 and arm64).
https://crawler.siteone.io/
MIT License
254 stars, 17 forks

Issue: Clones of sites do not show issues until hover #24

Open devinat1 opened 1 month ago

devinat1 commented 1 month ago

I am seeing the following issue with offline sites exported by the crawler: https://www.loom.com/share/755b0efd840c48fc8f6f0be0114c6e8e The image for an article only becomes visible when I hover over it.

janreges commented 1 month ago

Hi @devinat1,

on the BBC website there is some logic related to JavaScript support and displaying placeholders before images. Unfortunately, I don't have time to analyse all the JavaScript on the BBC website in detail to understand why this is happening.

In the BBC case, displaying the images requires removing the placeholder tag with the hide-when-no-script class, which overlays the images.

To make such replacements possible, I have implemented a new --replace-content option. Possible values are old -> new or /old-regex/ -> new. The option is currently available only when running the crawler from the current main branch of the Git repository. If you are using macOS, the instructions are here: https://crawler.siteone.io/installation-and-requirements/manual-installation/#macos-x64-intel

Below is a realistic example that produces a clone of the BBC website in which the images display correctly.

./crawler \
  --url=https://www.bbc.com/ \
  --max-visited-urls=500 \
  --offline-export-dir=tmp/bbc.com \
  --replace-content='/<img[^>]+class="[^"]*hide-when-no-script[^"]*"[^>]*>/i -> '
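
To preview what the regex in this command would strip before running a full crawl, the pattern can be tried on a snippet of HTML. The following is only a rough sketch in Python: the pattern and the case-insensitive flag are copied from the --replace-content value above, while the strip_placeholders helper and the sample markup are hypothetical and not taken from the crawler or from bbc.com. Python's re module stands in for whatever regex engine the crawler uses, so edge cases may behave differently there.

```python
import re

# Pattern and "i" flag copied from the --replace-content value above.
PLACEHOLDER_IMG = re.compile(
    r'<img[^>]+class="[^"]*hide-when-no-script[^"]*"[^>]*>',
    re.IGNORECASE,
)


def strip_placeholders(html: str) -> str:
    """Remove every <img> tag whose class attribute contains hide-when-no-script.

    Hypothetical helper for illustration; it mirrors the "-> " (empty
    replacement) part of the option value by substituting an empty string.
    """
    return PLACEHOLDER_IMG.sub("", html)


# Hypothetical markup, assumed for the example (not copied from bbc.com).
sample = (
    "<div>"
    '<img src="placeholder.png" class="placeholder hide-when-no-script">'
    '<img src="article.jpg" class="article-image">'
    "</div>"
)

# The placeholder <img> is removed; the real article image is left untouched.
print(strip_placeholders(sample))
```

Because `[^>]` cannot cross a closing `>`, each match stays inside a single tag, so only the `<img>` elements that carry the class are dropped.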