Open devinat1 opened 1 month ago
Hi @devinat1,
on the BBC website there is some logic related to JavaScript support and displaying placeholders before images. Unfortunately, I don't have time to analyse all the JavaScript on the BBC website in detail to understand why this is happening.
In the BBC case, to display the images, it is necessary to remove the placeholder tag with the `hide-when-no-script` class, which overlays the images.
To perform such replacements, I have implemented a new `--replace-content` option. Possible values are `old -> new` or `/old-regex/ -> new`. For now, the only way to use this option is to run the crawler from the current `main` branch of the Git repository. If you are using macOS, the instructions are here: https://crawler.siteone.io/installation-and-requirements/manual-installation/#macos-x64-intel
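To make the two value forms concrete, here is a minimal Python sketch of how a rule such as `old -> new` or `/old-regex/ -> new` could be interpreted. It is only an illustration, not the crawler's actual implementation: the function name `apply_rule`, the whitespace trimming around the arrow, and the handling of only the `i` flag are all assumptions.

```python
import re


def apply_rule(content: str, rule: str) -> str:
    """Apply one 'old -> new' or '/old-regex/flags -> new' rule (hypothetical helper)."""
    # Split on the first "->"; trimming surrounding whitespace is an assumption.
    old, _, new = rule.partition("->")
    old, new = old.strip(), new.strip()

    # A value of the form /pattern/flags is treated as a regular expression.
    match = re.fullmatch(r"/(.+)/([a-z]*)", old, flags=re.DOTALL)
    if match:
        pattern, flag_chars = match.groups()
        flags = re.IGNORECASE if "i" in flag_chars else 0
        # A function replacement keeps `new` literal (no backslash escapes).
        return re.sub(pattern, lambda _m: new, content, flags=flags)

    # Otherwise do a plain, case-sensitive string replacement.
    return content.replace(old, new)


literal_result = apply_rule("a b a", "a -> c")
regex_result = apply_rule("Foo foo", "/foo/i -> x")
```

An empty right-hand side, as in the BBC case, simply deletes every match.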
Below is a realistic example. With this command, the cloned BBC website displays the images correctly.
./crawler \
--url=https://www.bbc.com/ \
--max-visited-urls=500 \
--offline-export-dir=tmp/bbc.com \
--replace-content='/<img[^>]+class="[^"]*hide-when-no-script[^"]*"[^>]*>/i -> '
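To see what the regular expression in that command matches, the following Python sketch applies the same pattern to a small HTML fragment. The markup in `sample` is invented for illustration and is not copied from bbc.com; the crawler itself is not written in Python, so this only demonstrates the effect of the pattern.

```python
import re

# Same pattern as in the --replace-content value above, with the /i flag.
PLACEHOLDER = re.compile(
    r'<img[^>]+class="[^"]*hide-when-no-script[^"]*"[^>]*>',
    re.IGNORECASE,
)

# Hypothetical markup: a placeholder image overlaying the real article image.
sample = (
    '<div class="image-wrapper">'
    '<img src="placeholder.svg" class="placeholder hide-when-no-script" alt="">'
    '<img src="article.jpg" class="article-image" alt="Article photo">'
    '</div>'
)

# Replacing each match with an empty string removes the placeholder tag only.
cleaned = PLACEHOLDER.sub("", sample)
print(cleaned)
```

Because `[^>]` cannot cross a tag boundary, only the `<img>` tag carrying the `hide-when-no-script` class is removed and the real image is left in place.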
I am getting the following issue with sites exported offline by the crawler: https://www.loom.com/share/755b0efd840c48fc8f6f0be0114c6e8e I can only see an article's image when I hover over it.