Closed: pdehaan closed this 6 months ago
Thinking about this a bit more (and removing the --silent flag), I'm not sure how well this will work. It doesn't seem to recursively go through linked internal pages as I first suspected.
So if we want a FULL link check, we might have to list the pages explicitly (which means checking each breach detail page individually). For example, https://www.toondoo.com and http://toondoo.com are currently offline, so that link at the top of the breach detail page 404s.
Ah, OK, so it looks like view-source:https://monitor.firefox.com/breaches has the following markup:
<!-- breach cards -->
<div id="all-breaches" class="all-breaches flx"></div>
And it looks like we're injecting the breach cards into the DOM asynchronously, which explains why a link checker might not pick those links up when scanning the served markup for anchor tags. We'll probably have to use the API: fetch the /hibp/breaches endpoint, then scan each /breach-details/${breach.Name} page separately.
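A rough sketch of that idea, assuming the /hibp/breaches endpoint returns a JSON array of objects with a `Name` property (as the `${breach.Name}` template above implies). The helper names `breachDetailUrls` and `fetchBreachDetailUrls` are made up for illustration, and the fetch step assumes Node 18+ with a global `fetch`:

```javascript
// Hypothetical helper: turn the breaches JSON into the list of detail pages to check.
// Entries without a string Name are skipped rather than producing a broken URL.
function breachDetailUrls(baseUrl, breaches) {
  return breaches
    .filter((breach) => breach && typeof breach.Name === "string")
    .map((breach) =>
      new URL(`/breach-details/${encodeURIComponent(breach.Name)}`, baseUrl).href
    );
}

// Assumption: /hibp/breaches responds with a JSON array (not verified here).
async function fetchBreachDetailUrls(baseUrl) {
  const res = await fetch(new URL("/hibp/breaches", baseUrl));
  if (!res.ok) {
    throw new Error(`Unexpected status ${res.status} from /hibp/breaches`);
  }
  return breachDetailUrls(baseUrl, await res.json());
}
```

The resulting list could then be handed to the link checker one page at a time, instead of relying on it to discover the breach cards by itself.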
Moving initial work to https://github.com/pdehaan/blurts-link-checker, so I stop polluting this GitHub thread and everybody's notifications.
Current implementation works, but is sloooow to scrape+recurse 420-ish links.
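One possible way to speed this up (an assumption on my part, not what the linked repo necessarily does) is to check several pages in parallel with a fixed cap, so a few hundred URLs don't run strictly one after another. `mapWithConcurrency` is a hypothetical helper name, and the worker passed in would be whatever function checks a single page:

```javascript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results are returned in the same order as the input items.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item to hand out

  // Each runner keeps pulling the next unclaimed index until none are left.
  async function run() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => run()));
  return results;
}
```

A low cap (say 5 to 10) is probably the polite choice here, since most of the links point at third-party sites.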
Closing since we've redesigned the site and functionality since this was created. If you feel that this is still needed, please let me know.
Not sure if we want to use this for anything, but it's pretty interesting, and may be useful in some sort of pre-prod deploy checklist.
Actually, this seems to have changed very recently. Looks like we did a production push and just added that missing ToonDoo.png logo.
As for the HIBP link which is 403ing, we can either remove the "www." subdomain from the link, or just add a config to ignore that domain/error/403:
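As a rough illustration of the second option only: the sketch below is not the link checker's real config format. `ignoreRules` and `shouldIgnore` are hypothetical names, and the hostname is assumed to be the "www." variant of the HIBP domain mentioned above.

```javascript
// Hypothetical ignore list: one entry per hostname/status pair we accept as a known failure.
const ignoreRules = [
  { hostname: "www.haveibeenpwned.com", status: 403 }, // assumed hostname
];

// Returns true when a check result matches one of the ignore rules exactly.
// `result` is assumed to look like { url: string, status: number }.
function shouldIgnore(result, rules = ignoreRules) {
  const { hostname } = new URL(result.url);
  return rules.some(
    (rule) => rule.hostname === hostname && rule.status === result.status
  );
}
```

Matching on both hostname and status keeps the rule narrow, so a 404 or 500 from the same domain would still be reported.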