mozilla / blurts-server

Mozilla Monitor arms you with tools to keep your personal information safe. Find out what hackers already know about you and learn how to stay a step ahead of them.
https://monitor.mozilla.org
Mozilla Public License 2.0

Link checker #1371

Closed pdehaan closed 6 months ago

pdehaan commented 4 years ago

Not sure if we want to use this for anything, but it's pretty interesting, and may be useful in some sort of pre-prod deploy checklist.

npx linkinator https://monitor.firefox.com -r --silent --no-color

[404] https://monitor.firefox.com/img/logos/ToonDoo.png
[403] https://www.haveibeenpwned.com/

https://monitor.firefox.com
  [404] https://monitor.firefox.com/img/logos/ToonDoo.png
https://monitor.firefox.com/breaches
  [403] https://www.haveibeenpwned.com/
ERROR: Detected 2 broken links. Scanned 22 links in 2.18 seconds.
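For a pre-prod checklist, output like the above could be reduced to a de-duplicated list of broken links. This is only a rough sketch: `parseBrokenLinks` and its `ignoreHosts` parameter are hypothetical names, and the `[status] url` line shape is assumed from the sample output here, not taken from linkinator's documentation.

```javascript
// Sketch: parse linkinator-style text output into broken-link records.
// Assumption: result lines look like "[404] https://example.com/x",
// as in the sample output in this thread. Other lines are ignored.
function parseBrokenLinks(output, ignoreHosts = []) {
  const seen = new Set();
  const broken = [];
  for (const line of output.split("\n")) {
    const match = line.trim().match(/^\[(\d{3})\]\s+(\S+)$/);
    if (!match) continue; // page headings, summary line, blanks

    const status = Number(match[1]);
    const url = match[2];
    if (status < 400 || seen.has(url)) continue; // keep errors, drop repeats

    let host;
    try {
      host = new URL(url).hostname;
    } catch (err) {
      continue; // not a parseable URL, skip it
    }
    if (ignoreHosts.includes(host)) continue; // hypothetical skip list

    seen.add(url);
    broken.push({ status, url });
  }
  return broken;
}
```

A deploy script could then fail the check whenever the returned array is non-empty, and pass known-flaky hosts in `ignoreHosts`.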

Actually, this seems to have changed very recently. Looks like we did a production push and just added that missing ToonDoo.png logo.

As for the HIBP link which is 403ing, we can either remove the "www." subdomain from the link, or just add a config to ignore that domain/error/403:

npx linkinator https://monitor.firefox.com -r --silent --skip www.haveibeenpwned.com --no-color

pdehaan commented 4 years ago

Thinking about this a bit more (and removing the --silent flag), I'm not sure how well this will work. It doesn't seem to recurse through linked internal pages as I first suspected.

So if we want a FULL link check, we might have to list pages explicitly (which means checking each breach detail page individually). For example, https://www.toondoo.com and http://toondoo.com are currently offline, so the link at the top of that breach detail page 404s.

Ah, OK, so it looks like view-source:https://monitor.firefox.com/breaches has the following markup:

<!-- breach cards -->
<div id="all-breaches" class="all-breaches flx"></div>

And it looks like we're injecting the breach cards into the DOM asynchronously, which explains why a link checker might not pick those links up when scanning the DOM for anchor tags. We'll probably have to use the API: fetch the /hibp/breaches endpoint, then scan each /breach-details/${breach.Name} page separately.
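A minimal sketch of that second step, building the list of breach detail pages to check from a breach list. `breachDetailUrls` is a hypothetical helper, and it assumes each breach object carries a string `Name` field, as the `${breach.Name}` path above implies.

```javascript
// Sketch: derive the breach-detail page URLs to check from a breach list.
// Assumption: each breach is an object with a string `Name` property.
function breachDetailUrls(baseUrl, breaches) {
  return breaches
    // Skip entries with no usable name instead of emitting a bad URL.
    .filter((breach) => breach && typeof breach.Name === "string" && breach.Name !== "")
    // Encode the name so unusual characters still produce a valid path.
    .map((breach) =>
      new URL(`/breach-details/${encodeURIComponent(breach.Name)}`, baseUrl).href
    );
}

// Possible usage (not run here; assumes /hibp/breaches returns a JSON array):
//   const breaches = await (await fetch(`${baseUrl}/hibp/breaches`)).json();
//   const pages = breachDetailUrls(baseUrl, breaches);
```

Each resulting URL could then be handed to the link checker one page at a time, which avoids relying on the checker to discover the async-injected cards.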

pdehaan commented 4 years ago

Moving initial work to https://github.com/pdehaan/blurts-link-checker, so I stop polluting this GitHub thread and everybody's notifications.

Current implementation works, but is sloooow to scrape+recurse 420-ish links.

EMMLynch commented 6 months ago

Closing since we've redesigned the site and functionality since this was created. If you feel that this is still needed, please let me know.