dead link checker 💀🔩

vincent-peugnet commented 1 year ago

Check the external links and store them somewhere as cache. Add a class to the link tags, something like dead or exist_not (which is actually used for internal links).

Just check the server response using PHP's get_headers function.
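A minimal sketch of what that check could look like. The helper names `final_status_code` and `check_url` are made up for illustration and are not part of wcms. It relies on the fact that `get_headers` follows redirects by default and returns one status line per hop, so the last status line is the one to keep.

```php
<?php
/**
 * Extract the final HTTP status code from a non-associative
 * get_headers() result. With redirects there is one status line
 * per hop, so the last one wins.
 *
 * @param string[] $headers
 */
function final_status_code(array $headers): ?int
{
    $code = null;
    foreach ($headers as $line) {
        if (is_string($line) && preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

/**
 * Hypothetical helper: status code of $url, or null when there is
 * no response at all (DNS failure, timeout...).
 */
function check_url(string $url): ?int
{
    $headers = @get_headers($url); // returns false on failure
    return $headers === false ? null : final_status_code($headers);
}
```

Keeping the parsing separate from the network call makes the parsing part easy to test without hitting real servers.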

Cache entries should expire after about one month.

Dead links could be stored in pages and listed in the home view, or, associated pages would be stored with the cache, so it could be explored by editors.

- [x] extract links during render d61d50c201002270ed14e46b24954a0cbe0cfeaf
  - [x] checkbox in home menu to avoid URL checking in multi render 042971c5e878febe06c422071e8e87951728b607
  - [x] store external links in Pages 3d8ed4cfc0c43a14a37cabaa73ebc653e0858a89
- [x] check URLs header responses d61d50c201002270ed14e46b24954a0cbe0cfeaf
  - [x] consider 401 and 403 as non-dead URLs (reacting to CDN anti-scraping practices)
- [x] add HTML classes to <a>
  - [ ] choose the name of the classes (see comment)
    - [ ] Manual entry
- [x] use a cache 337e61e873c9fe9ab93b1447c756536936e0b455
  - [x] choose expire time: 90 days.
  - [x] add a button to flush the cache 042971c5e878febe06c422071e8e87951728b607
- [x] display external links count in home main table cc1373e66b0afc4fe5b4db8f4264972a998afbdc
  - [x] display dead links
  - [x] display unchecked links d5acc25fa1a6a86bc8228eccbd304a849ac0f699
- [x] display external links in graph 9b2be8b072251974bbdaf3b0c981c15ec411a62f
- [x] limit Web checkout duration d5acc25fa1a6a86bc8228eccbd304a849ac0f699
  - [x] use different timing for different scenarios.
- [x] unchecked URLs, or a too-old page containing external links, may cause re-rendering d5acc25fa1a6a86bc8228eccbd304a849ac0f699
- [x] Config option to disable URL checking 354085ef43297af3d17264517454ab739dc75a51

vincent-peugnet commented 1 year ago

I can imagine a more general idea:

Instead of just storing dead links, I could store all external links with their associated HTTP response code:

{
    "extlink": {
        "https://sdf.com": 200,
        "https://qepeor.fr": 404
    }
}

But at this point, why not also store the date it was checked?

Because this is useless data added to the page JSON, which will be loaded each time a page is loaded.
And if multiple pages have links pointing to the same URL, the result cannot be shared between them. Imagine an external link in a template: it will be checked for every page.

So there will be an external cache. What I imagine is a single JSON file with URLs as keys.

"https://apodo.com/dfdf.html": {
    "date": "2023-06-23",
    "response": 200
}
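A rough sketch of how a lookup in such a cache could handle expiry. The function name `cached_response` and the `$maxAgeDays` parameter are hypothetical, and the 90 day default is only one possible choice. The current date is passed in as an argument so the behaviour does not depend on the clock.

```php
<?php
/**
 * Return the cached response code for $url, or null when the URL is
 * not in the cache or its entry is older than $maxAgeDays.
 *
 * @param array<string, array{date: string, response: int}> $cache
 */
function cached_response(
    array $cache,
    string $url,
    DateTimeImmutable $now,
    int $maxAgeDays = 90
): ?int {
    if (!isset($cache[$url])) {
        return null;
    }
    $checked = new DateTimeImmutable($cache[$url]['date']);
    $expires = $checked->modify("+{$maxAgeDays} days");
    return $now < $expires ? $cache[$url]['response'] : null;
}

// The cache file content, decoded into an associative array.
$cache = json_decode(
    '{"https://apodo.com/dfdf.html": {"date": "2023-06-23", "response": 200}}',
    true
);
```

A null result means the URL has to be checked again, whether it was never seen or has simply expired.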

Using this in home view

This adds a lot of new data to pages!

I could use it to add new columns:

number of external links
number of dead links

These counts will be accessible through two new Page methods:

$page->countextlink(); // To count external links
$page->countextdeadlink(); // To count external dead links

Right now, if I wanted to do this more efficiently, I would store links in two different arrays (valid and dead links). But in the future I may also want to take advantage of other codes, like redirections. The precise error code could also be displayed somewhere for editors.
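To illustrate, here is a sketch of the two methods working on a single URL-to-code map. `PageSketch` is a stand-in class, not the real Page. Counting every code other than 200 as dead, and using null for an unchecked URL, are assumptions of this sketch.

```php
<?php
/**
 * Stand-in for the Page class, reduced to what the counts need.
 * $extlink maps each external URL to its last known HTTP code,
 * or null when the URL has not been checked yet (assumption).
 */
class PageSketch
{
    /** @var array<string, ?int> */
    private array $extlink;

    public function __construct(array $extlink)
    {
        $this->extlink = $extlink;
    }

    /** Count external links */
    public function countextlink(): int
    {
        return count($this->extlink);
    }

    /** Count external dead links: checked, and anything but 200 */
    public function countextdeadlink(): int
    {
        $dead = array_filter(
            $this->extlink,
            fn (?int $code): bool => $code !== null && $code !== 200
        );
        return count($dead);
    }
}

$page = new PageSketch(['https://sdf.com' => 200, 'https://qepeor.fr' => 404]);
```

Keeping one map costs a filter pass per count, but the raw codes stay available for later uses such as redirections.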

vincent-peugnet commented 3 weeks ago

I'm working on it on branch https://github.com/vincent-peugnet/wcms/tree/dead-link-checker

I'm adding HTML classes to links that are checked. There are two options:

1. it reaches a 200 OK (possibly after redirections)
2. it didn't reach a 200 OK

For now, I've chosen to name 1: ok and 2: dead.

Or maybe, I should follow the internal link syntax: exist and existnot. (see manual section about classes in links)

I'm curious about other points of view!

vincent-peugnet commented 3 weeks ago

I did some real-life testing (using my personal webpage https://246.eu/bac, which contains 50 links).

3 of them didn't receive a 200 response, although they load fine when visited from a browser.

I started investigating: for the 3 addresses, I received 403 responses (forbidden). It seems to be an anti-scraping strategy used by CDNs like Cloudflare. 🥲

Maybe I should just accept 403 as non-dead pages?
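A sketch of what such a rule could look like, with 401 tolerated alongside 403. The function name `is_dead` is hypothetical, and treating a missing response (null) as dead is an assumption.

```php
<?php
/**
 * Decide whether a checked URL should be flagged as dead.
 * 401 and 403 are tolerated because they often come from
 * anti-scraping protections rather than from a missing page.
 */
function is_dead(?int $code): bool
{
    if ($code === null) {
        return true; // assumption: no response at all counts as dead
    }
    // Anything other than a 200, or a tolerated 401/403, is dead.
    return !in_array($code, [200, 401, 403], true);
}
```

The trade-off is that a link sitting behind a real access restriction will never be reported, which seems acceptable compared to false positives on perfectly reachable pages.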

vincent-peugnet commented 3 weeks ago

I tried some workarounds to fool the CDN. I tried to:

1. add a user agent Mozilla/5.0
2. add a complex user agent Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148
3. add custom headers picked from Web scraping tutorials.

On a first URL, it didn't change anything.

On a second one, just setting the user agent (1) did the trick.

Both sent the following header:

'Server': 'cloudflare'
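For reference, a sketch of how a user agent can be passed to `get_headers` through a stream context. The helper name `link_check_context` is made up. Using a HEAD request and a 5 second timeout are assumptions of this sketch, not necessarily what the branch does.

```php
<?php
/**
 * Build a stream context for get_headers() with a browser-like
 * user agent and a timeout, so one slow host cannot stall a render.
 *
 * @return resource
 */
function link_check_context(string $userAgent = 'Mozilla/5.0', float $timeout = 5.0)
{
    return stream_context_create([
        'http' => [
            'method'     => 'HEAD',     // assumption: ask for headers only
            'user_agent' => $userAgent, // sent as the User-Agent header
            'timeout'    => $timeout,   // read timeout, in seconds
        ],
    ]);
}

// Usage (performs a real network request, so left commented out):
// $headers = @get_headers('https://example.org/', false, link_check_context());
```

Some servers answer HEAD differently from GET, so switching the method back to GET is worth trying when a result looks suspicious.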

vincent-peugnet commented 3 weeks ago

👀 some inspiration from the Sphinx project

vincent-peugnet / wcms

dead link checker 💀🔩 #322
