Open vincent-peugnet opened 1 year ago
I can imagine a more general idea:
Instead of just storing dead links, I could store all external links with their associated HTTP response code:
{
"extlink": {
"https://sdf.com": 200,
"https://qepeor.fr": 404
}
}
So there will be an external cache. What I imagine is a single JSON file with URLs as keys:
"https://apodo.com/dfdf.html": {
"date": "2023-06-23",
"response": 200
}
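A minimal sketch of how such a cache file could be read and updated. The function names and the idea of rewriting the whole file on each update are assumptions for illustration, not what the branch actually does.

```php
<?php
// Hypothetical helpers: names and file handling are assumptions.

/** Load the URL cache; a missing or invalid file yields an empty cache. */
function loadLinkCache(string $file): array
{
    if (!is_file($file)) {
        return [];
    }
    $data = json_decode((string) file_get_contents($file), true);
    return is_array($data) ? $data : [];
}

/** Record one check result under its URL, then write the whole cache back. */
function storeLinkResult(string $file, string $url, int $code, string $date): void
{
    $cache = loadLinkCache($file);
    $cache[$url] = ['date' => $date, 'response' => $code];
    file_put_contents(
        $file,
        json_encode($cache, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES)
    );
}
```

With URLs as keys, looking up a link is a single array access, and a URL shared by several pages is only stored once.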
This adds a lot of new data to pages!
I could use it to add new columns.
Their count would be accessed through two new Page methods:
$page->countextlink(); // To count external links
$page->countextdeadlink(); // To count external dead links
Right now, if I wanted to do this more efficiently, I should store links in two different arrays (valid and dead links), but in the future I may also want to take advantage of other codes, like redirections. The code could also be displayed somewhere so editors can see the precise error code.
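As a rough illustration of the single-map approach, the two counters could look like this. The Page class below is a stand-in written for this example, not the real wcms class, and the $extlink property layout is assumed from the JSON shown earlier.

```php
<?php
// Sketch only: a stand-in Page class, not the real wcms Page.
class Page
{
    /** @var array<string,int> external URL => last HTTP response code */
    private array $extlink;

    public function __construct(array $extlink = [])
    {
        $this->extlink = $extlink;
    }

    /** Count all external links. */
    public function countextlink(): int
    {
        return count($this->extlink);
    }

    /** Count external links whose stored code is not 200. */
    public function countextdeadlink(): int
    {
        return count(array_filter(
            $this->extlink,
            fn (int $code): bool => $code !== 200
        ));
    }
}
```

Filtering on each call is slower than keeping two arrays, but it keeps the full response code available for later use.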
I'm working on it on branch https://github.com/vincent-peugnet/wcms/tree/dead-link-checker
I'm adding HTML classes to links that are checked. There are two possible outcomes:
1. 200 ok (could be after redirections)
2. not 200 ok
For now, I've chosen to name 1: ok and 2: dead.
Or maybe I should follow the internal link syntax: exist and existnot (see the manual section about classes in links).
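To make the naming question concrete, here is a small sketch of how a checked link might be rendered with the current ok / dead names. Both function names are hypothetical, and the real renderer almost certainly outputs more attributes than this.

```php
<?php
// Hypothetical helpers; class names follow the current 'ok' / 'dead' choice.

/** Map a response code to the HTML class of the link. */
function extLinkClass(int $code): string
{
    return $code === 200 ? 'ok' : 'dead';
}

/** Render an external link carrying the class for its last response code. */
function renderExtLink(string $url, string $text, int $code): string
{
    return sprintf(
        '<a href="%s" class="%s">%s</a>',
        htmlspecialchars($url, ENT_QUOTES),
        extLinkClass($code),
        htmlspecialchars($text, ENT_QUOTES)
    );
}
```

Switching to the internal link naming would only mean changing the two strings returned by extLinkClass().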
I'm curious about other points of view!
I did a real-life test (using my personal webpage https://246.eu/bac, which contains 50 links).
3 of them didn't receive a 200 response, although they work fine when visited from a browser.
I started investigating: for the 3 addresses, I received 403 responses (forbidden). It seems to be an anti-scraping strategy used by CDNs like Cloudflare. 🥲
Maybe I should just accept 403 as non-dead pages?
I tried some workarounds to fool the CDN. I tried setting the user agent to:
1. Mozilla/5.0
2. Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148
On the first URL, it didn't change anything.
On the second one, just setting user agent (1) did the job.
Both sent the following header:
'Server': 'cloudflare'
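For reference, a sketch of how a custom user agent can be passed to get_headers() through a stream context. The function names and the 5-second timeout are assumptions, and as noted above, this does not get past every CDN.

```php
<?php
// Sketch: function names and the timeout value are assumptions.

/** Build http context options that send the given User-Agent. */
function userAgentContextOptions(string $userAgent): array
{
    return [
        'http' => [
            'user_agent' => $userAgent,
            'timeout' => 5, // seconds; arbitrary value for this sketch
        ],
    ];
}

/** Fetch the raw response headers of $url, or false on failure. */
function fetchHeaders(string $url, string $userAgent = 'Mozilla/5.0')
{
    $context = stream_context_create(userAgentContextOptions($userAgent));
    return @get_headers($url, false, $context);
}
```

The 'http' options key also applies to https:// URLs, so the same context works for both schemes.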
Check the external links and store them somewhere as a cache. Add a class to the link tags, something like dead or exist_not (which is actually used for internal links). Just check the server response using the get_headers PHP function.
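One detail worth sketching: when a URL redirects, get_headers() returns the status lines of every hop in a single array, so the code to keep is the last one. finalStatusCode() is a hypothetical helper name for this example.

```php
<?php
// Sketch: finalStatusCode() is a hypothetical helper name.

/**
 * Return the status code of the last response found in a get_headers()
 * result (the one reached after any redirections), or null if none.
 */
function finalStatusCode(array $headers): ?int
{
    $code = null;
    foreach ($headers as $line) {
        if (is_string($line) && preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// Usage (performs a real request, so it is left commented out):
// $headers = @get_headers('https://sdf.com');
// $code = $headers === false ? null : finalStatusCode($headers);
```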
Cache entries should expire after about one month.
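A possible expiry check, assuming each cache entry carries a "date" field in Y-m-d format as in the JSON example. The function name and the 30-day default are assumptions standing in for "about one month".

```php
<?php
// Sketch: function name and the 30-day default are assumptions.

/** Tell whether a cache entry is older than $maxAgeDays and needs a new check. */
function isCacheEntryExpired(array $entry, DateTimeImmutable $now, int $maxAgeDays = 30): bool
{
    $checked = DateTimeImmutable::createFromFormat('!Y-m-d', (string) ($entry['date'] ?? ''));
    if ($checked === false) {
        return true; // missing or malformed date: check the URL again
    }
    $ageSeconds = $now->getTimestamp() - $checked->getTimestamp();
    return $ageSeconds > $maxAgeDays * 86400;
}
```

Entries without a usable date are treated as expired, so a damaged cache simply triggers new checks.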
Dead links could be stored in pages and listed in the home view, or the associated pages could be stored with the cache, so that editors could explore it.
Accept 401 and 403 as non-dead URLs (reacting to CDN anti-scraping practices)
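This rule could be expressed as a small predicate with a configurable list of tolerated codes. The function name and the default list are assumptions drawn from the item above, not existing wcms code.

```php
<?php
// Sketch: function name and default tolerated list are assumptions.

/** Decide whether a response code should mark the link as dead. */
function isDeadResponse(int $code, array $tolerated = [401, 403]): bool
{
    if (in_array($code, $tolerated, true)) {
        return false; // probably blocked by a CDN, not actually dead
    }
    return $code !== 200;
}
```

Keeping the tolerated codes in a parameter makes it easy to extend the list later if other anti-scraping responses turn up.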