vincent-peugnet / wcms

⧉ light-weight experimental wiki
http://w.club1.fr
GNU Affero General Public License v3.0

dead link checker 💀🔩 #322

Open vincent-peugnet opened 1 year ago

vincent-peugnet commented 1 year ago

Check the external links and cache the results somewhere. Add a class to the link tags, something like dead or exist_not (which is actually used for internal links).

Just check the server response using PHP's get_headers function.
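A minimal sketch of how the status code could be read from what get_headers returns. The function names finalStatusCode and checkUrl are hypothetical, not part of W. When redirects are followed, get_headers returns one status line per hop, so the sketch keeps the last one.

```php
<?php
/**
 * Return the last HTTP status code found in the array returned by get_headers().
 * When redirects are followed, the array holds one status line per hop,
 * so the last one describes the final response.
 */
function finalStatusCode(array $headers): ?int
{
    $code = null;
    foreach ($headers as $line) {
        // In associative mode a repeated header is an array; only strings can be status lines.
        if (is_string($line) && preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

/**
 * Hypothetical wrapper: null means no response at all (DNS failure, timeout...).
 */
function checkUrl(string $url): ?int
{
    $headers = @get_headers($url); // get_headers() returns false on failure
    return $headers === false ? null : finalStatusCode($headers);
}
```

Keeping the parsing separate from the network call makes it possible to test the logic with hand-written header arrays.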

Cache entries should expire after about one month.

Dead links could be stored in pages and listed in the home view. Alternatively, the associated pages could be stored with the cache, so that editors can explore it.

vincent-peugnet commented 1 year ago

I can imagine a more general idea:

Instead of just storing dead links, I could store all external links with their associated HTTP response code:

{
    "extlink": {
        "https://sdf.com": 200,
        "https://qepeor.fr": 404
    }
}

But at this point, why not also store the date it was checked?

So there will be an external cache. What I imagine is a single JSON file with URLs as keys.

"https://apodo.com/dfdf.html": {
    "date": "2023-06-23",
    "response": 200
}
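A rough sketch of how such an entry could be tested for expiry, assuming the "date" field uses the YYYY-MM-DD format shown above. The function name isFresh and the $maxAgeDays parameter are hypothetical; 30 days stands in for "around one month".

```php
<?php
/**
 * Decide whether a cached entry is still usable.
 * $entry is expected to look like ['date' => 'YYYY-MM-DD', 'response' => int].
 * $maxAgeDays is an assumed setting, not an existing W option.
 */
function isFresh(array $entry, DateTimeImmutable $now, int $maxAgeDays = 30): bool
{
    // The leading "!" resets the time part, so the date is read as midnight.
    $checked = DateTimeImmutable::createFromFormat('!Y-m-d', $entry['date'] ?? '');
    if ($checked === false) {
        return false; // missing or malformed date: treat the entry as expired
    }
    return $checked->modify("+{$maxAgeDays} days") > $now;
}

// Decode a cache file with URLs as keys and look one entry up.
$cache = json_decode('{"https://apodo.com/dfdf.html": {"date": "2023-06-23", "response": 200}}', true);
$entry = $cache['https://apodo.com/dfdf.html'];
```

An entry that is not fresh would simply be checked again and overwritten with the new date and response code.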

Using this in home view

This adds a lot of new data to pages!

I could use it to add new columns:

Their counts would be accessed through two new Page methods:

$page->countextlink(); // To count external links
$page->countextdeadlink(); // To count external dead links

Right now, if I wanted to do this more efficiently, I would store links in two different arrays (valid and dead links). But in the future I may also want to take advantage of other codes, like redirections. The code could also be displayed somewhere so editors can see the precise error code.
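A sketch of the single-map alternative: one array of URL => status code, from which both counts are derived. PageLinks is a hypothetical stand-in for the real Page class, countbycode is an invented extra method, and treating anything other than 200 as dead is an assumption.

```php
<?php
/**
 * Hypothetical stand-in for the Page class.
 */
final class PageLinks
{
    /** @var array<string, int|null> external URL => final HTTP status code (null: no response) */
    private array $extlink;

    public function __construct(array $extlink)
    {
        $this->extlink = $extlink;
    }

    /** Count all external links. */
    public function countextlink(): int
    {
        return count($this->extlink);
    }

    /** Count external links whose final response was not 200 (assumed definition of "dead"). */
    public function countextdeadlink(): int
    {
        return count(array_filter($this->extlink, fn ($code): bool => $code !== 200));
    }

    /** Invented helper: number of links per status code; 0 stands for "no response". */
    public function countbycode(): array
    {
        return array_count_values(array_map(fn ($code) => $code ?? 0, $this->extlink));
    }
}

$page = new PageLinks(['https://sdf.com' => 200, 'https://qepeor.fr' => 404]);
```

Filtering on each call is slower than keeping two arrays, but the full status code stays available for later use.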

vincent-peugnet commented 3 weeks ago

I'm working on it on branch https://github.com/vincent-peugnet/wcms/tree/dead-link-checker

I'm adding HTML classes to links that have been checked. There are two cases:

  1. it reaches a 200 OK (possibly after redirections)
  2. it didn't reach a 200 OK

For now, I've chosen to name 1: ok and 2: dead.

Or maybe, I should follow the internal link syntax: exist and existnot. (see manual section about classes in links)
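To make the class question concrete, here is a sketch of how checked links could be tagged with DOMDocument, using the provisional ok and dead names. The function name addLinkClasses and the $codes map (URL => status code) are assumptions; this is not the code on the branch.

```php
<?php
/**
 * Append "ok" or "dead" to the class attribute of every <a> whose href
 * appears in $codes (URL => final HTTP status code).
 */
function addLinkClasses(DOMDocument $dom, array $codes): void
{
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (!array_key_exists($href, $codes)) {
            continue; // internal link or not checked yet: leave untouched
        }
        $class = $codes[$href] === 200 ? 'ok' : 'dead';
        // Keep any class already present on the link.
        $a->setAttribute('class', trim($a->getAttribute('class') . ' ' . $class));
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<p><a class="external" href="https://sdf.com">a</a><a href="https://qepeor.fr">b</a><a href="other">c</a></p>');
addLinkClasses($dom, ['https://sdf.com' => 200, 'https://qepeor.fr' => 404]);
```

Switching to exist and existnot would only change the two strings in the ternary.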

I'm curious about other points of view!

vincent-peugnet commented 3 weeks ago

I did a real-life test (using my personal webpage https://246.eu/bac, which contains 50 links).

3 of them didn't receive a 200 response, although they load fine when visited from a browser.

I started investigating: for these 3 addresses, I received 403 (Forbidden) responses. It seems to be an anti-scraping strategy used by CDNs like Cloudflare. 🥲

Maybe I should just accept 403 as non-dead pages?
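One way to express that idea is a list of tolerated status codes. The constant TOLERATED_CODES and the function isDead are hypothetical names, and whether 403 belongs in the list is exactly the open question.

```php
<?php
// Status codes that are not counted as dead (assumed list; 403 is the debatable one).
const TOLERATED_CODES = [200, 403];

/**
 * A link is considered dead when its final status code is not tolerated.
 * null (no response at all) is always dead.
 */
function isDead(?int $code, array $tolerated = TOLERATED_CODES): bool
{
    return !in_array($code, $tolerated, true);
}
```

The drawback is that a page that really is forbidden to everyone would no longer be reported.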

vincent-peugnet commented 3 weeks ago

I tried some workarounds to fool the CDN. I tried to:

  1. add a user agent Mozilla/5.0
  2. add a complex user agent Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148
  3. add custom headers picked from web scraping tutorials.

On the first URL, it didn't change anything.

On the second one, just setting the user agent (1) did the job.
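For reference, a sketch of how a user agent and extra headers can be passed to get_headers through a stream context. The function name linkCheckContext, the timeout value and the example header are assumptions.

```php
<?php
/**
 * Build a stream context carrying a custom User-Agent and optional extra headers.
 * The 10 second timeout is an illustrative choice.
 *
 * @param string[] $extraHeaders lines such as "Accept-Language: en"
 * @return resource
 */
function linkCheckContext(string $userAgent, array $extraHeaders = [])
{
    return stream_context_create([
        'http' => [
            'user_agent' => $userAgent,
            'header'     => $extraHeaders,
            'timeout'    => 10,
        ],
    ]);
}

$context = linkCheckContext('Mozilla/5.0', ['Accept-Language: en']);

// Usage (performs a network request, so it is left commented out here):
// $headers = get_headers('https://example.org/', false, $context);
```

The third argument of get_headers accepts such a context since PHP 7.1.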

Both sent the following header:

'Server': 'cloudflare'

vincent-peugnet commented 3 weeks ago

👀 some inspiration from the Sphinx project