relaton / relaton-render

Gem to render ISO 690 XML serialisation into HTML
BSD 2-Clause "Simplified" License

Robust checker of whether a URI is live or not #50

Open opoudjis opened 8 months ago

opoudjis commented 8 months ago

As a result of https://github.com/metanorma/metanorma-iso/issues/1114, I have enabled previously deactivated code that checks whether a URI in a bibliographic entry is live. The check is done in case the bibliography requires a last-accessed date to be supplied, and one has not been given already.

https://github.com/relaton/relaton-render/blob/main/lib/relaton/render/general/uri.rb

The problem is, it isn't working well, and it needs someone who understands fetching better than I do to fix it.

For example:

      def url_exist?(url_string)
        url = URI.parse(url_string)
        url.host or return true # allow file URLs
        res = access_url(url) or return false
        res.is_a?(Net::HTTPRedirection) and return url_exist?(res["location"])
        res.code[0] != "4"
      rescue Errno::ENOENT, SocketError
        false # false if can't find the server
      end

seems to be in an infinite loop of redirections triggered by https://dl.acm.org/doi/10.1145/3425898.3426958

That URL returns HTTP 302 Found, a redirection to a cookie query, https://dl.acm.org/doi/10.1145/3425898.3426958?cookieSet=1, which redirects back again, so the recursion ends up in an infinite loop. Clearly the bare res.is_a?(Net::HTTPRedirection) recursion is naive, but TBH I don't have the headspace to make this robust.
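A minimal sketch of loop-safe redirect following, assuming the fetcher can be injected as a lambda; the names resolve_url and MAX_REDIRECTS are illustrative, not part of the gem. A hop limit plus a visited-URL set makes a ?cookieSet=1-style bounce terminate instead of recursing forever:

```ruby
require "net/http"
require "uri"

MAX_REDIRECTS = 5

# Follow redirects, refusing to revisit a URL or exceed a hop budget.
# `fetch` is a callable returning a Net::HTTPResponse, so the loop logic
# can be exercised without network access.
def resolve_url(url_string, fetch:, seen: [])
  return :too_many_redirects if seen.size > MAX_REDIRECTS
  return :redirect_loop if seen.include?(url_string)

  res = fetch.call(url_string)
  return res unless res.is_a?(Net::HTTPRedirection)

  # Follow the Location header, remembering where we have been
  resolve_url(res["location"], fetch: fetch, seen: seen + [url_string])
end
```

In url_exist? this would be called with something like fetch: ->(u) { Net::HTTP.get_response(URI(u)) }, and :redirect_loop / :too_many_redirects treated as a dead URI.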

PDFs are routinely returning false on the res.code[0] != "4" check as well. For example:

http://www.tandfonline.com/doi/abs/10.1111/j.1467-8306.2004.09402005.x is returning HTTP 301 Moved Permanently, which really is a redirect, and its res["location"] is still https://www.tandfonline.com/doi/abs/10.1111/j.1467-8306.2004.09402005.x. When I access that, I get HTTP 403 Forbidden. But I expect to get HTTP 403 for a paywalled resource! The gem should not be reporting a failure there.

So this needs a smarter treatment of possible HTTP codes. Really, the only cases where a URI should be treated as invalid are (I think) 404 and 5xx. But I don't want to do this myself; I want someone else to do it, someone who is familiar with HTTP codes, paywalled content, and redirects.
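The policy described above could be sketched as a pure predicate (dead_status? is an illustrative name, not gem API): only 404 and the 5xx family mark a URI as dead, while 403 on paywalled content, redirects, and everything else count as live.

```ruby
# Treat a URI as dead only on 404 or a 5xx server error; a 403 from a
# paywall, a 3xx redirect, or any other code counts as live.
# Accepts the String codes Net::HTTP returns (e.g. "403") or Integers.
def dead_status?(code)
  code = code.to_i
  code == 404 || (500..599).cover?(code)
end
```

The final line of url_exist? would then become !dead_status?(res.code). Whether 410 Gone should also count as dead is a judgment call not settled in this thread.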

I do not agree with Ronald that a new gem is required, but I'm asking that someone else handles this. For now, I'm doing a hotfix that passes all URIs it sees.

opoudjis commented 8 months ago

If you cannot do this, @andrew2net, we can give this to @alexeymorozov. As long as it's not me :)

andrew2net commented 8 months ago

@opoudjis it would be helpful if you gave this to someone else.

opoudjis commented 8 months ago

Fair. @alexeymorozov ?

ronaldtse commented 8 months ago

There are many things to fix here:

opoudjis commented 8 months ago

There are many things to fix here:

  • We should only issue a HEAD request and check the response status, instead of retrieving the whole URL response. The response body can be 100 MB.

Already being done

  • A URI is not meant to be resolved. ONLY a URL is meant to be accessible.

Fine.

  • There are MANY sites that require browser access (cookies, etc) or JS.

And like I said, HTTP 403 Forbidden has to be assumed to be a valid URL.

  • It is very difficult to check whether a document is still available, it probably requires some intelligent mechanism to determine. This is a confidence issue. Maybe the page says "Not found" but the status code is 200...

It's a best effort check. The really authoritative way of doing this is for the author to insert a manual accessed date, to indicate that they have physically sighted it.
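The HEAD-only probe from the first point above might look like this sketch (head_status is an illustrative name; timeout values are assumptions). A HEAD request returns only the status line and headers, so even a 100 MB PDF costs almost nothing to check, and the timeouts keep an unresponsive server from hanging the build:

```ruby
require "net/http"
require "uri"

# Issue a HEAD request and return the Net::HTTPResponse, so the caller
# can inspect the status code without ever downloading a body.
def head_status(url_string)
  url = URI.parse(url_string)
  Net::HTTP.start(url.host, url.port,
                  use_ssl: url.scheme == "https",
                  open_timeout: 5, read_timeout: 5) do |http|
    http.head(url.request_uri)
  end
end
```

Note that some servers answer HEAD with 405 or behave differently than for GET, so a production version might fall back to a ranged GET when HEAD is refused.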

opoudjis commented 1 month ago

No developer is currently working on this. @ronaldtse This needs to be addressed.