Open opoudjis opened 8 months ago
if you cannot do this, @andrew2net, we can give this to @alexeymorozov . As long as it's not me :)
@opoudjis it would be helpful if you give this to someone else.
Fair. @alexeymorozov ?
There are many things to fix here:
- We should only check the HEAD status response instead of actually retrieving the whole URL response; the response body can be 100MB.
Already being done
- A URI is not meant to be resolved. ONLY a URL is meant to be accessible.
Fine.
- There are MANY sites that require browser access (cookies, etc) or JS.
And like I said, a URL returning HTTP 403 Forbidden has to be assumed to be valid.
- It is very difficult to check whether a document is still available, it probably requires some intelligent mechanism to determine. This is a confidence issue. Maybe the page says "Not found" but the status code is 200...
It's a best effort check. The really authoritative way of doing this is for the author to insert a manual accessed date, to indicate that they have physically sighted it.
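The HEAD-only approach in the first point above could be sketched like this; this is a minimal illustration, not relaton-render's actual code, and the method name and timeout parameter are made up for the example:

```ruby
require "net/http"
require "uri"

# Hypothetical sketch: Net::HTTP#head fetches only the status line and
# headers, so a 100MB response body is never downloaded.
def head_status(url, timeout: 5)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == "https",
                  open_timeout: timeout, read_timeout: timeout) do |http|
    http.head(uri.request_uri).code # status code as a string, e.g. "200", "403"
  end
end
```

The string code returned here is what the checks discussed below (`res.code[0] != "4"` and friends) would be applied to.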
No developer is currently working on this. @ronaldtse This needs to be addressed.
As a result of https://github.com/metanorma/metanorma-iso/issues/1114, I have enabled code that was previously deactivated, to check whether a URI in a bibliographic entry is active or not. This is done in case the bibliography requires a date last accessed to be supplied, and it hasn't been already.
https://github.com/relaton/relaton-render/blob/main/lib/relaton/render/general/uri.rb
The problem is, it isn't working well, and it needs someone who understands fetching better than me to fix it.
For example, https://dl.acm.org/doi/10.1145/3425898.3426958 seems to be caught in an infinite loop of redirections.
It is returning HTTP 302 Found, which is a redirection. The problem is, it's a redirection to a cookie query, https://dl.acm.org/doi/10.1145/3425898.3426958?cookieSet=1, and that ends up in an infinite loop. Clearly

`res.is_a?(Net::HTTPRedirection)`

is naive, but TBH I don't have the headspace to make this robust.

PDFs are routinely returning false on

`res.code[0] != "4"`

; so http://www.tandfonline.com/doi/abs/10.1111/j.1467-8306.2004.09402005.x is returning HTTP 301 Moved Permanently, which really is a redirect, and its `res["location"]` is still https://www.tandfonline.com/doi/abs/10.1111/j.1467-8306.2004.09402005.x. When I access that, I get HTTP 403 Forbidden. But I expect to get HTTP 403 for a paywalled resource! The gem should not be reporting a failure there.
So this needs a smarter treatment of possible HTTP codes. Really, the only cases where a URI is invalid are (I think) 404 and 50x. But I don't want to do this; I want someone else to do this, who is familiar with HTTP codes, paywalled content, and redirects.
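The permissive policy described here could look something like the following sketch. The two guards against the dl.acm.org cookie loop (a redirect cap and a visited set) and all the names are my own illustration, not the gem's API; only 404 and 5xx are treated as dead, so a paywalled 403 passes:

```ruby
require "net/http"
require "uri"
require "set"

# Hypothetical classification: only 404 and 50x mean the URI is invalid.
# 403 (paywall), 301/302 (redirects), 200 etc. all count as alive.
def dead_code?(code)
  code == "404" || code.start_with?("5")
end

# Follow redirects with a cap and a visited set; on a loop or exhausted
# cap, give the URL the benefit of the doubt rather than reporting failure.
def link_alive?(url, limit: 5, seen: Set.new)
  return true if limit.zero? || seen.include?(url)
  seen << url
  uri = URI.parse(url)
  res = Net::HTTP.start(uri.host, uri.port,
                        use_ssl: uri.scheme == "https") do |http|
    http.head(uri.request_uri)
  end
  if res.is_a?(Net::HTTPRedirection) && res["location"]
    # URI.join resolves relative Location headers against the current URL
    link_alive?(URI.join(url, res["location"]).to_s, limit: limit - 1, seen: seen)
  else
    !dead_code?(res.code)
  end
end
```

With this policy, the tandfonline 301-then-403 chain above would be reported as alive, and the cookieSet loop would terminate at the visited-set check.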
I do not agree with Ronald that a new gem is required, but I'm asking that someone else handle this. For now, I'm doing a hotfix that passes all URIs it sees.