ome / ome-model

OME model (specification, code generator, implementation)

Linkcheck fixes #116

Closed sbesson closed 4 years ago

sbesson commented 4 years ago

Background: NCBI PMC applies User-Agent-based filtering to HEAD requests and rejects the default User-Agent set by Sphinx:

>>> import requests
>>> requests.head('https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6774793/')
<Response [403]>
>>> requests.head('https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6774793/', headers={'User-Agent': 'Mozilla/5.0'})
<Response [200]>
>>> requests.head('https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6774793/', headers={'User-Agent': 'Sphinx/3.1.2 requests/2.23.0 python/3.7.6'})
<Response [403]>
>>> requests.head('https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6774793/', headers={'User-Agent': 'Mozilla/4.0'})
<Response [200]>
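
The comparison above can be repeated for any URL with a small helper. This is only a sketch: it uses the standard library's urllib instead of requests, the names head_status and compare_user_agents are made up for illustration, and the server's behaviour may have changed since these results were recorded.

```python
import urllib.error
import urllib.request


def head_status(url, user_agent, timeout=10):
    """Send a HEAD request with the given User-Agent and return the HTTP status code."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses are raised as exceptions; report their status code instead.
        return error.code


def compare_user_agents(url, user_agents, fetch=head_status):
    """Map each User-Agent string to the status code obtained for the URL."""
    return {agent: fetch(url, agent) for agent in user_agents}
```

Calling compare_user_agents with the PMC URL and the agent strings used above should show which agents are rejected; the fetch parameter exists only so the helper can be exercised without network access.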

sbesson commented 4 years ago

Looking at the help pages for PMC, the source of the issue is probably that the Sphinx requests are considered a violation of https://www.ncbi.nlm.nih.gov/pmc/about/copyright/, especially: "Crawlers and other automated processes may NOT be used to systematically retrieve batches of articles from the PMC web site. Bulk downloading of articles from the main PMC web site, in any way, is prohibited because of copyright restrictions."

An alternative solution is probably to add https://www.ncbi.nlm.nih.gov/pmc/articles/.* to the linkcheck ignore list and trust that the canonical URLs will not be broken by the resource.
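
A minimal sketch of what that alternative could look like in conf.py, assuming the Sphinx linkcheck_ignore option (a list of regular expressions matched against each URI). The is_ignored helper is not part of Sphinx; it is included only to illustrate which URLs the pattern would skip.

```python
import re

# Hypothetical conf.py fragment: skip PMC article pages when running the linkcheck builder.
linkcheck_ignore = [
    r"https://www\.ncbi\.nlm\.nih\.gov/pmc/articles/.*",
]


def is_ignored(url, patterns=linkcheck_ignore):
    """Illustrative helper (not a Sphinx API): True if any pattern matches from the start of the URL."""
    return any(re.match(pattern, url) for pattern in patterns)
```

With this pattern, article links such as https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6774793/ would be skipped, while other NCBI pages would still be checked.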

sbesson commented 4 years ago

See https://merge-ci.openmicroscopy.org/jenkins/job/OME-MODEL-linkcheck/7/

joshmoore commented 4 years ago

Changes (though remarkable) all make sense. Job looks good. :+1: