sbesson closed 4 years ago
Looking at the help pages for PMC, the source of the issue is probably that the Sphinx requests are considered to violate https://www.ncbi.nlm.nih.gov/pmc/about/copyright/, especially this passage: "Crawlers and other automated processes may NOT be used to systematically retrieve batches of articles from the PMC web site. Bulk downloading of articles from the main PMC web site, in any way, is prohibited because of copyright restrictions."
An alternative solution is probably to add `https://www.ncbi.nlm.nih.gov/pmc/articles/.*` to the ignore list and trust that the canonical URL will not be broken by the resource.
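For illustration, here is a minimal sketch of what that could look like, assuming the "ignore list" is Sphinx's `linkcheck_ignore` setting in `conf.py`. The dots in the host name are escaped here, which differs slightly from the pattern quoted above. The `is_ignored` helper is hypothetical and not part of Sphinx; it only approximates how a URL would be matched against the list.

```python
# conf.py (sketch, assuming the Sphinx linkcheck builder is in use)
import re

# Regular expressions for URLs that linkcheck should skip.
linkcheck_ignore = [
    r"https://www\.ncbi\.nlm\.nih\.gov/pmc/articles/.*",
]


def is_ignored(url, patterns=linkcheck_ignore):
    """Hypothetical helper: True if any pattern matches the start of the URL."""
    return any(re.match(pattern, url) for pattern in patterns)
```

With this in place, PMC article links would no longer be requested at all, so a genuinely broken article link would go unnoticed; that is the trade-off accepted above.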
Changes (though remarkable) all make sense. Job looks good. :+1:
Background: NCBI PMC has some user-agent-based filtering in place for HEAD requests and rejects the default agent set by Sphinx: