Open bswhite opened 4 years ago
We can probably use the httr (R) or requests (Python) library for this, or simply shell out to curl from either language.
Many NCBI links are a little tricky, though, because of how the search is performed. For example:
Checking a valid URL...
curl -IL https://www.ncbi.nlm.nih.gov/pubmed/?term=28053997
... returns a success code (200).
curl -I https://www.ncbi.nlm.nih.gov/pubmed/?term=28053997
HTTP/2 200
date: Thu, 14 May 2020 16:36:04 GMT
server: Apache
strict-transport-security: max-age=31536000; includeSubDomains; preload
referrer-policy: origin-when-cross-origin
content-security-policy: upgrade-insecure-requests
cache-control: private
ncbi-phid: CE8867BDEBD703210000000001B200CA.m_20
ncbi-sid: CE8867BDEBD73741_0434SID
content-type: text/html; charset=UTF-8
set-cookie: ncbi_sid=CE8867BDEBD73741_0434SID; domain=.nih.gov; path=/; expires=Fri, 14 May 2021 16:36:04 GMT
set-cookie: WebEnv=1jWBvejKuqmKqJfkZHBaeXOuSBx5l_EJncFD1GqqWiEykyeRfwR9HNo-kXpt7VAVEt0xc-UPnKdWeONOrPKux5QW3TQ20ogwJYlkT%40CE8867BDEBD73741_0434SID; domain=.nlm.nih.gov; path=/; expires=Fri, 15 May 2020 00:36:04 GMT
x-ua-compatible: IE=Edge
x-xss-protection: 1; mode=block
A clearly invalid URL...
curl -IL https://www.ncbi.nlm.nih.gov/puffmed/?term=28053997
... returns a "not found" code (404):
HTTP/2 404
date: Thu, 14 May 2020 16:40:18 GMT
server: Apache
strict-transport-security: max-age=31536000; includeSubDomains; preload
referrer-policy: origin-when-cross-origin
content-security-policy: upgrade-insecure-requests
accept-ranges: bytes
vary: Accept-Encoding
x-ua-compatible: IE=Edge
x-xss-protection: 1; mode=block
content-type: text/html
However, a URL with a bogus accession term ...
curl -I https://www.ncbi.nlm.nih.gov/pubmed/?term=2805foobar3997
... looks like it works?
HTTP/2 200
date: Thu, 14 May 2020 16:44:26 GMT
server: Apache
strict-transport-security: max-age=31536000; includeSubDomains; preload
referrer-policy: origin-when-cross-origin
content-security-policy: upgrade-insecure-requests
cache-control: private
ncbi-phid: CE8841A6EBD6DF2100000000040201E0.m_19
ncbi-sid: CE8841A6EBD756A1_1026SID
content-type: text/html; charset=UTF-8
set-cookie: ncbi_sid=CE8841A6EBD756A1_1026SID; domain=.nih.gov; path=/; expires=Fri, 14 May 2021 16:44:26 GMT
set-cookie: WebEnv=1X37bUQ6sKhr7yeZU_F298Y9GELtxmOYqXhEhiHSA7A-QfRmRhyprL-WRFN-b2CGY_q6EWsTlNDO8PR9ys31PQzOXx1bCweD5eU5Q%40CE8841A6EBD756A1_1026SID; domain=.nlm.nih.gov; path=/; expires=Fri, 15 May 2020 00:44:27 GMT
x-ua-compatible: IE=Edge
x-xss-protection: 1; mode=block
But if you dig into the body of the response, you'll see the title "No items found." So for all NCBI links we might need to check the response body, not just the status code...
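A rough sketch of what that body check could look like in Python. It uses the standard library's urllib rather than requests so that it has no dependencies. The function names, the result labels, and the exact marker string are all assumptions for illustration; the marker is based only on the "No items found" title seen above and may not hold for every NCBI page.

```python
import urllib.error
import urllib.request

# Assumption: an empty NCBI search shows this phrase somewhere in the HTML,
# as observed in the page title for the bogus accession term.
NO_RESULTS_MARKER = "No items found"


def classify_link(status_code, body):
    """Classify a fetched page as 'http_error', 'empty_result', or 'ok'."""
    if status_code >= 400:
        return "http_error"
    # A 200 alone is not enough for search-style URLs: inspect the body too.
    if NO_RESULTS_MARKER.lower() in body.lower():
        return "empty_result"
    return "ok"


def check_url(url, timeout=10):
    """Fetch url with a GET (not HEAD, unlike curl -I) so the body is available."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return classify_link(response.status, body)
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the status code is on the exception.
        return classify_link(err.code, "")
    except urllib.error.URLError:
        # DNS failure, refused connection, timeout during connect, etc.
        return "unreachable"
```

Calling `check_url` on each of the three URLs above would be the way to try it out. Keeping `classify_link` separate from the network call means the rule can be adjusted, or extended with other "no results" markers, without touching the fetching code.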
@mc2-center/triage-team another nice-to-have. Maybe we can add this to our high-level Jira board too.
Added to Jira here: https://sagebionetworks.jira.com/browse/CPO-288
@andrewelamb is there a way to check that external links are valid / not broken?
This includes: