mc2-center / mc2-center-dcc

Data coordination resources for CCKP (and MC2 in general)

automated validation of external links #2

Open bswhite opened 4 years ago

bswhite commented 4 years ago

@andrewelamb is there a way to check that external links are valid / not broken?

This includes:

  1. The Pubmed Link for publications.
  2. Each entry in the comma-separated list of External Link for datasets.
  3. The title (currently) of the tools card -- though eventually we probably want this title to link to a tools detail page for consistency with the behavior of clicking on other card titles.

jaeddy commented 4 years ago

We can probably do something with the httr (R) or requests (Python) library, or even just shell out to curl from either language. It's a little tricky for a lot of NCBI links, though, because a search URL can return a success code even when the search itself finds nothing. For example:

Checking a valid URL...

curl -IL https://www.ncbi.nlm.nih.gov/pubmed/?term=28053997

... returns a success code (200):

curl -I https://www.ncbi.nlm.nih.gov/pubmed/?term=28053997
HTTP/2 200
date: Thu, 14 May 2020 16:36:04 GMT
server: Apache
strict-transport-security: max-age=31536000; includeSubDomains; preload
referrer-policy: origin-when-cross-origin
content-security-policy: upgrade-insecure-requests
cache-control: private
ncbi-phid: CE8867BDEBD703210000000001B200CA.m_20
ncbi-sid: CE8867BDEBD73741_0434SID
content-type: text/html; charset=UTF-8
set-cookie: ncbi_sid=CE8867BDEBD73741_0434SID; domain=.nih.gov; path=/; expires=Fri, 14 May 2021 16:36:04 GMT
set-cookie: WebEnv=1jWBvejKuqmKqJfkZHBaeXOuSBx5l_EJncFD1GqqWiEykyeRfwR9HNo-kXpt7VAVEt0xc-UPnKdWeONOrPKux5QW3TQ20ogwJYlkT%40CE8867BDEBD73741_0434SID; domain=.nlm.nih.gov; path=/; expires=Fri, 15 May 2020 00:36:04 GMT
x-ua-compatible: IE=Edge
x-xss-protection: 1; mode=block

A clearly invalid URL...

curl -IL https://www.ncbi.nlm.nih.gov/puffmed/?term=28053997

... returns a "not found" code (404):

HTTP/2 404
date: Thu, 14 May 2020 16:40:18 GMT
server: Apache
strict-transport-security: max-age=31536000; includeSubDomains; preload
referrer-policy: origin-when-cross-origin
content-security-policy: upgrade-insecure-requests
accept-ranges: bytes
vary: Accept-Encoding
x-ua-compatible: IE=Edge
x-xss-protection: 1; mode=block
content-type: text/html
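
The two status checks above could be scripted roughly as follows. This is only a sketch, not something the repo currently does: it uses Python's standard-library urllib instead of requests to avoid a dependency, and the names fetch_status and status_label are made up for illustration.

```python
import urllib.error
import urllib.request


def status_label(code):
    """Map an HTTP status code (or None) to a coarse link-health label."""
    if code is None:
        return "unreachable"  # no HTTP response at all (DNS failure, timeout, ...)
    if 200 <= code < 400:
        return "ok"
    return "broken"  # 4xx / 5xx, e.g. the 404 shown above


def fetch_status(url, timeout=10):
    """Return the final HTTP status code for url, or None if there was no response.

    urllib follows redirects on its own, so this is roughly comparable to
    `curl -IL`. The initial request is a HEAD request.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still available.
        return err.code
    except (urllib.error.URLError, OSError, ValueError):
        # Connection problems, timeouts, or a malformed URL.
        return None
```

Something like `status_label(fetch_status(url))` could then be run over each publication link and each entry of a dataset's comma-separated External Link list. Some servers reject HEAD requests, so a fallback to GET may be needed in practice.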

However, a URL with a bogus accession term ...

curl -I https://www.ncbi.nlm.nih.gov/pubmed/?term=2805foobar3997

... looks like it works?

HTTP/2 200
date: Thu, 14 May 2020 16:44:26 GMT
server: Apache
strict-transport-security: max-age=31536000; includeSubDomains; preload
referrer-policy: origin-when-cross-origin
content-security-policy: upgrade-insecure-requests
cache-control: private
ncbi-phid: CE8841A6EBD6DF2100000000040201E0.m_19
ncbi-sid: CE8841A6EBD756A1_1026SID
content-type: text/html; charset=UTF-8
set-cookie: ncbi_sid=CE8841A6EBD756A1_1026SID; domain=.nih.gov; path=/; expires=Fri, 14 May 2021 16:44:26 GMT
set-cookie: WebEnv=1X37bUQ6sKhr7yeZU_F298Y9GELtxmOYqXhEhiHSA7A-QfRmRhyprL-WRFN-b2CGY_q6EWsTlNDO8PR9ys31PQzOXx1bCweD5eU5Q%40CE8841A6EBD756A1_1026SID; domain=.nlm.nih.gov; path=/; expires=Fri, 15 May 2020 00:44:27 GMT
x-ua-compatible: IE=Edge
x-xss-protection: 1; mode=block

But if you dig into the body of the response, you'll see the title "No items found." So we might need to check the details for all NCBI links...
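
A minimal sketch of that extra check, assuming the response body has already been downloaded with a GET request: treat a 200 response as suspicious if the page contains the "No items found" text. The marker string is taken from the example above and may not match what NCBI serves today, and classify_link is a hypothetical helper name.

```python
# Phrase seen in the body of the bogus-accession response above.
# Assumption: NCBI may change this wording, so treat it as configurable.
EMPTY_SEARCH_MARKER = "no items found"


def classify_link(status_code, body):
    """Classify a link from its HTTP status code and response body text."""
    if status_code != 200:
        return "broken"
    if EMPTY_SEARCH_MARKER in body.lower():
        # The server answered 200, but the search matched nothing.
        return "empty_search"
    return "ok"
```

Links classified as "empty_search" would still need a manual look, since the marker is only a heuristic.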

vpchung commented 2 years ago

@mc2-center/triage-team another nice-to-have. Maybe we can add this to our high-level Jira board too.

aclayton555 commented 2 years ago

Added to Jira here: https://sagebionetworks.jira.com/browse/CPO-288