ROR entry landing page return status code 200, although ROR does not exist

uschindler commented 1 year ago

When you request a link like https://ror.org/032e6b941 (a nonexistent ROR ID), the website returns an error page: "Oops, something went wrong retrieving this ROR ID https://ror.org/032e6b942"; but the HTTP status code is "200".

This seems to be the case for every invalid URL below https://ror.org/. The ROR website's normal "404 not found" page also shows "404" in its title, but the returned status code is 200. For example, try: https://ror.org/foobar

This makes link checking impossible. It also goes against common web principles. Google can detect this as a "Soft 404", but hey: fix this!
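To illustrate why a 200 on an error page defeats a status-based link checker, here is a minimal Python sketch of the kind of content heuristic a checker would otherwise need. The function name `classify_link` and the marker patterns are assumptions made for this example; they are not part of ROR or of any particular link checker.

```python
import re

# Hypothetical markers that suggest an error page despite a 200 status.
# These would have to be tuned for every site being checked.
ERROR_MARKERS = (
    re.compile(r"<title>[^<]*\b404\b[^<]*</title>", re.IGNORECASE),
    re.compile(r"something went wrong", re.IGNORECASE),
)


def classify_link(status_code: int, body: str) -> str:
    """Return 'broken', 'soft-404', or 'ok' for one fetched page."""
    # A proper error status is all a generic link checker needs.
    if status_code >= 400:
        return "broken"
    # With a success status, only a guess based on page content is left.
    if any(pattern.search(body) for pattern in ERROR_MARKERS):
        return "soft-404"
    return "ok"
```

Such heuristics are fragile and site-specific, which is why a correct status code from the server is preferable.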

uschindler commented 1 year ago

Output of curl with HTTP headers:

$ curl -I https://ror.org/032e6b941
HTTP/2 200
content-type: text/html; charset=UTF-8
content-length: 2128
date: Thu, 16 Mar 2023 14:42:57 GMT
last-modified: Wed, 08 Mar 2023 16:57:18 GMT
etag: "2a2e70823a405f67af8b0d90e36f381c"
x-amz-server-side-encryption: AES256
x-amz-meta-etag: xkOkx/8PJ0LbAqwkB5tI7w==
content-encoding: gzip
x-amz-version-id: 8VsPTA4VXpZxnjiuG_j5_QNUhH_IcGdp
accept-ranges: bytes
server: AmazonS3
x-cache: Miss from cloudfront
via: 1.1 bdb480ba487636e194d63f984ed846f2.cloudfront.net (CloudFront)
x-amz-cf-pop: TXL50-P1
x-amz-cf-id: nm19tcmdqlmuqD_0BNEIPoj7vxp_QAWUw1DLDL5PpI-iLbr5RFZRHg==

$ curl -I https://ror.org/foobar
HTTP/2 200
content-type: text/html; charset=UTF-8
content-length: 3906
date: Thu, 16 Mar 2023 14:43:04 GMT
x-amz-meta-etag: m8SzwDXDsf0eIVas6hTRgQ==
content-encoding: gzip
last-modified: Wed, 08 Mar 2023 16:53:10 GMT
etag: "9bc4b3c035c3b1fd1e2156acea14d181"
server: AmazonS3
x-cache: Error from cloudfront
via: 1.1 cd23c1917193b2e0c41e6fae756e0912.cloudfront.net (CloudFront)
x-amz-cf-pop: TXL50-P1
x-amz-cf-id: JmtbWXE_NHqij-5giCr7qTcwZvvEFTD8rD4b3_GVDg3AT6YlDlkF8w==

lizkrznarich commented 1 year ago

Hi @uschindler, thanks for raising this. You're right that this should return a 404, and I'll fix that. In the meantime, if you need to validate ROR IDs, you can use the API, e.g. curl 'https://api.ror.org/organizations/https://ror.org/032e6b941', curl 'https://api.ror.org/organizations/ror.org/032e6b941' or curl 'https://api.ror.org/organizations/032e6b941'.

This will return the correct HTTP status code and error message.

curl -v 'https://api.ror.org/organizations/https://ror.org/032e6b941'
*   Trying 54.246.165.14...
* TCP_NODELAY set
* Connected to api.ror.org (54.246.165.14) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=ror.org
*  start date: Feb 13 00:00:00 2023 GMT
*  expire date: Oct 29 23:59:59 2023 GMT
*  subjectAltName: host "api.ror.org" matched cert's "*.ror.org"
*  issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M01
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x7fd27400d600)
> GET /organizations/https://ror.org/032e6b941 HTTP/2
> Host: api.ror.org
> User-Agent: curl/7.64.1
> Accept: */*
> 
* Connection state changed (MAX_CONCURRENT_STREAMS == 128)!
< HTTP/2 404 
< date: Thu, 16 Mar 2023 16:10:01 GMT
< content-type: application/json
< content-length: 64
< status: 404 Not Found
< vary: Cookie, Origin
< x-frame-options: SAMEORIGIN
< allow: GET, HEAD, OPTIONS
< x-powered-by: Phusion Passenger 6.0.7
< server: nginx/1.18.0 + Phusion Passenger 6.0.7
< 
* Connection #0 to host api.ror.org left intact
{"errors":["ROR ID 'https://ror.org/032e6b941' does not exist"]}* Closing connection 0
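The API-based check above could be scripted roughly as follows. This is a sketch that assumes the endpoint keeps behaving as in the transcript (200 for a known ID, 404 for an unknown one); the helper names `api_url` and `ror_id_exists` are made up for this example.

```python
import urllib.error
import urllib.request

# Base URL taken from the curl examples in this thread.
API_BASE = "https://api.ror.org/organizations/"


def api_url(ror_id: str) -> str:
    """Build the lookup URL; per the thread, the API accepts the full
    https://ror.org/... form, the ror.org/... form, or the bare ID."""
    return API_BASE + ror_id


def ror_id_exists(ror_id: str, timeout: float = 10.0) -> bool:
    """Return True on HTTP 200 and False on HTTP 404; re-raise other errors."""
    try:
        with urllib.request.urlopen(api_url(ror_id), timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise
```

For example, `ror_id_exists("032e6b941")` would be expected to return `False`, given the 404 response shown above.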

uschindler commented 1 year ago

Hi, thanks for the quick feedback. I am aware that you can use the dedicated API for link checking, but that does not work for a generic checker of persistent identifiers or links. So we are just waiting for you to fix the issue instead of temporarily implementing a workaround in the software behind https://www.pangaea.de.

lizkrznarich commented 1 year ago

@uschindler This is now fixed. Note that the error page in the UI has changed. Since the search UI is a SPA hosted in S3 and served via CloudFront, the 404 response comes from CloudFront. A Lambda@Edge script checks every origin request whose path begins with 0* against the ROR API and either processes it or returns the fallback CloudFront 404 error, depending on the API response.

curl -v 'https://ror.org/032e6b941'
*   Trying 18.160.181.71...
* TCP_NODELAY set
* Connected to ror.org (18.160.181.71) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=ror.org
*  start date: Feb 28 00:00:00 2023 GMT
*  expire date: Nov  2 23:59:59 2023 GMT
*  subjectAltName: host "ror.org" matched cert's "ror.org"
*  issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M01
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x7ff21a00c600)
> GET /032e6b941 HTTP/2
> Host: ror.org
> User-Agent: curl/7.64.1
> Accept: */*
> 
* Connection state changed (MAX_CONCURRENT_STREAMS == 128)!
< HTTP/2 404 
< content-type: text/html; charset=UTF-8
< content-length: 3925
< date: Wed, 29 Mar 2023 18:18:13 GMT
< x-amz-meta-etag: 9exuJRuNg/JzbQJaf8s4yQ==
< content-encoding: gzip
< last-modified: Tue, 28 Mar 2023 14:36:42 GMT
< etag: "dd59e1b2417ffd8383be4960bf201921"
< server: AmazonS3
< x-cache: Error from cloudfront
< via: 1.1 c77aeab3024e2cd98690f252e49562ac.cloudfront.net (CloudFront)
< x-amz-cf-pop: MSP50-P2
< x-amz-cf-id: GiIvh84P2Or8nWEfoWzGgJemTLIJ7ng8DmYBkbwmRI05LvwYxxQfyg==
< 
Warning: Binary output can mess up your terminal. Use "--output -" to tell 
Warning: curl to output it to your terminal anyway, or consider "--output 
Warning: <FILE>" to save to a file.
* Failed writing body (0 != 3925)
* stopped the pause stream!
* Connection #0 to host ror.org left intact
* Closing connection 0
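The routing decision described above (check paths beginning with "0" against the ROR API, then either process the request or fall back to the default 404) can be sketched as a plain function. This is not the actual Lambda@Edge script: the name `route_request`, the string return values, and the injected `ror_exists` callable are assumptions for illustration, and CloudFront event handling is left out entirely.

```python
from typing import Callable


def route_request(path: str, ror_exists: Callable[[str], bool]) -> str:
    """Return 'forward' to pass the request on to the origin,
    or 'fallback-404' to let the default error response apply."""
    candidate = path.lstrip("/")
    # Only paths that start with "0" are treated as possible ROR IDs.
    if not candidate.startswith("0"):
        return "forward"
    # ror_exists stands in for the call to the ROR API.
    return "forward" if ror_exists(candidate) else "fallback-404"
```

Injecting the lookup as a callable keeps the decision testable without network access; a real deployment would replace it with a request to the API.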

https://github.com/ror-community/ror-app/pull/187 https://github.com/ror-community/new-deployment/pull/123 https://github.com/ror-community/new-deployment/pull/121

uschindler commented 1 year ago

Hi @lizkrznarich, thank you very much. It looks much better now.

Maybe change the CloudFront-generated 404 page a bit to mention that the ROR ID is not valid. Or is it only possible to have a single 404 page for a whole domain?

Uwe

lizkrznarich commented 1 year ago

@uschindler We're limited to one default CloudFront 404 configuration, since the info site and search app share the same domain. I agree that a specific 404 message would be better for the "ROR ID not found" case. However, there are limitations on what you can alter in the request and response in a single Lambda@Edge function. For example, you can either return a custom HTTP response OR alter a request and forward it on to a given origin in a single Lambda@Edge function, but you can't conditionally include both outcomes. You can, however, forward a request to the origin or allow it to fall back to the default error response in a single function, so that's the compromise I've implemented. I'm looking at chaining together multiple functions, though I'm a bit concerned that it will increase latency and cost.

uschindler commented 1 year ago

Another idea would be to include a bit of JavaScript in the 404 page that checks whether the URI path starts with "/0" and, if so, injects a different text.

Anyway, I am fine with the current state.

ror-community / ror-roadmap

ROR entry landing page return status code 200, although ROR does not exist #153