pulibrary / princeton_ansible

Ansible Roles and Playbooks for Princeton University Library
10 stars 4 forks source link

[nginx] Nginxplus should pass through the error status code provided by the upstream server instead of a generic 200. #2227

Closed hackartisan closed 1 year ago

hackartisan commented 3 years ago

During an outage of many of our machines, nginx appears to have returned the stock error page with a 200 status. The underlying error appears to be:

2021/04/26 15:03:09 [error] 22494#22494: *12797085 no live upstreams while connecting to upstream, client: 128.112.202.197, server: findingaids-beta.princeton.edu, request: "GET /catalog/CO2_trunks_E-B+.json HTTP/1.1", upstream: "http://pulfalight-prod/catalog/CO2_trunks_E-B+.json", host: "findingaids-beta.princeton.edu"

This should return as a non-200. If it had returned a non-200 figgy would have appropriately interpreted it as an error (see https://github.com/pulibrary/figgy/blob/186fadca59e3dfea86068d53cfbb669334373d57/app/services/pul_metadata_services/client.rb#L26) However, it tried to parse it as JSON. Relevant honeybadger issue: https://app.honeybadger.io/projects/53391/faults/79044416

some datadog logs: https://app.datadoghq.com/logs?context_event=AQAAAXkOtKOwCb6NGQAAAABBWGtPdEt5TUFBQVBYR09RaVVYbWJnQVI&event=AQAAAXkOtJ_ICb6NDQAAAABBWGtPdEt5TUFBQVBYR09RaVVYbWJnQUY&from_ts=1618931463021&index=&live=false&messageDisplay=inline&query=host%3Alib-adc1%20service%3Anginxplus%20json&stream_sort=desc&to_ts=1619449393000&viz=

rlskoeser commented 3 years ago

I'm seeing similar behavior with CDH staging sites that are IP restricted, and suspect it may also be happening with errors being served to google and bing search bots (google search console is reporting a suspiciously large number of "soft 404s" for the CDH website, meaning content is not found but it was not served with a 404 or other status).

You can easily duplicate the problem by using curl on an IP restricted site while not on VPN, e.g.:

curl -I https://test-geniza.cdh.princeton.edu

HTTP/2 200
server: nginx/1.19.10
date: Mon, 07 Jun 2021 16:43:00 GMT
content-type: text/html
content-length: 2489
last-modified: Fri, 04 Jun 2021 20:39:29 GMT
etag: "60ba8f81-9b9"
accept-ranges: bytes

I see that both prod and staging maintenance.conf @maintenance locations include a =502 directive.

Based on this stackoverflow answer I think maybe it is returning 200 because the try_files succeeds on finding index.html, although that's not clear to me from the nginx documentation for try_files

rlskoeser commented 3 years ago

Possibly related to this change: I'd like to see a cache-control header that sets a shorter life for this page. Right now my browser caches it and I have to do a hard-refresh to get the real content instead of the maintenance page. (Currently it has last-modified and etag headers which are probably based on the html file, but that aren't really relevant for an error/maintenance page.)

Most often this problem happens when I try to access something that requires vpn before I'm actually on vpn: try to access the site, get the maintenance page, login to vpn, reload the page — still maintenance page; hard refresh — see the actual content.

acozine commented 2 years ago

Nginxplus should pass through the error status code provided by the upstream server instead of a generic 200.

acozine commented 1 year ago

We ran into this challenge again with the recent outage of the approvals site. If we can't pass the underlying error through, can we at least pass something that's not a 200? Even a 500 error would be an improvement.

acozine commented 1 year ago

Also related to #4073 because we'd like to return a custom page for anyone who gets rate-limited. And so also related to https://github.com/pulibrary/orangelight/issues/3638.

acozine commented 1 year ago

An expert from F5 showed me how to configure this correctly.

We just needed to delete the =500, including the =, and then any site config that uses errors.conf (instead of <env>-maintenance.conf) and has proxy_intercept_errors on; in its location definition will pass any error from the upstream server to the client.

I know @rlskoeser doesn't want to use this functionality on all sites, or possibly for all error codes. And I plan to create an option for this in the ansible template for nginx site configs. So there's more work to do to make this flexible. But now we understand how to do it!

rlskoeser commented 1 year ago

Congratulations again @acozine for solving this! 🌮 🎉 👏

Documenting here what I mentioned on Slack:

I don't want the generic message to override anything we have customized at the application level; for us that's usually 404 and 500, but I think 401 and 403 may have django-specific behaviors we'd want to preserve, depending on the application. I'd be glad to coordinate with you at some point to test integrating this new setup with one of my test sites so we can figure out a good configuration recipe that will work across CDH sites.

acozine commented 1 year ago

Closed by #4121, #4122, and #4140.

Remaining related work detailed in https://github.com/Princeton-CDH/cdh-ansible/issues/60 and in #4142.