pulibrary / princeton_ansible

Ansible Roles and Playbooks for Princeton University Library

[nginxplus] Why did upgrading app_protect break our nginx config? #4893

Open acozine opened 4 months ago

acozine commented 4 months ago

In a recent incident, we brought our load balancers down by upgrading the apt package app_protect. See this incident doc.

We needed to get the production load balancers back up quickly, so we stopped using app_protect in production. We were only using the package on a few staging servers. We manually updated nginx.conf on the production LBs so they would not load app_protect, then we commented it out in the individual site configs in #4867.

Now that we have dev/test/staging load balancers, let's investigate what happened and how to get app_protect working again. Why did the upgrade break our existing configuration? What configuration changes would be needed to use app_protect successfully with the latest apt version?

kayiwa commented 3 months ago

I was able to see this live today.

Setting up nginx-plus (32-1~jammy) ...
{
  "softwareVersion": "4.10.0",
  "componentVersions": {
    "wafEngineVersion": "11.48.0",
    "wafNginxVersion": "5.48.0"
  },
  "error_message": "Bot Signature File update failed. Error: Failed to unpack /opt/app_protect/var/update_files/bot_signatures/bot_signatures.bin.tgz: 'tar (child): gzip: Cannot exec: No such file or directory\ntar (child): Error is not recoverable: exiting now\n/bin/tar: Child returned status 2\n/bin/tar: Error is not recoverable: exiting now\n'",
  "completed_successfully": false,
  "event": "configuration_load_failure"
}
nginx: configuration file /etc/nginx/nginx.conf test failed
invoke-rc.d: initscript nginx, action "upgrade" failed.
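
The `error_message` above says tar could not exec gzip while unpacking the bot signature bundle. One thing worth ruling out is whether gzip is installed and on the PATH of the process running the upgrade. Below is a minimal sketch of that check. The function name `missing_tools` is hypothetical (it is not part of this repository), and this is not a confirmed root cause.

```shell
# Hypothetical pre-upgrade check (not part of princeton_ansible).
# Prints "ok" and returns 0 if every named command is on PATH;
# otherwise prints the missing names and returns 1.
missing_tools() {
  missing=""
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="$missing $tool"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "ok"
}

# tar shells out to gzip to unpack .tgz files, so check both.
missing_tools tar gzip || echo "install the missing tools before upgrading" >&2
```

This only inspects the PATH of the shell that runs it. The environment used by the package's post-install script may differ.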

kayiwa commented 1 week ago

Yet another failure: nginx dumping core.

● nginx.service - NGINX Plus - high performance web server
     Loaded: loaded (/lib/systemd/system/nginx.service; enabled; vendor preset: enabled)
     Active: failed (Result: core-dump) since Wed 2024-08-28 16:27:31 UTC; 5 days ago
       Docs: https://www.nginx.com/resources/
   Main PID: 3889193 (code=dumped, signal=SEGV)

Aug 27 18:39:53 lib-adc1 nginx[3889184]: nginx: [warn] conflicting server name "cdh-test-d>
Aug 27 18:39:53 lib-adc1 nginx[3889184]: nginx: [warn] could not build optimal server_name>
Aug 27 18:39:53 lib-adc1 systemd[1]: Started NGINX Plus - high performance web server.
Aug 28 14:55:12 lib-adc1 systemd[1]: Reloading NGINX Plus - high performance web server.
Aug 28 14:55:12 lib-adc1 systemd[1]: Reloaded NGINX Plus - high performance web server.
Aug 28 16:27:17 lib-adc1 systemd[1]: Reloading NGINX Plus - high performance web server.
Aug 28 16:27:17 lib-adc1 systemd[1]: Reloaded NGINX Plus - high performance web server.
Aug 28 16:27:31 lib-adc1 systemd[1]: nginx.service: Main process exited, code=dumped, stat>
Aug 28 16:27:31 lib-adc1 systemd[1]: nginx.service: Failed with result 'core-dump'.
Aug 28 16:33:08 lib-adc1 systemd[1]: nginx.service: Unit cannot be reloaded because it is

kayiwa commented 1 week ago

Fixed by starting it with:

sudo systemctl start nginx
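
The journal output earlier in this thread shows a reload being refused after the main process had exited, which fits systemd only reloading units that are running. Here is a rough sketch of choosing between `reload` and `start` based on `systemctl is-active`. `nginx_action` is a hypothetical helper name, and the script only prints the command it would run.

```shell
# Hypothetical helper: map a `systemctl is-active` state to a systemctl verb.
# A running unit can be reloaded; anything else (failed, inactive, unknown)
# needs a start.
nginx_action() {
  case "$1" in
    active) echo reload ;;
    *)      echo start ;;
  esac
}

# is-active exits non-zero when the unit is not active, so tolerate that.
state=$(systemctl is-active nginx 2>/dev/null || true)
echo "would run: sudo systemctl $(nginx_action "$state") nginx"
```

Running `sudo nginx -t` first would show whether the configuration still fails to load before starting the service.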