phusion / passenger

A fast and robust web server and application server for Ruby, Python and Node.js
https://www.phusionpassenger.com/

Passenger Enterprise zero-downtime restarts & request routing can prolong outages and delay recovery #2551

Open dannyfallon opened 1 week ago

dannyfallon commented 1 week ago


Problem Statement

Passenger's request routing prioritises the least-busy process in an ordered list of processes. Within each application group the worker pool presents its processes oldest first, which means that among the available workers the selection will always favour the oldest one. This is explained as a deliberate choice to benefit application-level caching, memory optimisation, JIT warm-up, etc.
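To make the bias concrete, here's a minimal Python sketch of that selection rule (purely illustrative, not Passenger's actual implementation): the least-busy process wins, and ties on busyness always resolve to the oldest process because the pool is ordered oldest first.

```python
from dataclasses import dataclass

@dataclass
class Process:
    pid: int
    started_at: float  # lower = older
    sessions: int      # in-flight requests, i.e. "busyness"

def route(pool: list[Process]) -> Process:
    # The pool is kept oldest-first per application group; taking the minimum
    # of (sessions, started_at) means every tie on busyness goes to the oldest
    # process, so the head of the list stays hot and the tail stays cold.
    return min(pool, key=lambda p: (p.sessions, p.started_at))

# 96 mostly idle workers: the oldest one wins every time it is free.
pool = [Process(pid=1000 + i, started_at=i, sessions=0) for i in range(96)]
print(route(pool).pid)  # => 1000
```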

This starts to break down during a zero-downtime restart. Once we get past swapping the application preloader and ensure we're not going to resist the deployment (i.e. the new code can boot/ack), it's time to roll some processes. Passenger picks the oldest worker in the pool - our hottest worker - first. When it replaces this process, the new worker is put at the end of the worker pool. The cycle repeats until all old workers are replaced.

Because the Passenger routing algorithm has no concept of new workers vs. old workers and remains biased towards the oldest worker first, two things happen (sketched in the toy model after this list):

  1. Requests during the deployment are given to processes running old code that will die very soon. Unless workers are regularly at 100% utilisation, the processes serving these requests do a bunch of cache warming and JIT work that is thrown away within the next few minutes.
  2. The new code is delayed coming into service unless Passenger's pool of workers is regularly 100% utilised. During application issues and outages where the resolution is to deploy another version of the application, Passenger can prolong recovery by not biasing towards the newest processes.
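Here's the toy rolling-restart model referenced above (illustrative only, not Passenger code; it assumes the head of the pool is always free, which is roughly our ~50% utilisation case). Almost all traffic during the restart lands on soon-to-die old processes, and the new code only takes meaningful traffic once the last old worker is gone.

```python
# Toy model of a rolling restart under oldest-first routing.
pool = ["old"] * 96
requests_to_old = requests_to_new = 0

for _ in range(96):                      # one worker replaced per step
    pool = pool[1:] + ["new"]            # head is killed; replacement joins the tail
    for _ in range(20):                  # traffic arriving between replacements
        target = pool[0]                 # least-busy + oldest ~= head of the pool
        if target == "old":
            requests_to_old += 1         # consequence 1: warming code that will die soon
        else:
            requests_to_new += 1         # consequence 2: new code barely used until the end

print(requests_to_old, requests_to_new)  # => 1900 20
```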

Expected behaviour

Given that the desired end state is to have the newer application code running, I expect Passenger to bias towards new workers where available.

Actual behaviour

Passenger continues to route requests to older processes running old code. These processes are potentially, or in our case very likely, cold in terms of application caching, YJIT, etc.

Potential solution

If Passenger had a concept of an application version (monotonically increasing, starting at 1), then the order of the pool of workers could be version DESC, age/PID ASC when routing requests. Triggering a zero-downtime restart would increase the application version and therefore processes started after the restart was triggered would be preferred, in oldest PID order to maintain the optimisations Passenger desires for application-level caching.
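A rough sketch of that ordering (the `app_version` field is hypothetical - Passenger doesn't currently track anything like it): sort least-busy first, then version descending, then age ascending within a version.

```python
from dataclasses import dataclass

@dataclass
class Process:
    pid: int
    started_at: float
    sessions: int
    app_version: int  # hypothetical field, bumped on each zero-downtime restart

def route(pool):
    # Least busy first, then newest app version, then oldest process within
    # that version, so the existing cache/JIT bias is preserved among
    # same-version workers.
    return min(pool, key=lambda p: (p.sessions, -p.app_version, p.started_at))

old = [Process(1000 + i, i, 0, app_version=1) for i in range(90)]
new = [Process(2000 + i, 100 + i, 0, app_version=2) for i in range(6)]
print(route(old + new).pid)  # => 2000, the oldest of the new-version workers
```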

For a little while during the restart Passenger probably won't have enough application processes on the new version to cope with the full request load, so some requests will spill over to the older version. I'm not married to this idea, but it might be possible for Passenger to maintain hot old processes if it chose the process to kill during restarts as previous version, age/PID descending - the newest processes on the older version. This of course comes at a cost: the memory used by the hot older processes won't be freed until they are killed last, and meanwhile the newer version's hot processes are warming up, so you could run into memory issues. But if you have enough free memory to cope with it, it would prevent the churn on otherwise mostly idle older processes.
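And the corresponding restart-target selection for that variant, again only a sketch of the idea: among processes still on the previous version, kill the newest first so the warmed-up old processes survive the longest.

```python
from collections import namedtuple

P = namedtuple("P", "pid started_at app_version")  # minimal stand-in for a process

def next_restart_target(pool, current_version):
    # Among processes still on an older version, restart the newest one next,
    # leaving the hot, long-lived old processes until last.
    stale = [p for p in pool if p.app_version < current_version]
    if not stale:
        return None  # rolling restart finished
    return max(stale, key=lambda p: p.started_at)

pool = [P(1000 + i, i, 1) for i in range(90)] + [P(2000 + i, 100 + i, 2) for i in range(6)]
print(next_restart_target(pool, current_version=2).pid)  # => 1089, the newest old-version process
```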

Context

We're running a Rails monolith on Ruby 3.3.3 on EC2 using Flying Passenger. We run 96 passenger processes in our main pool per instance. Generally around peak Passenger's busyness is ~50%:

[graph: Passenger busyness around peak, ~50%]

We run more workers than we need in a static pool. This helps us to smooth out small request spikes as we've got some capacity ready to go and don't need to wait for workers to boot (cold caches notwithstanding). While we have request queues enabled, they're generally empty:

[graph: request queue depth, generally empty]

Here's a normal Passenger status output:

Version : 6.0.18
Date    : 2024-06-21 13:46:43 +0000
Instance: w4jxxfI1 (Flying Passenger 6.0.18)

----------- General information -----------
Max pool size : 512
App groups    : 2
Processes     : 99
Requests in top-level queue : 0

----------- Application groups -----------
main_app:
  App root: /apps/intercom/current
  Requests in queue: 0
  * PID: 1056482   Sessions: 1       Processed: 5908    Uptime: 25m 55s
    CPU: 73%     Memory  : 1036M   Last used: 0s ago
  * PID: 1056542   Sessions: 1       Processed: 5707    Uptime: 25m 53s
    CPU: 70%     Memory  : 1188M   Last used: 0s ago
  * PID: 1056605   Sessions: 1       Processed: 5739    Uptime: 25m 51s
    CPU: 70%     Memory  : 1055M   Last used: 0s ago
  * PID: 1056655   Sessions: 1       Processed: 5875    Uptime: 25m 48s
    CPU: 73%     Memory  : 1152M   Last used: 0s ago
  * PID: 1056773   Sessions: 1       Processed: 5897    Uptime: 25m 46s
    CPU: 72%     Memory  : 1034M   Last used: 0s ago
  * PID: 1056890   Sessions: 1       Processed: 5766    Uptime: 25m 44s
    CPU: 72%     Memory  : 1169M   Last used: 0s ago
  * PID: 1056951   Sessions: 1       Processed: 5913    Uptime: 25m 42s
    CPU: 73%     Memory  : 1132M   Last used: 0s ago
  * PID: 1057044   Sessions: 1       Processed: 5617    Uptime: 25m 40s
    CPU: 71%     Memory  : 1157M   Last used: 0s ago
  * PID: 1057087   Sessions: 1       Processed: 5794    Uptime: 25m 37s
    CPU: 72%     Memory  : 1228M   Last used: 0s ago
  * PID: 1057191   Sessions: 1       Processed: 5560    Uptime: 25m 35s
    CPU: 71%     Memory  : 1145M   Last used: 0s ago
  * PID: 1057369   Sessions: 1       Processed: 5991    Uptime: 25m 33s
    CPU: 73%     Memory  : 1092M   Last used: 0s ago
  * PID: 1057473   Sessions: 1       Processed: 5511    Uptime: 25m 31s
    CPU: 70%     Memory  : 1015M   Last used: 0s ago
  * PID: 1057564   Sessions: 1       Processed: 5970    Uptime: 25m 29s
    CPU: 70%     Memory  : 968M    Last used: 0s ago
  * PID: 1057655   Sessions: 1       Processed: 5586    Uptime: 25m 27s
    CPU: 71%     Memory  : 999M    Last used: 1s ago
  * PID: 1057782   Sessions: 1       Processed: 6079    Uptime: 25m 25s
    CPU: 72%     Memory  : 968M    Last used: 0s ago
  * PID: 1057833   Sessions: 1       Processed: 5851    Uptime: 25m 23s
    CPU: 71%     Memory  : 1165M   Last used: 0s ago
  * PID: 1057958   Sessions: 1       Processed: 5630    Uptime: 25m 21s
    CPU: 70%     Memory  : 1163M   Last used: 1s ago
  * PID: 1058050   Sessions: 1       Processed: 5303    Uptime: 25m 19s
    CPU: 69%     Memory  : 975M    Last used: 0s ago
  * PID: 1058278   Sessions: 0       Processed: 5361    Uptime: 25m 16s
    CPU: 68%     Memory  : 1085M   Last used: 0s ago
  * PID: 1058428   Sessions: 1       Processed: 5521    Uptime: 25m 14s
    CPU: 69%     Memory  : 1027M   Last used: 0s ago
  * PID: 1058492   Sessions: 1       Processed: 5759    Uptime: 25m 12s
    CPU: 70%     Memory  : 1046M   Last used: 0s ago
  * PID: 1058623   Sessions: 1       Processed: 5436    Uptime: 25m 8s
    CPU: 70%     Memory  : 1024M   Last used: 12s ago
  * PID: 1058733   Sessions: 1       Processed: 5556    Uptime: 25m 4s
    CPU: 71%     Memory  : 1005M   Last used: 0s ago
  * PID: 1058807   Sessions: 1       Processed: 5599    Uptime: 25m 2s
    CPU: 70%     Memory  : 1034M   Last used: 2s ago
  * PID: 1058911   Sessions: 1       Processed: 5598    Uptime: 24m 59s
    CPU: 70%     Memory  : 1153M   Last used: 0s ago
  * PID: 1058999   Sessions: 1       Processed: 5533    Uptime: 24m 57s
    CPU: 68%     Memory  : 975M    Last used: 0s ago
  * PID: 1059170   Sessions: 1       Processed: 5191    Uptime: 24m 55s
    CPU: 70%     Memory  : 1120M   Last used: 0s ago
  * PID: 1059303   Sessions: 1       Processed: 5431    Uptime: 24m 52s
    CPU: 69%     Memory  : 1070M   Last used: 0s ago
  * PID: 1059386   Sessions: 1       Processed: 5286    Uptime: 24m 50s
    CPU: 68%     Memory  : 1211M   Last used: 0s ago
  * PID: 1059474   Sessions: 1       Processed: 5262    Uptime: 24m 48s
    CPU: 67%     Memory  : 1062M   Last used: 0s ago
  * PID: 1059528   Sessions: 1       Processed: 5196    Uptime: 24m 46s
    CPU: 69%     Memory  : 1150M   Last used: 0s ago
  * PID: 1059686   Sessions: 1       Processed: 5098    Uptime: 24m 44s
    CPU: 68%     Memory  : 1142M   Last used: 0s ago
  * PID: 1059827   Sessions: 1       Processed: 5081    Uptime: 24m 41s
    CPU: 68%     Memory  : 990M    Last used: 0s ago
  * PID: 1059948   Sessions: 1       Processed: 5277    Uptime: 24m 39s
    CPU: 69%     Memory  : 1053M   Last used: 0s ago
  * PID: 1060276   Sessions: 1       Processed: 5144    Uptime: 24m 33s
    CPU: 66%     Memory  : 1012M   Last used: 0s ago
  * PID: 1060337   Sessions: 1       Processed: 5215    Uptime: 24m 31s
    CPU: 67%     Memory  : 976M    Last used: 0s ago
  * PID: 1060462   Sessions: 1       Processed: 5086    Uptime: 24m 29s
    CPU: 66%     Memory  : 1061M   Last used: 0s ago
  * PID: 1060640   Sessions: 1       Processed: 4986    Uptime: 24m 27s
    CPU: 62%     Memory  : 987M    Last used: 0s ago
  * PID: 1060731   Sessions: 1       Processed: 4567    Uptime: 24m 24s
    CPU: 63%     Memory  : 1163M   Last used: 1s ago
  * PID: 1060800   Sessions: 1       Processed: 4503    Uptime: 24m 22s
    CPU: 62%     Memory  : 984M    Last used: 1s ago
  * PID: 1060866   Sessions: 1       Processed: 4217    Uptime: 24m 20s
    CPU: 57%     Memory  : 1100M   Last used: 1s ago
  * PID: 1060955   Sessions: 1       Processed: 3763    Uptime: 24m 18s
    CPU: 52%     Memory  : 1120M   Last used: 0s ago
  * PID: 1061052   Sessions: 1       Processed: 3407    Uptime: 24m 15s
    CPU: 44%     Memory  : 992M    Last used: 0s ago
  * PID: 1061127   Sessions: 1       Processed: 2740    Uptime: 24m 11s
    CPU: 37%     Memory  : 975M    Last used: 0s ago
  * PID: 1061175   Sessions: 0       Processed: 1930    Uptime: 24m 9s
    CPU: 29%     Memory  : 896M    Last used: 0s ago
  * PID: 1061246   Sessions: 0       Processed: 1407    Uptime: 24m 6s
    CPU: 21%     Memory  : 1036M   Last used: 1s ago
  * PID: 1061370   Sessions: 0       Processed: 868     Uptime: 24m 5s
    CPU: 12%     Memory  : 793M    Last used: 1s ago
  * PID: 1061591   Sessions: 0       Processed: 472     Uptime: 24m 2s
    CPU: 7%      Memory  : 721M    Last used: 17s ago
  * PID: 1061754   Sessions: 0       Processed: 228     Uptime: 24m 0s
    CPU: 3%      Memory  : 594M    Last used: 23s ago
  * PID: 1061831   Sessions: 0       Processed: 102     Uptime: 23m 58s
    CPU: 2%      Memory  : 557M    Last used: 43s ago
  * PID: 1061910   Sessions: 0       Processed: 51      Uptime: 23m 56s
    CPU: 1%      Memory  : 501M    Last used: 43s ago
  * PID: 1061993   Sessions: 0       Processed: 23      Uptime: 23m 54s
    CPU: 1%      Memory  : 493M    Last used: 3m 27s ag
  * PID: 1062069   Sessions: 0       Processed: 9       Uptime: 23m 51s
    CPU: 0%      Memory  : 405M    Last used: 10m 33s a
  * PID: 1062119   Sessions: 0       Processed: 1       Uptime: 23m 49s
    CPU: 0%      Memory  : 360M    Last used: 16m 19s a
  * PID: 1062281   Sessions: 0       Processed: 1       Uptime: 23m 48s
    CPU: 0%      Memory  : 350M    Last used: 16m 19s a
  * PID: 1062394   Sessions: 0       Processed: 0       Uptime: 23m 45s
    CPU: 0%      Memory  : 264M    Last used: 23m 45s a
  * PID: 1062572   Sessions: 0       Processed: 0       Uptime: 23m 43s
    CPU: 0%      Memory  : 264M    Last used: 23m 43s a
  * PID: 1062775   Sessions: 0       Processed: 0       Uptime: 23m 41s
    CPU: 0%      Memory  : 264M    Last used: 23m 41s a
  * PID: 1062908   Sessions: 0       Processed: 0       Uptime: 23m 39s
    CPU: 0%      Memory  : 264M    Last used: 23m 39s a
  * PID: 1062980   Sessions: 0       Processed: 0       Uptime: 23m 35s
    CPU: 0%      Memory  : 264M    Last used: 23m 35s a
  * PID: 1063056   Sessions: 0       Processed: 0       Uptime: 23m 33s
    CPU: 0%      Memory  : 264M    Last used: 23m 33s a
  * PID: 1063175   Sessions: 0       Processed: 0       Uptime: 23m 30s
    CPU: 0%      Memory  : 264M    Last used: 23m 30s a
  * PID: 1063289   Sessions: 0       Processed: 0       Uptime: 23m 28s
    CPU: 0%      Memory  : 264M    Last used: 23m 28s a
  * PID: 1063339   Sessions: 0       Processed: 0       Uptime: 23m 26s
    CPU: 0%      Memory  : 264M    Last used: 23m 26s a
  * PID: 1063399   Sessions: 0       Processed: 0       Uptime: 23m 24s
    CPU: 0%      Memory  : 264M    Last used: 23m 24s a
  * PID: 1063475   Sessions: 0       Processed: 0       Uptime: 23m 22s
    CPU: 0%      Memory  : 264M    Last used: 23m 22s a
  * PID: 1063526   Sessions: 0       Processed: 0       Uptime: 23m 19s
    CPU: 0%      Memory  : 264M    Last used: 23m 19s a
  * PID: 1063684   Sessions: 0       Processed: 0       Uptime: 23m 13s
    CPU: 0%      Memory  : 264M    Last used: 23m 13s a
  * PID: 1063722   Sessions: 0       Processed: 0       Uptime: 23m 11s
    CPU: 0%      Memory  : 264M    Last used: 23m 11s a
  * PID: 1063853   Sessions: 0       Processed: 0       Uptime: 23m 9s
    CPU: 0%      Memory  : 264M    Last used: 23m 9s ago
  * PID: 1063980   Sessions: 0       Processed: 0       Uptime: 23m 7s
    CPU: 0%      Memory  : 264M    Last used: 23m 7s ago
  * PID: 1064055   Sessions: 0       Processed: 0       Uptime: 23m 5s
    CPU: 0%      Memory  : 264M    Last used: 23m 5s ago
  * PID: 1064253   Sessions: 0       Processed: 0       Uptime: 23m 3s
    CPU: 0%      Memory  : 264M    Last used: 23m 3s ago
  * PID: 1064352   Sessions: 0       Processed: 0       Uptime: 23m 1s
    CPU: 0%      Memory  : 264M    Last used: 23m 1s ago
  * PID: 1064518   Sessions: 0       Processed: 0       Uptime: 22m 59s
    CPU: 0%      Memory  : 264M    Last used: 22m 59s a
  * PID: 1064572   Sessions: 0       Processed: 0       Uptime: 22m 57s
    CPU: 0%      Memory  : 264M    Last used: 22m 57s a
  * PID: 1064646   Sessions: 0       Processed: 0       Uptime: 22m 55s
    CPU: 0%      Memory  : 264M    Last used: 22m 55s a
  * PID: 1064775   Sessions: 0       Processed: 0       Uptime: 22m 53s
    CPU: 0%      Memory  : 264M    Last used: 22m 53s a
  * PID: 1064849   Sessions: 0       Processed: 0       Uptime: 22m 51s
    CPU: 0%      Memory  : 264M    Last used: 22m 51s a
  * PID: 1064945   Sessions: 0       Processed: 0       Uptime: 22m 48s
    CPU: 0%      Memory  : 264M    Last used: 22m 48s a
  * PID: 1065050   Sessions: 0       Processed: 0       Uptime: 22m 46s
    CPU: 0%      Memory  : 264M    Last used: 22m 46s a
  * PID: 1065186   Sessions: 0       Processed: 0       Uptime: 22m 43s
    CPU: 0%      Memory  : 264M    Last used: 22m 43s a
  * PID: 1065378   Sessions: 0       Processed: 0       Uptime: 22m 41s
    CPU: 0%      Memory  : 264M    Last used: 22m 41s a
  * PID: 1065477   Sessions: 0       Processed: 0       Uptime: 22m 39s
    CPU: 0%      Memory  : 264M    Last used: 22m 39s a
  * PID: 1065551   Sessions: 0       Processed: 0       Uptime: 22m 37s
    CPU: 0%      Memory  : 264M    Last used: 22m 37s a
  * PID: 1065736   Sessions: 0       Processed: 0       Uptime: 22m 35s
    CPU: 0%      Memory  : 265M    Last used: 22m 35s a
  * PID: 1065855   Sessions: 0       Processed: 0       Uptime: 22m 33s
    CPU: 0%      Memory  : 265M    Last used: 22m 33s a
  * PID: 1065971   Sessions: 0       Processed: 0       Uptime: 22m 31s
    CPU: 0%      Memory  : 265M    Last used: 22m 31s a
  * PID: 1066129   Sessions: 0       Processed: 0       Uptime: 22m 29s
    CPU: 0%      Memory  : 265M    Last used: 22m 29s a
  * PID: 1066237   Sessions: 0       Processed: 0       Uptime: 22m 26s
    CPU: 0%      Memory  : 265M    Last used: 22m 26s a
  * PID: 1066310   Sessions: 0       Processed: 0       Uptime: 22m 24s
    CPU: 0%      Memory  : 265M    Last used: 22m 24s a
  * PID: 1066396   Sessions: 0       Processed: 0       Uptime: 22m 22s
    CPU: 0%      Memory  : 265M    Last used: 22m 22s a
  * PID: 1066444   Sessions: 0       Processed: 0       Uptime: 22m 21s
    CPU: 0%      Memory  : 265M    Last used: 22m 21s a
  * PID: 1066537   Sessions: 0       Processed: 0       Uptime: 22m 18s
    CPU: 0%      Memory  : 265M    Last used: 22m 18s a
  * PID: 1066603   Sessions: 0       Processed: 0       Uptime: 22m 16s
    CPU: 0%      Memory  : 265M    Last used: 22m 16s a
  * PID: 1068441   Sessions: 0       Processed: 0       Uptime: 20m 41s
    CPU: 0%      Memory  : 104M    Last used: 20m 41s a

The last processes, 1066603 and 1068441, are sitting bone idle with 0 requests processed. A rolling restart is triggered, selecting the process at the start of the pool for rotation first:

Version : 6.0.18
Date    : 2024-06-21 13:50:09 +0000
Instance: w4jxxfI1 (Flying Passenger 6.0.18)

----------- General information -----------
Max pool size : 512
App groups    : 2
Processes     : 99
Requests in top-level queue : 0
Restarting process (pid=1056482, group=main_app) in group main_app...

----------- Application groups -----------
main_app:
  App root: /apps/intercom/current
  Requests in queue: 0
  * PID: 1056482   Sessions: 1       Processed: 6687    Uptime: 29m 21s
    CPU: 73%     Memory  : 1075M   Last used: 0s ago
    Rolling restarting...
  * PID: 1056542   Sessions: 1       Processed: 6589    Uptime: 29m 19s
    CPU: 71%     Memory  : 1141M   Last used: 0s ago
  * PID: 1056605   Sessions: 1       Processed: 6605    Uptime: 29m 17s
    CPU: 71%     Memory  : 1069M   Last used: 0s ago
  * PID: 1056655   Sessions: 1       Processed: 6802    Uptime: 29m 14s
    CPU: 73%     Memory  : 1165M   Last used: 2s ago

Towards the end of the restart you can see the request counts for PIDs 1066603 and 1068441 are now in the hundreds and their memory has jumped by about 500MB apiece, since they are now the top processes for serving requests whenever they're free.

Version : 6.0.18
Date    : 2024-06-21 13:54:27 +0000
Instance: w4jxxfI1 (Flying Passenger 6.0.18)

----------- General information -----------
Max pool size : 512
App groups    : 2
Processes     : 99
Requests in top-level queue : 0
Restarting process (pid=1066603, group=main_app) in group main_app...

----------- Application groups -----------
main_app:
  App root: /apps/intercom/current
  Requests in queue: 0
  * PID: 1066603   Sessions: 1       Processed: 411     Uptime: 30m 0s
    CPU: 4%      Memory  : 648M    Last used: 1s ago
    Rolling restarting...
  * PID: 1068441   Sessions: 1       Processed: 414     Uptime: 28m 25s
    CPU: 5%      Memory  : 726M    Last used: 1s ago
  * PID: 1086268   Sessions: 1       Processed: 358     Uptime: 6m 54s
    CPU: 20%     Memory  : 736M    Last used: 0s ago
  * PID: 1089798   Sessions: 1       Processed: 349     Uptime: 3m 44s
    CPU: 33%     Memory  : 710M    Last used: 2s ago
  * PID: 1089845   Sessions: 1       Processed: 355     Uptime: 3m 41s
    CPU: 34%     Memory  : 772M    Last used: 0s ago
  * PID: 1089899   Sessions: 1       Processed: 351     Uptime: 3m 39s
    CPU: 31%     Memory  : 755M    Last used: 3s ago
  * PID: 1089968   Sessions: 1       Processed: 288     Uptime: 3m 36s
    CPU: 31%     Memory  : 711M    Last used: 0s ago
  * PID: 1090033   Sessions: 1       Processed: 300     Uptime: 3m 33s
    CPU: 32%     Memory  : 716M    Last used: 0s ago
  * PID: 1090117   Sessions: 1       Processed: 372     Uptime: 3m 31s
    CPU: 32%     Memory  : 719M    Last used: 0s ago
  * PID: 1090229   Sessions: 1       Processed: 233     Uptime: 3m 29s
    CPU: 29%     Memory  : 707M    Last used: 0s ago
  * PID: 1090358   Sessions: 1       Processed: 334     Uptime: 3m 26s
    CPU: 30%     Memory  : 723M    Last used: 0s ago
  * PID: 1090536   Sessions: 1       Processed: 244     Uptime: 3m 24s
    CPU: 32%     Memory  : 713M    Last used: 0s ago
  * PID: 1090692   Sessions: 1       Processed: 274     Uptime: 3m 22s
    CPU: 28%     Memory  : 694M    Last used: 0s ago
dannyfallon commented 1 week ago

A real-world example of Passenger's behaviour causing customer pain was https://www.intercomstatus.com/incidents/zv75pnhpgchh

We initiated a rolling restart to deploy new application code which moved one of our databases to an external database provider. The red vertical line is an approximation of when our deployment tool finishes its work initiating rolling restarts across our entire fleet. It's a useful guide, but not exact:

[graph: request throughput by code version around the deployment, red vertical line marking roughly when rolling restarts were initiated]

For the first minute or so we see no requests on the new code (purple). Then we begin to serve requests with the new code, our latency alarms spike, and successful request throughput drops. At this point the rolling restart is not fully finished - we still have some requests served by the old, good code.

We initiate a rollback in our deployment tool. This swaps the symlink for our app back to the previous version and initiates another rolling restart, which cancels the ongoing one that is putting bad code into service. It appears to us that the behaviour of rolling restarts means that process killing happens as follows (a small sketch follows the list):

  1. Processes left over from the cancelled rolling restart are killed first, as they're the oldest PIDs. These were good processes serving requests normally. They are replaced with processes running the exact same code, but which are now the lowest priority for request handling since they're the newest PIDs.
  2. The processes running the bad code become the highest priority for handling requests given they are now the oldest PIDs. These processes are experiencing high request latency, so Passenger's rolling restart can't kill them quickly; it has to wait for in-flight requests to finish. While it waits for the selected process to drain and be restarted, the other processes running this code remain the most preferred for inbound requests and are serving responses incredibly slowly or timing out.
  3. Given our pool size, it takes ~20 minutes to fully get these bad processes out of the pools and restore full operations.
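Here's the small sketch referenced above (a toy model only, not Passenger code) of why the rollback spends its first replacements on processes that were already fine, while the bad-code processes sit at the head of the pool and stay preferred for routing:

```python
# Pool when the rollback restart starts, ordered oldest-first as Passenger keeps it:
#   - 40 processes the cancelled restart never reached (oldest PIDs, good code)
#   - 56 processes it already replaced (newer PIDs, bad code)
pool = [f"old_good_{i}" for i in range(40)] + [f"bad_{i}" for i in range(56)]

# The rolling restart walks the pool from the head, so it first replaces the
# processes that were already serving good code; each replacement joins the
# tail, where routing prefers it least.
while pool and pool[0].startswith("old_good"):
    pool = pool[1:] + ["fresh_good"]

# Only now does the restart reach the bad processes - and they now occupy the
# head of the pool, i.e. they are the *most* preferred for routing while they
# slowly drain and get replaced.
print(pool[0])  # => 'bad_0'
```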

(Purple in the above graph is yellow below, sorry.) [graph: same deployment with the new-code series shown in yellow]