humanmade / Cavalcade-Runner

Daemon for Cavalcade, a scalable WordPress jobs system.
https://engineering.hmn.md/projects/cavalcade/
53 stars 26 forks source link

Proposal: exit runner completely, defer to Systemd for restart #48

Closed nathanielks closed 6 years ago

nathanielks commented 6 years ago

TL;DR:

I'm concerned about php's ability to be a long running script, so I'm wondering what you would think about the following:

What if, instead of using a while loop and sleeping, we run the script to completion and allow upstart/systemd to manage restarting the process? I'm testing it currently over the last ~3 hours and memory consumption is consistent; it's not growing.

I added the following to the systemd definition:

RestartSec=1
StartLimitBurst=0

I realize means we'd have to spawn a connection to the database every second, which I don't think is terrible, but not ideal either. This could be changed to 5 second or 10, perhaps.


We've been experiencing some issues with memory leaks lately, so I've been doing some testing. I used Xdebug's function tracing ability to see which functions are consuming memory, and here are the top 10 consumers:

PDOStatement->execute                              896  1.2630 33392944  1.2630 33392944
sleep                                              896  895.2948 16176472  895.2948 16176472
PDO->prepare                                       896  1.1511  7825664  1.1511  7825664
HM\Platform\boostrap_cavalcade_runner                1  0.0044   531408  0.0013   220472
HM\Cavalcade\Runner\autoload                         4  0.0011    76504  0.0009    76168
include                                              7  0.0148   575480  0.0017    34376
PDO->__construct                                     1  0.0389    11392  0.0389    11392
HM\Cavalcade\Runner\Runner->bootstrap                1  0.0551   591656  0.0002     4784
HM\Cavalcade\Runner\Hooks->register                  3  0.0003     2704  0.0001     2704
define                                              51  0.0005     2000  0.0005     2000

Seeing as PDOStatement->execute was the top consumer, I went through the code and unset($statement) after each statement and ran Cavalcade for some time. Even with this, I can see memory usage growing.

A flaw in my thought process is that the function trace shows the memory consumed over the course of the script, but not necessarily what's being consumed at that moment. I made use of https://github.com/BitOne/php-meminfo and https://github.com/arnaud-lb/php-memory-profiler, and they unfortunately didn't reveal anything. Well, they did reveal that the application isn't leaking anything because the memory allocated was always well below what was allocated to the process.

One thing that is consistent is that sleep seems to consume a lot of memory. Reading around, I've seen that php isn't really designed to be a long running process, so I'm concerned about it in this instance.

rmccue commented 6 years ago

How bad is the memory leak issue? I'd really rather avoid us doing a hack like just running a process every second, particularly because it means we need to incur (at least partially) the WordPress bootstrap overhead every time. I suspect that our memory leaks are likely due, at some level, to the code I wrote, and not inherently to PHP itself.

If doing unset( $statement ) helps, let's add that, since it's an uncontroversial change to make. There may be more that we can do here as well, like $statement->closeCursor(). We should also test with a fake DB abstraction to see if the memory leaks could be internal to PDO itself; I would be unsurprised if sleep() is consuming memory because PDO sees idle time and does some internal things. (We could look at restarting the DB connection periodically if that helps as well.)

Additionally, what if we call gc_collect_cycles() in the loop? (Maybe every minute rather than every cycle.)

Restarting the process constantly is really the nuclear option, and it makes Cavalcade less portable to other systems, so I'd rather avoid it if at all possible.

nathanielks commented 6 years ago

Aye, all good points. unseting didn't seem to do anything substantial, so I'll give closing the cursor and garbage collecting a try!

nathanielks commented 6 years ago

As far as how bad it is, here's a graph of 4 environments over 3 days:

screen shot 2018-06-27 at 12 20 21 pm

I haven't been able to determine why some environments rise so sharply and others are a much slower burn, but the memory growth is definitely there.

nathanielks commented 6 years ago

Unfortunately, running the garbage collector doesn't clear out the memory.

nathanielks commented 6 years ago

This looks suspiciously similar: https://github.com/doctrine/dbal/issues/3047

nathanielks commented 6 years ago

I was curious if PDO really was the issue, so I switched PDO out for mysqli in the Runner class. This has made a significant difference. Looking at the function traces I can see that, when closing out the cursor, the memory consumed by mysqli is getting cleaned up:

function                                 #calls  time     memory  time     memory
---------------------------------------------------------------------------------
mysqli_stmt->get_result                     630  0.0152 12605776  0.0152 12605776
mysqli->prepare                             630  0.7920  5175392  0.7920  5175392
/// a few other functions
mysqli_stmt->close                          630  0.0188 -5019840  0.0188 -5019840
HM\Cavalcade\Runner\Runner->get_next_job    630  1.8087    50864  0.0448 -12746160

Unfortunately I can also see that sleep is still accruing memory that's not getting cleaned up:

sleep                                       324  324.1002  5327808  324.1002  5327808
// A few minutes later
sleep                                       629  629.1984 10413016  629.1984 10413016

( It grows from 5M to 9M ).

This alone is an improvement. With PDO, I saw that memory was increasing 10M each minute. Switching out for mysqli reduces that to 2M/minute.

I'm going to watch the process over the next day and see what it does. It looks like sleep is still an issue 🤔

nathanielks commented 6 years ago

I wanted to rule out sleep, so I ran this:

php -r 'while (true) { echo (memory_get_peak_usage() / 1024 /1024 ) . "M" . PHP_EOL; sleep(10); }'

The number didn't change over the course of 2 minutes, so that rules out sleep as the offender. It also doesn't show that it's consuming anything:

function             #calls  time     memory  time     memory
-------------------------------------------------------------
memory_get_peak_usage    13  0.0003        0  0.0003        0
sleep                    12  120.0026        0  120.0026        0

This makes me wonder if the debugging tools are attributing memory to sleep that doesn't belong to it 🤔


Regarding the graph from earlier, I have consistently seen the high rise and falls correlate with environments that don't run any jobs.

nathanielks commented 6 years ago

I've stripped the while loop in the runner down to the following and memory consumption still grows:

while ( true ) {
    printf( '%dK' . PHP_EOL, ( memory_get_peak_usage() / 1024 ) );
    sleep( 1 );
    continue;
}

This suggests to me that something outside of the loop is causing the memory to grow because running a while loop by itself like above did not cause memory to grow.

nathanielks commented 6 years ago

Aha! I've narrowed it down to xhprof! It occurred to me to try disabling the bootstrapping function to see what might be causing memory to grow outside the loop, which led me to the config of the instance I'm testing on. The leak appears to be associated with xhprof. When XHProf sampling is disabled, memory usage doesn't grow considerably, if at all. I've only monitored it for the last few minutes, so I'll keep my eye on it to see if it gets out of hand, but that appears to be the issue and not a leak with Cavalcade itself 😌

nathanielks commented 6 years ago

I tested out if we could get xhprof sampling to work with Cavalcade. I tested it by disabling sampling in wp-config.php, enabling it at the beginning of the while loop and disabling whenever the loop goes to move on to the next iteration. Memory did not grow using this method 🙌 🎉

This does look like we would need to add some more hooks to Cavalcade so we could trigger these actions. Would you be in support of that @rmccue?

rmccue commented 6 years ago

If you turn off XHProf and make no other changes to Cavalcade, does it still have a memory leak?

If no, let’s close this issue, as it’s a memory leak in XHProf, so not relevant to this repo.

(I’d recommend we disable XHProf entirely for Cavalcade Runner, as we don’t need to profile it. We also need to fix the memory leak in the extension.)

nathanielks commented 6 years ago

Correct: with XHProf turned off, there is no memory leak. Closing!