soulseekah / wptop

Performance statistics for your PHP code in production for WordPress.
60 stars 1 forks source link

Honour disabling yourself when disabled in the Wordpress dashboard #2

Open leehambley opened 9 years ago

leehambley commented 9 years ago

I recently had to fix a customer machine where they'd been running all their web requests with full profiling enabled (wptop_enable()) . That's really more their problem from oversight in development/ops than it is a problem of the plugins.

Perhaps you could raise a fatal error if profiling is started, but the plugin is disabled, I'm not sure.

I'm actually not a Wordpress or PHP developer, just a guy who got lucky solving someone's server hosting hell for them.

soulseekah commented 9 years ago

So they had deactivated the plugin but left wptop_enable() on? I think adding the http://codex.wordpress.org/Function_Reference/is_plugin_active check and suppress enable/disable would work in honour of disabling the plugin.

leehambley commented 9 years ago

That's exactly what they'd done, so it was causing pretty hellish problems. They were surprised that it was still running "but we disabled that weeks ago!?", and told me categorically that they had no profiling plugins running.

The smoking gun was that I couldn't actually strace the processes, at first because of the dynamic pooling … then with a fixed php-fpm pool it was easy to see they were spending the majority of their time in getrusage, I couldn't see the sample frequency in the code but more than 99% of the samples of the process according to strace were in getrusage. I spent some time looking around for why, and not being a PHP guy, concluded that it must be something about PHP-FPM master/slave resource management… since… as they'd said, they weren't profiling anything (also google searches for php getrusage lead to nothing but something about tuning a PHP system that talked to an Oracale database, which was spitting out too much getrusage).

Long story short, no data was being logged, and we found wptop_enable() in wp-config.php. I'm mostly writing the issue so that the next poor soul who has to debug it might stumble upon this.

soulseekah commented 9 years ago

Thanks. Acknowledged and will fix later today. How detrimental would you say the plugin was to application performance?

leehambley commented 9 years ago

Setup was 7 servers, 5 responsible for web. 8 cores (3.+ghz 64bit intel), 12Gb of RAM and all SSDs… the following script was enoguh to cripple the site with the plugin enabled:

for i in $(seq 1 15); do 
    curl --with-fair --options-for-keep-alive -H 'cookie: wp-login-cookie' http://their.site/wp-admin/ & 
done

The result was a near instant failure of PHP-FPM, since when being traced the worker processes don't seem to respond to signals.

A single request would respond fairly quickly, 45-50sec, with the plugin disabled, we're closer to 3-4 sec response times now which is as good as we can get (they have ~40-50 WP plugins, including some large ones line buddy press and LMS, which we are still trying to tune).

With wptop really disabled (no call to wptop_enable()) I basically can't kill the workers anymore, although they do seem to leak, that's solved with pm.max_requests, and works relatively well.

I'll point out whilst I'm here too that although they were using SSDs, they had a bind-mount of a remote NFS over the webroot, which causes things that should be fast like lstat(...) to be way slower than they should be, which also compounded the problem, since with remote locks on the NFS drive, and all syscalls taking fractionally longer because of the trace, the lock contention was much more of a problem than it ought to have been,

soulseekah commented 9 years ago

Interesting. Was /tmp/ mounted over NFS as well?

leehambley commented 9 years ago

/tmp wasn't being used, in the nginx config edit: it was in the php-fpm pool config they were exporting some variables to PHP-FPM which I think effectively had the effect of using a different tmp, which was set to <webroot>/tmp (so, yes.)

soulseekah commented 9 years ago

Oh, just realized, that wouldn't matter; all profiles are stored in the database, not the flat filesystem.

Also looking at the code I realize that since wptop_enable is designed to work without a WordPress context at all, loaded as early as possible, there's no way to identify whether the WordPress plugin is active (this check needs a database connection).

The result was a near instant failure of PHP-FPM, since when being traced the worker processes don't seem to respond to signals.

So what sort of failure? Did they freeze? Segfault? And what sort of signals were being sent.

It's quite surprising to see XHProf, a production-target profiler, fail so miserably.

leehambley commented 9 years ago

So what sort of failure? Did they freeze? Segfault? And what sort of signals were being sent.

Hard to say, they locked up, essentially, but without ever dying as far as the system is concerned. If they had entered uninterruptible IO (process state "D"), and stayed there, the system would have signalled them eventually, or PHP-FPMs timeouts would have done... (SIGTERM, or something I'd have guessed)

It seems however that because they never enter state D because the profiler keeps waking them up (also via signals, I think), then they are always state "R" (running). Because they were still "Running" things like iotop and the munin monitoring them shows them essentially as doing 0% IO, since the processes are never truly asleep waiting for IO to finish.

The interesting thing is that it was like a live-lock, the CPU and RAM were essentially idle and empty, and it was impossible to strace or measure the processes because of the xhprof getrusage calls spamming all the reports.

Also processes that never sleep don't always respond well to signals, as signals are buffered in the kernel, so if our processes were slowing down because of too many profiling signals the might be failing to drain the queue fast enough to respond to signal handlers for hard-timeouts, and other PHP-FPM pool management signals

leehambley commented 9 years ago

Also looking at the code I realize that since wptop_enable is designed to work without a WordPress context at all, loaded as early as possible, there's no way to identify whether the WordPress plugin is active (this check needs a database connection).

Indeed, I noticed this too, and of course that makes sense, since loading Wordpress itself has a cost, and that cost is an important part of deciding how to configure something like an object cache or APC.


In your shoes I might be tempted to set a variable (global? eek.) somewhere in wptop_enable(), and raise a fatal warning if then it turns out that when the wp context is loaded, that the plugin is disabled. At least to warn people that they need to do more to disable it.

Or, perhaps it's enoguh to fix the documentation in the wordpres admin UI that the plugin is only the report viewer, but that profiling will happen regardless as long as the "# Installation" guidelines were followed, as that exists before the WP context is up.

soulseekah commented 9 years ago

I'll think about adding a message upon plugin deactivation to remind people to remove the enable scaffolding. wptop_enable can't hook into the WordPress context at all, and since the plugin code is disabled and not running there's no way to run custom code in WordPress related to wptop, maybe during a shutdown function; like if during shutdown it's found that wptop as a plugin was never run emit a warning into logs.

I wonder whether the problem lies in not running xhprof_disable. Or were the same issues observed with the plugin active?

leehambley commented 9 years ago

We didn't enable the plugin again, but that's a valid point, if it's never stopped, where does the trace log go, if anywhere?

soulseekah commented 9 years ago

I'm guessing traces should be kept in memory of XHProf internally (I'd have to look at their source code later today), but who knows what happens to them if they're not stopped/dumped, especially in a static worker setup, maybe it keeps accumulating allocating heap for each request and never freeing it up. Will know more in a couple of hours after having a chance to look at XHProf closely.

leehambley commented 9 years ago

Cool: note the static change I made was mostly so that I could strace a known PID with some degree of certainty that it'd stick around. They were using a dynamic pool, with timeouts of 360s (:tears:)

leehambley commented 9 years ago

Perhaps a saner installation option might be to recommend that profiling runs on ~1% of requets…

<?php if (rand(0, 100) == 42) { wptop_enable(); } ?>