tideways / php-xhprof-extension

This XHProf PHP extension fork has outlived its purpose and is archived in favor of the revitalized https://github.com/longxinH/xhprof
https://tideways.com
Apache License 2.0
1.64k stars 209 forks source link

Segfault on random requests #89

Open liwo opened 5 years ago

liwo commented 5 years ago

We have tried to use tideways xhprof on one of our servers, which for reasons outside our immediate control still runs on PHP 7.1.

The extension runs, but on every third request (as far as we could determine) we get a segfault. PHP is setup with FPM, and as far as I can see we had two children running. So every third request could well be the second request to a given FPM child. Apparently it doesn't matter which script we are calling, the same script can run fine and produce a profile and then segfault the next time. We tried with version 4.1.7 (from ubuntu PPA of Ondrej Sury) and 5.0.1 downloaded from github release page. Both result in a segfault, although with different stack traces.

Stack trace from 4.1.7:

#0  0x0000563291c436c8 in zend_hash_destroy ()
#1  0x00007ff33bf15f54 in ?? () from /usr/lib/php/20160303/tideways.so
#2  0x00007ff33bf1cad2 in zif_tideways_enable () from /usr/lib/php/20160303/tideways.so
#3  0x0000563291cd51ba in ?? ()
#4  0x0000563291c7c11b in execute_ex ()
#5  0x0000563291cd73a0 in zend_execute ()
#6  0x0000563291c32313 in zend_execute_scripts ()
#7  0x0000563291bce310 in php_execute_script ()
#8  0x0000563291aac189 in main ()

Stack trace from 5.0.1:

#0  0x00007f4c0290c3cb in tideways_xhprof_execute_ex () from /usr/lib/tideways_xhprof/tideways_xhprof-7.1.so
#1  0x000055bc574e4ecb in zend_call_function ()
#2  0x000055bc575138a9 in zend_call_method ()
#3  0x000055bc573f010e in zif_spl_autoload_call ()
#4  0x00007f4c0290d166 in tideways_xhprof_execute_internal () from /usr/lib/tideways_xhprof/tideways_xhprof-7.1.so
#5  0x000055bc574e50d3 in zend_call_function ()
#6  0x000055bc574e5471 in zend_lookup_class_ex ()
#7  0x000055bc574e5e28 in zend_fetch_class_by_name ()
#8  0x000055bc57543d09 in ?? ()
#9  0x000055bc5753f11b in execute_ex ()
#10 0x00007f4c0290c287 in tideways_xhprof_execute_ex () from /usr/lib/tideways_xhprof/tideways_xhprof-7.1.so
#11 0x000055bc575947dc in ?? ()
#12 0x000055bc5753f11b in execute_ex ()
#13 0x00007f4c0290c402 in tideways_xhprof_execute_ex () from /usr/lib/tideways_xhprof/tideways_xhprof-7.1.so
#14 0x000055bc5759821a in ?? ()
#15 0x000055bc5753f11b in execute_ex ()
#16 0x00007f4c0290c402 in tideways_xhprof_execute_ex () from /usr/lib/tideways_xhprof/tideways_xhprof-7.1.so
#17 0x000055bc57597dae in ?? ()
#18 0x000055bc5753f11b in execute_ex ()
#19 0x00007f4c0290c287 in tideways_xhprof_execute_ex () from /usr/lib/tideways_xhprof/tideways_xhprof-7.1.so
#20 0x000055bc57594eac in ?? ()
#21 0x000055bc5753f11b in execute_ex ()
#22 0x00007f4c0290c287 in tideways_xhprof_execute_ex () from /usr/lib/tideways_xhprof/tideways_xhprof-7.1.so
#23 0x000055bc57594eac in ?? ()
#24 0x000055bc5753f11b in execute_ex ()
#25 0x00007f4c0290c287 in tideways_xhprof_execute_ex () from /usr/lib/tideways_xhprof/tideways_xhprof-7.1.so
#26 0x000055bc5759a3a0 in zend_execute ()
#27 0x000055bc574f5313 in zend_execute_scripts ()
#28 0x000055bc57491310 in php_execute_script ()
#29 0x000055bc5736f189 in main ()

As the stack traces are quite different I am not entirely sure it is the same issue or not, but from the outside view it shows the exact same symptoms.

My knowledge of gdb is pretty much exhausted at this point, so if we can provide more useful details, please advice us. I will leave the relevant stack traces lying around for the time being.

Danack commented 5 years ago

Hi,

Thanks for the report.

We might need to have some more information to figure out what is going on. Is there any chance of you sharing a reproducible case in a docker image?

If not, please could you give us some more information:

Thanks Dan

liwo commented 5 years ago

Hi Dan,

thanks for your time.

Creating a docker image will be hard, as this is a server with several different PHP applications running on it.

The application we are trying to profile is a shopware instance.

The list of extensions:

[PHP Modules]
apcu
bcmath
bz2
calendar
Core
ctype
curl
date
dom
exif
fileinfo
filter
ftp
gd
gettext
gmp
hash
iconv
imagick
imap
intl
json
libxml
mbstring
mcrypt
memcache
mysqli
mysqlnd
openssl
pcntl
pcre
PDO
pdo_mysql
pdo_sqlite
Phar
posix
pspell
readline
recode
Reflection
session
shmop
SimpleXML
sockets
SPL
sqlite3
standard
sysvmsg
sysvsem
sysvshm
tidy
tokenizer
wddx
xml
xmlreader
xmlrpc
xmlwriter
xsl
Zend OPcache
zip
zlib

[Zend Modules]
Zend OPcache

xdebug is not installed on the server. We can try to remove opcache tomorrow and see what happens.

Again, thanks for your time.

Best, Lienhart

Danack commented 5 years ago

Actually, another thought - even though you might not be able to upgrade to the latest version of PHP right now, please could you try reproducing the issue against PHP 7.3?

There were a number of segfaults that have been fixed, that may be lurking, but only showing when the xhprof extension is also being run.

Danack commented 5 years ago

If disabling opcache does make the segfault stop happening, you could then look at turning off the particular optimisation that is likely causing the issue. If you have the default setting, you probably have the opcache optimization level set as:

opcache.optimization_level=0x7FFFBFFF

Looking at the code, turning off the final bit, would disable an optimisation that has caused a similar issue https://bugs.php.net/bug.php?id=73654

opcache.optimization_level=0x7FFFBFFE

Or you could turn more of the optimisations off and try to find the particular setting that is causing the issue.

opcache.optimization_level=0x7FFFBF00
liwo commented 5 years ago

Hi Dan,

thanks for your time!

We disabled opcache completely, unfortunately the segfaults continued. So it seems like opcache is not the (only) problem here. Any other thoughts?

Danack commented 5 years ago

Any other thoughts?

It would definitely be worth trying against PHP 7.3. Even if your application might not fully work against that version of PHP, as there were quite a few known issues in 7.1 (which means there are likely still some unknown issues) similar to this issue, figuring out if it's still happening in 7.3 or not would be useful information. My suspicion is that it would still be present in 7.3.

Other than that, someone will need to do some debugging with GDB to figure out what variable is being corrupted that is causing the segfault. If you focus on the 5.0.1 version please.

I've put some notes below of how to do that.

Looking at the stack trace, it seems that the problem is occurring when loading a class, through a userland registered autoloader,

#0  0x00007f4c0290c3cb in tideways_xhprof_execute_ex () from /usr/lib/tideways_xhprof/tideways_xhprof-7.1.so
#1  0x000055bc574e4ecb in zend_call_function ()
#2  0x000055bc575138a9 in zend_call_method ()
#3  0x000055bc573f010e in zif_spl_autoload_call ()
#4  0x00007f4c0290d166 in tideways_xhprof_execute_internal () from /usr/lib/tideways_xhprof/tideways_xhprof-7.1.so
#5  0x000055bc574e50d3 in zend_call_function ()
#6  0x000055bc574e5471 in zend_lookup_class_ex ()
#7  0x000055bc574e5e28 in zend_fetch_class_by_name

Which is weird, on multiple levels. First, this code is used everywhere in every PHP server....so any bugs in it are normally found very quickly. Second, most of the time, the code isn't doing anything interesting....so there's very little scope for edge-cases.

I think if you follow the notes below, you should be able to find either the exact cause of the bug, some decent clues as to what the problem is, or possibly if you look at the data in the frame for zend_fetch_class_by_name, you'll be able to see what class is being loaded, and look in your app to see if there's anything out of the ordinary for loading that code.


Instructions for GDB to investigate what is causing the problem.

The instructions below won't completely identify what the cause of that corruption is, but should provide more information which will be the base of further investigation.

  1. Install GDB onto a machine that is running PHP-FPM where the issue can be reproduced.

  2. Change the number of PHP-FPM processes to 1.

This step is optional, but it should make life easier as it means that PHP-FPM will only have 1 process running that is processing requests, which is easier than trying to pick out the right process.

pm = static
pm.max_children = 1
pm.start_servers = 1
pm.min_spare_servers = 1
pm.max_spare_servers = 1
  1. Identify the PID of the PHP process that will be processing requests.

Using something like top or ps auwx | grep php-fpm | grep -v grep

For me, that gives the output of:

root     16580  0.0  1.3 215272 26664 pts/0    S+   10:12   0:00 php-fpm: master process (/var/app/docker/php_fpm/config/fpm.conf)
www-data 16581  0.0  0.3 215272  6588 pts/0    S+   10:12   0:00 php-fpm: pool www

The 'master process' is the process that spawns the worker processes. The 'pool www' process is the one that we want the PID for, so in this case it is 16581

  1. Launch GDB attaching it to the process
gdb -p 16581

If you see an error about ptrace: Operation not permitted. and you're running the code inside a container, please see note 0 below.

  1. Resume execution of the PHP process.

When GDB attaches to a process, it stops execution of the program, so that you can do things like setup breakpoints. To make PHP run again, so that we can trigger the breakpoint, we need to tell it to continue.

So run the command:

continue
  1. Do whatever you're doing to trigger the segfault.

That should be picked up by GDB which will be back in interactive mode, and you'll be able to use GDB to look at the data.

  1. Setup to inspect the data.

Assuming that you can set all the above up and it doesn't magically make the bug disappear, you should be left with the debugger stopped in a place where we can get useful information.

For all of the remaining steps that display values, please can you save them to a file, so that they can be shared with Tideways.

First import some GDB scripts, which will allow nicer formatting of PHP values that are held in memory. This can be done by 'sourcing' the gdb scripts file inside GDB.

The file for that can be downloaded from https://github.com/php/php-src/blob/PHP-7.1.30/.gdbinit

source /var/app/.gdbinit

Heads up

After this point you'll need to figure out what steps to do from looking at the output.

I have a strong suspicion still that this is going to be a bug in PHP which is only showing up due to changes in the memory layout caused by loading the Tideways extension, rather than it being a bug in the extension itself.

  1. Actually printing the useful info.

So for the segfault in 5.0.1, the function that it is occuring in is tideways_xhprof_execute_ex

That has a single parameteter of zend_execute_data *execute_data, so probably something in that data is borked.

You should be able to see exactly which line the problem is occurring on, and so may be able to figure out what part of that data structure is apparently invalid.

print execute_data

And then to print just the individual parts of the data structure would be for example:

print execute_data->func->function_name

function_name is a pointer to a zend_string, and this command would print the pointer as a pointer.

print_zstr execute_data->func->function_name

Would print the function name as a string..

The command 'print_zstr' is define in the .gdbinit file, so will only be available if that has been sourced into GDB.

  1. Switching frames.

The error is possibly originating in one of the previous function calls, and only showing as a problem in the final function.

The command to show the list of frames is:

bt

To switch to a particular frame, you can run the GDB command f $frame_number. e.g. to switch to the 2nd frame in the stack:

f 2

You can also get some more information about the frames.

f

Prints a brief description of the currently selected stack frame.

info f

This command prints a verbose description of the selected stack frame.

Next steps

We'll need to look at the output from that debugging to figure out the next step, as how to proceed will depend on what we see.

The data corruption is in the same memory location each time.

This is the best case scenario. Setting a breakpoint on the exact memory location that is being changed after starting GDB but before triggering the error with:

watch *0x5620b2bb832d

Should allow us to find what is causing the corruption by running bt when that breakpoint fires.

The data corruption is obviously a string being written to the wrong place.

There's a small chance the data that is being written to the wrong place will be a string, that contains enough information by itself to work back to where it's being written from

You can use the following commands to inspect the memory Display the 32bytes of memory at the address as hex values

x/32xb 0x5620b2bb832d

Display the 32bytes of memory at the address as a string

x/32sb 0x5620b2bb832d

WARNING this has the capability to include private data/api keys. You should check it's nothing private before sharing it with us.

Notes

  1. Docker permissions.

(This probably also applies toother container systems, not just Docker.)

By default, Docker does not give permissions to run certain operations inside a container. For Docker-compose this can be added in the docker-composer.yml file, by adding:

    cap_add:
      - SYS_ADMIN
      - SYS_PTRACE
    security_opt:
      - seccomp:unconfined

to the service entries that are running the process you wish to debug.

beberlei commented 5 years ago

@liwo Can we help you out in any way to find the cause for this segfault?

liwo commented 5 years ago

Hi Ben,

sorry for the late reply. I was a bit caught up in other stuff the past couple days and my colleague managed to get enough profiling data to work with (as only every 3rd request died it was possible to profile, but not convenient ;-)), so I didn't have pressure. I'm planning to debug the issue with the info provided by Dan next week.

liwo commented 5 years ago

Hi Dan, Ben,

it took me some time and I had trouble reproducing the issue today on the same server. I only got error messages Call to undefined function tideways_xhprof_enable() on the second or third request. But that got us quite a bit further.

In the end, I managed to reliably reproduce a completely weird behavior in a small docker environment: https://gist.github.com/liwo/807106b9577087380152bd8bf8bd3012.

Run it with docker-compose up and open http://localhost:8080/info.php

I experience the following behavior:

Now replace info.php with the following contents:

<?php
if (extension_loaded("tideways")) {
    tideways_enable(TIDEWAYS_FLAGS_CPU | TIDEWAYS_FLAGS_MEMORY);
} elseif (extension_loaded("tideways_xhprof")) {
    tideways_xhprof_enable(TIDEWAYS_XHPROF_FLAGS_CPU | TIDEWAYS_XHPROF_FLAGS_MEMORY);
}

phpinfo();

Shouldn't make a difference in behavior, it just checks which version (4 or 5) of tideways is installed and uses the appropriate function to enable xhprof. Well, no. Instead of segfaulting, this time the tideways_xhprof_enable function disappears after the second request:

Fatal error: Uncaught Error: Call to undefined function tideways_xhprof_enable() in /var/www/html/info.php:5 Stack trace: #0 {main} thrown in /var/www/html/info.php on line 5

I have no idea why this small code change prevents the segfault.

I also tried to use 7.3 in the Dockerfile (see commented lines), but then PHP-FPM doesn't even start, it segfaults right away.

As it's getting late I'm not going to start gdb in the docker container and try my luck. Instead I hope this helps you to reproduce the segfault.

Thanks again for your time!

beberlei commented 5 years ago

@liwo thank you for this, we were able to see the crash as well. This is extremely helpful. We are getting back to you as soon as we know what the culprit is.