php / php-src

The PHP Interpreter
https://www.php.net
Other
37.93k stars 7.72k forks source link

OPCache File Inclusion and Function Call Errors: Letter Shifting Issue During High Load on PHP 8.3.10 #15539

Closed sebastianthulin closed 2 weeks ago

sebastianthulin commented 3 weeks ago

Hi,

I've been encountering a persistent and puzzling issue related to OPCache on a production server running PHP 8.3.10 (though the issue has also occurred on PHP 8.2.x). I'm hoping to get some insights or suggestions on why this error occurs.

Environment:

PHP Version: 8.3.10 (Issue also present in 8.2.x, a PHP update was made in order to mitigate the error) OPCache Version: 8.3.10 Web Server: LiteSpeed (with lsphp) Caching: OPCache enabled Traffic: High-traffic site Memory: 120GB RAM (utilize max 15GB of these) Disk: 1.5 TB (utilize 49GB) Hosting: Openstack Compute OS: Ubuntu 22.04.3 LTS (Jammy Jellyfish) Number of vhosts: 2 (two nearly identical codebases are deployed in /var/www/prod/ and /var/www/stage/) Deployed solution: WordPress (highly adapted to usecase)

Problem Description:

About once a day, we encounter file inclusion errors where filenames and function names appear to be "shifted" by one letter, leading to file not found or function does not exist errors. Examples include:

Problem Pattern:

We have identified a very clear pattern here. It is on any file include or function call, and are always one single letter that has been shifted to the left in the alphabet if an "b" in a string is replaced, it will be the letter "a" that replaces the correct char.

These issues seem to be tied to periods of high traffic when many uncached requests hit the server simultaneously (normally cached in pagecache). The problem resolves temporarily when PHP processes are restarted, but it eventually recurs. In som cases the issue may occur in 10 minutes after a reset, and in some cases up to 24 hours.

We have not concluded if this also affects variable names as this won't neccesarily result in a fatal error (in most cases we would se a warning about undefined var).

Actions taken so far:

I have also checked the server for any other active php versions, that may run and affect the cache and concluded that they are removed. We are only running lsphp 8.3.10.

Responses in this thread may be slow, due to the fact that the error is highly intermittent, please allow some time for responses due to that changes needs at least 48 hours to be tested.

Configuration line:

'./configure' '--prefix=/usr/local/lsws/lsphp83' '--enable-cgi' '--enable-cli' '--enable-phpdbg=no' '--enable-litespeed' '--with-config-file-path=/usr/local/lsws/lsphp83/etc/php/8.3/litespeed/' '--with-config-file-scan-dir=/usr/local/lsws/lsphp83/etc/php/8.3/mods-available/' '--build=x86_64-linux-gnu' '--host=x86_64-linux-gnu' '--libdir=${prefix}/lib/php' '--libexecdir=${prefix}/lib/php' '--datadir=${prefix}/share/php/8.3' '--program-suffix=8.3' '--disable-all' '--disable-debug' '--disable-rpath' '--disable-static' '--with-pic' '--with-layout=GNU' '--without-pear' '--enable-bcmath' '--enable-calendar' '--enable-ctype' '--enable-dom' '--with-enchant' '--enable-exif' '--with-gettext' '--with-gmp' '--enable-fileinfo' '--enable-filter' '--enable-ftp' '--enable-hash' '--with-iconv' '--enable-mbregex' '--enable-mbregex-backtrack' '--enable-mbstring' '--enable-phar' '--enable-posix' '--enable-mysqlnd-compression-support' '--with-zlib' '--with-openssl=yes' '--with-libedit' '--with-libxml' '--with-bz2' '--enable-session' '--enable-simplexml' '--enable-soap' '--enable-sockets' '--enable-tokenizer' '--enable-xml' '--enable-xmlreader' '--enable-xmlwriter' '--with-xmlrpc' '--with-xsl' '--with-mhash=yes' '--enable-sysvmsg' '--enable-sysvsem' '--enable-sysvshm' '--with-zip' '--with-sodium' '--with-password-argon2' '--with-system-tzdata' '--enable-gd' '--enable-gd-native-ttf' '--with-jpeg' '--with-xpm' '--with-png' '--with-webp' '--with-freetype' '--with-vpx-dir' '--with-mysql-sock=/var/run/mysqld/mysqld.sock' '--disable-dtrace' '--enable-pdo' '--enable-mysqlnd' '--enable-pcntl' '--with-curl=shared,/usr' '--with-imap=shared,/usr' '--with-kerberos' '--with-imap-ssl=yes' '--enable-intl=shared' '--with-ldap=shared,/usr' '--with-ldap-sasl=/usr' '--with-mysqli=shared,mysqlnd' '--with-pdo-mysql=shared,mysqlnd' '--with-pgsql=shared,/usr' '--with-pdo-pgsql=shared,/usr' '--with-pspell=shared,/usr' '--with-snmp=shared,/usr' '--with-sqlite3=shared,/usr' '--with-pdo-sqlite=shared,/usr' '--with-pdo-dblib=shared,/usr' '--with-tidy=shared,/usr' '--enable-opcache' '--enable-opcache-file' '--enable-huge-code-pages' 'build_alias=x86_64-linux-gnu' 'host_alias=x86_64-linux-gnu' 'CFLAGS=-g -O2 -ffile-prefix-map=/build/php8.3-8.3.10=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -O2 -Wall -fsigned-char -fno-strict-aliasing -fno-lto -g'

OPCache Directives: opcache.enable = 1 opcache.enable_cli = opcache.use_cwd = 1 opcache.validate_timestamps = 1 opcache.validate_permission = opcache.validate_root = 1 opcache.dups_fix = opcache.revalidate_path = opcache.log_verbosity_level = 1 opcache.memory_consumption = 2147483648 opcache.interned_strings_buffer = 256 opcache.max_accelerated_files = 30000 opcache.max_wasted_percentage = 0.05 opcache.force_restart_timeout = 180 opcache.revalidate_freq = 10 opcache.preferred_memory_model = opcache.blacklist_filename = opcache.max_file_size = 0 opcache.error_log = opcache.protect_memory = opcache.save_comments = 1 opcache.record_warnings = opcache.enable_file_override = opcache.optimization_level = 2147401727 opcache.lockfile_path = /tmp opcache.file_cache = opcache.file_cache_only = opcache.file_cache_consistency_checks = 1 opcache.file_update_protection = 2 opcache.opt_debug_level = 0 opcache.restrict_api = opcache.huge_code_pages = opcache.preload = opcache.preload_user = opcache.jit = tracing opcache.jit_buffer_size = 1073741824 opcache.jit_debug = 0 opcache.jit_bisect_limit = 0 opcache.jit_blacklist_root_trace = 16 opcache.jit_blacklist_side_trace = 8 opcache.jit_hot_func = 127 opcache.jit_hot_loop = 64 opcache.jit_hot_return = 8 opcache.jit_hot_side_exit = 8 opcache.jit_max_exit_counters = 8192 opcache.jit_max_loop_unrolls = 8 opcache.jit_max_polymorphic_calls = 2 opcache.jit_max_recursive_calls = 2 opcache.jit_max_recursive_returns = 2 opcache.jit_max_root_traces = 1024 opcache.jit_max_side_traces = 128 opcache.jit_prof_threshold = 0 opcache.jit_max_trace_length = 1024

Interned strings buffer:

buffer_size = 268435456 used_memory = 83099352 free_memory = 185336104 number_of_strings = 229401

Opcache Status (newly restarted):

num_cached_scripts = 6295 num_cached_keys = 8895 max_cached_keys = 32531 hits = 28237442 start_time = 1724313429 last_restart_time = 0 oom_restarts = 0 hash_restarts = 0 manual_restarts = 0 misses = 8602 blacklist_misses = 0 blacklist_miss_ratio = 0 opcache_hit_rate = 99.969546177865

Resources found on similar issues:

PHP Version

8.3.10

Operating System

Ubuntu 22.04.3 LTS (Jammy Jellyfish)

iluuu1994 commented 3 weeks ago

Hi @sebastianthulin. Thank you for the description.

Under actions taken you say "Turned off JIT opcache.jit". Did this have any effect?

As you may guess, it's very hard to locate such issues without a reproducer. There's likely some memory corruption going on. I'm guessing it's not possible to test this on some other SAPI? Do you have any other non-bundled extensions enabled?

cmb69 commented 3 weeks ago

Also try enabling opcache.protect_memory (this should avoid the corruptions, but likely crashes the process instead).

sebastianthulin commented 3 weeks ago

Hi @sebastianthulin. Thank you for the description.

Under actions taken you say "Turned off JIT opcache.jit". Did this have any effect?

As you may guess, it's very hard to locate such issues without a reproducer. There's likely some memory corruption going on. I'm guessing it's not possible to test this on some other SAPI? Do you have any other non-bundled extensions enabled?

Hi,

Thank you for your response.

Unfortunately, none of the actions we've taken have had any effect—the issue persists and exhibits the same behavior across all configurations we've tested. We are also suspecting some form of memory corruption, which seems plausible in this case. While switching to another SAPI might be challenging, I'll look into it (though I know that options like PHP-FPM, for instance, are not viable).

Here’s the output from php -m, showing the currently enabled extensions (with no additional non-bundled extensions):


[PHP Modules]
bcmath
bz2
calendar
Core
ctype
curl
date
dom
enchant
exif
fileinfo
filter
ftp
gd
gettext
gmp
hash
iconv
igbinary
imagick
json
libxml
mbstring
mysqli
mysqlnd
newrelic
openssl
pcntl
pcre
PDO
pdo_mysql
pdo_sqlite
Phar
posix
random
readline
redis
Reflection
session
SimpleXML
soap
sockets
sodium
SPL
sqlite3
standard
sysvmsg
sysvsem
sysvshm
tokenizer
xml
xmlreader
xmlwriter
xsl
Zend OPcache
zip
zlib

[Zend Modules]
Zend OPcache
iluuu1994 commented 3 weeks ago

@sebastianthulin Thank you for your quick response! You do have a few 3rd party extensions enabled. I reckon not all of them can be disabled, maybe some can? E.g. newrelic. Of course, it's very much possible the issue is on our side but it's helpful to narrow down the problem.

And as Christoph mentioned, opcache.protect_memory=1 might indicate whether this could be a concurrency issue. (Or at least whether the corruption happens in request or shared memory.)

sebastianthulin commented 3 weeks ago

Also try enabling opcache.protect_memory (this should avoid the corruptions, but likely crashes the process instead).

I will definitely give that a try. Although I’ve read the documentation, I’ve avoided using it in a production environment because it’s mentioned as “useful for internal debugging only.”

Could you clarify the impact on our users if we enable this? Besides potential process crashes, would it noticeably affect performance?

iluuu1994 commented 3 weeks ago

This setting sets a readonly flag on the shared memory area using mprotect. When persisting a script, the memory needs to be unprotected, and protected again. This might slow down the initial compilation, but I don't think it should have a big impact after that. The idea is recognizing when the process accidentally modifies any of this memory. Given it is read by multiple processes at once, it's not supposed to be modified during requests. It's only meant to be appended to.

sebastianthulin commented 3 weeks ago

@sebastianthulin Thank you for your quick response! You do have a few 3rd party extensions enabled. I reckon not all of them can be disabled, maybe some can? E.g. newrelic. Of course, it's very much possible the issue is on our side but it's helpful to narrow down the problem.

And as Christoph mentioned, opcache.protect_memory=1 might indicate whether this could be a concurrency issue.

I understand. We’ll work on disabling as many as possible. I suspect we added New Relic after the errors began to better analyze the issue, and it has been invaluable in gathering structured data. I’ll look into it further and see if we have any records of the errors before New Relic was installed.

sebastianthulin commented 3 weeks ago

This setting sets a readonly flag on the shared memory area using mprotect. When persisting a script, the memory needs to be unprotected, and protected again. This might slow down the initial compilation, but I don't think it should have a big impact after that.

Thank you for the clarification!

sebastianthulin commented 3 weeks ago

Unfortunately, enabling opcache.protect_memory=1 significantly increased processing time. It jumped from 600ms to around 10 seconds.

Would disabling file revalidation or similar features improve performance enough to make this option feasible?

sebastianthulin commented 3 weeks ago

This setting sets a readonly flag on the shared memory area using mprotect. When persisting a script, the memory needs to be unprotected, and protected again. This might slow down the initial compilation, but I don't think it should have a big impact after that. The idea is recognizing when the process accidentally modifies any of this memory. Given it is read by multiple processes at once, it's not supposed to be modified during requests. It's only meant to be appended to.

Based on this reasoning, could reducing the number of PHP processes help minimize the issue? Currently, we allow a high number of simultaneous processes due to our ample computing resources. Lowering this number might validate the thesis to some extent.

iluuu1994 commented 3 weeks ago

Potentially. Allowing many parallel requests can cause cache stampede. How is this value configured for you? It should approximate the number of cores available, with a bit extra. But it can depend on the type of application.

sebastianthulin commented 3 weeks ago

Potentially. Allowing many parallel requests can cause cache stampede. How is this value configured for you? It should approximate the number of cores available, with a bit extra. But it can depend on the type of application.

This is clearly misconfigured then. We allow 32, but only have 8 cores. I will make an adjustment on this value. I do not think that it will spawn more than a 10 a normal day.

sebastianthulin commented 3 weeks ago

@sebastianthulin Thank you for your quick response! You do have a few 3rd party extensions enabled. I reckon not all of them can be disabled, maybe some can? E.g. newrelic. Of course, it's very much possible the issue is on our side but it's helpful to narrow down the problem.

And as Christoph mentioned, opcache.protect_memory=1 might indicate whether this could be a concurrency issue. (Or at least whether the corruption happens in request or shared memory.)

Following up on the NewRelic idea: I've found several articles on their forums discussing "50x" errors related to OpCache, which suggests there might be a correlation. I plan to investigate this further tomorrow, as I'm in the GMT+2 timezone.

iluuu1994 commented 3 weeks ago

It's not always wrong to set it to a higher value, if your application is IO heavy. But it can lead to more context switches, and higher cache stampede. It's best to test what is most performant.

sebastianthulin commented 2 weeks ago

Update on the Situation: We've reduced the number of allowed workers, which should theoretically mitigate the issue of simultaneous cache access. Unfortunately, the update to the New Relic extension hasn't resolved the problem, as errors have continued to occur since the update. We're currently assessing the situation and planning our next steps.

sebastianthulin commented 2 weeks ago

Our plan moving forward is to completely disable New Relic (we are currently testing this in production). Are there any other extensions that have been known to cause similar issues in the past? Since it’s not feasible to disable all of them, an ordered list of known 'offenders' would be helpful.

We also tested flushing the OPcache using the opcache_reset() function instead of restarting PHP, and it produced good results.

iluuu1994 commented 2 weeks ago

Thank you for your continued efforts. I'm not familiar enough with the 3rd party extensions enabled in your project to tell whether they can cause this issue. Of course, it's possible that no extension is to blame at all, and this is entirely caused by core, but without a hunch, eliminating unnecessary extensions is a good first step.

The other 3rd party extensions I can see are igbinary, imagick and redis. I'm not sure if it's possible to remove any of these.

How many requests do you handle per day? Is this issue reproducible by simulating these requests in a development environment?

sebastianthulin commented 2 weeks ago

Hi,

We handle approximately 300 requests per minute (counting those that reach PHP; many more are served from a static cache). In my opinion, this issue could be reproducible under certain conditions. Yesterday, we removed the New Relic extension, and since then, we haven’t experienced any issues with broken OPCache. Given New Relic’s history with this issue, I’m going to consider this problem resolved on our end and will instead open a ticket with them.

Thank you for your time and assistance—this discussion has been very helpful!

opcache #newrelic #offbyone #missspelledfunctions #missspelledincludes

iluuu1994 commented 2 weeks ago

Great, thanks for the feedback! Feel free to reopen if the issue returns.