php / php-src

The PHP Interpreter
https://www.php.net
Other
37.93k stars 7.72k forks source link

php_opcache crash - memory access violation #7915

Open jjdunn opened 2 years ago

jjdunn commented 2 years ago

Description

I don't know how to recreate the problem. it occurs on our production server, sporadically. 6x in past 4 days. never happened on old server (Windows 2012 R2) running same versions of PHP (8.0.14 x64 thread-safe) and apache httpd (2.4.51 x64), and the same opcache configuration. We are not using JIT.

Windows Event Log shows:

Faulting application name: php.exe, version: 8.0.14.0, time stamp: 0x61bb14bc Faulting module name: php_opcache.dll, version: 8.0.14.0, time stamp: 0x61bb1862 Exception code: 0xc0000005 Fault offset: 0x0000000000002fe6 Faulting process id: 0x25d8 Faulting application start time: 0x01d8045afae2031a Faulting application path: C:\Program Files\php\php.exe Faulting module path: C:\Program Files\PHP\ext\php_opcache.dll Report Id: d7c91f64-1d66-4a3c-a206-7f1efaf4d8cc Faulting package full name: Faulting package-relative application ID:

php.ini:

[PHP_OPCACHE]
zend_extension=php_opcache.dll
opcache.enable=1;
opcache.enable_cli=1;
opcache.memory_consumption=512;
opcache.interned_strings_buffer=32;
opcache.validate_timestamps=1;
opcache.save_comments=0;
opcache.revalidate_path=1;
opcache.enable_file_override=1;
opcache.max_file_size=0;
opcache.file_cache_only=0;
opcache.log_verbosity_level=2;
opcache.mmap_base = 0x20000000
opcache.jit_buffer_size=0;
opcache.jit=Off;

Running Windows DebugDiag tool, with PHP 8.0.14 symbols loaded, gives the report attached here: opcache-crash-analysis.pdf

PHP Version

PHP 8.0.14

Operating System

Windows Server 2019; apache httpd 2.4.51 x64

jjdunn commented 2 years ago

additional information:

the crash has occurred 6x; each crash produces (2) .dmp files. Just ran DebugDiag on all 12 files.

Every one of the errors looks like this:

In php.exe_220104_074301.dmp the assembly instruction at php_opcache!accel_post_startup+1b4 in C:\Program Files\php\ext\php_opcache.dll from The PHP Group has caused an access violation exception (0xC0000005) when trying to read from memory location 0x00000060 on thread 0

or like this:

In php.exe_220107_224305-1.dmp the assembly instruction at php_opcache!accel_new_interned_string+66 in C:\Program Files\php\ext\php_opcache.dll from The PHP Group has caused an access violation exception (0xC0000005) when trying to read from memory location 0xa0000148 on thread 0

all of the errors are in accel_new_interned_string or accel_post_startup

in every case, the php process which crashed is a console application, NOT running in apache httpd.

cmb69 commented 2 years ago

I guess the issues on Windows Server 2019 are due to more aggressive ASLR.

Are you running multiple PHP CLI process in parallel? Either toplevel process, or PHP processes spawned by a single toplevel PHP process. If so, would assigning a unique opcache.cache_id to these processes help?

Also consider to set opcache.log_verbosity_level=4 and check the opcache error log for any suspicious messages.

jjdunn commented 2 years ago

Thanks @cmb69 for the speedy reply. Yes we are running tons of CLI processes: ~2000/day on this server. However the schedule of these jobs is unchanged from the old server.

Keeping a separate opcache.cache_id would defeat some of the value of caching - since these processes use many of the same files - and would consume vastly more memory (there are currently 22 separate CLI processes), so I don't think that's a realistic solution.

I've changed verbosity_level=4 as suggested and added opcache.error_log configuration. Will post details here after the next crash...which I presume will occur.

cmb69 commented 2 years ago

Yeah, makes probably sense to wait on hopefully helpful log messages.

However, sharing OPcache instances for "unrelated" (i.e. non forked) processes is a quirk on Windows, and can't generally work, since these processes (especially if dynamically loaded extensions are involved) can have different memory layout (ASLR), but OPcache partly relies on a fixed memory layout, and the only mitigation on Windows is checking that a single central function (zend_execute_ex) is at the same address when reattaching.

jjdunn commented 2 years ago

hmmm... I was not aware of the point you made in the previous comment. Is it possible for each CLI process to set opcache.cache_id dynamically, using something like ini_set("opcache.cache_id", <processID>); ?

And, if I did that, would the opcache configurations (especially opcache.memory_consumption=512;) apply to each of the CLI processes?? that's a lot of memory...unless I dial-down the memory_consumption for each of the CLI...

jjdunn commented 2 years ago

p.s. reviewing the php_opcache installation instructions: https://www.php.net/manual/en/opcache.installation.php

I see that my configuration lacked the recommended opcache.file_cache_fallback=1 and opcache.file_cache="C:\windows\temp" for Windows installations, which I have just added. Although file_cache_fallback=1 is the default on Windows, it apparently has no effect if file_cache is undefined.

cmb69 commented 2 years ago

Yes, you need to set opcache.file_cache explicitly, but I don't think this will help in your case. If attaching to the shared memory fails, and there is no file_cache, the script is run without OPcache. Whether file_cache brings a considerable performance improvement, on Windows at least, depends on the script. When I checked several Web apps a while ago, the performance improvement with file_cache_only was disappointing (perhaps due to the generally slow file access on Windows).

And no, you can't set opcache.cache_id during runtime (via ini_set()), since it is PHP_INI_SYSTEM and needs to be this way, since attaching to OPcache shared memory happens before any scripts are run. It is generally possible to read the value from an environment variable, or to set INI settings via the registry (for certain folders). That might be a viable option for you.

Any other OPcache setting is not affected by opcache.cache_id. Regarding opcache.memory_consumption=512; that sounds like a pretty large value; maybe you can get away with far less, especially when you have individual caches? Maybe you can add some logging to the scripts which shows the actual consumption (via opcache_get_status()).

jjdunn commented 2 years ago

Cristoph - thanks again for your help and advice, and very prompt replies. Much appreciated !

Regarding opcache.memory_consumption=512; that sounds like a pretty large value; maybe you can get away with far less

I just installed https://github.com/amnuts/opcache-gui and you are correct: 512MB is far too much memory. Our production server has been up since 25-Dec (more than two weeks) and is using < 55MB of the cache, with a ~100% hit-rate on more than 1000 files. see attached screenshot. opcache-usage

It is generally possible to read the value from an environment variable, or to set INI settings via the registry (for certain folders).

All of the CLI processes have the same entry-script, and all the CLI code is in the same directory, so a registry-entry probably would not work to create separate caches.

However, the CLI processes are launched by the Windows task-scheduler, which executes .bat files, so perhaps setting the opcache.cache_id as an environment-variable separately in each .bat file is the way to go. I'm not sure of the exact syntax for that : I see I can do export <VAR>=<value>, but what syntax causes it to be automatically read as a php.ini value when the entry-script runs?

As I mentioned previously, the CLI processes share quite a lot of code with the web-app, since they're both using the same framework and models, although different controllers and views. My intention was to have one large cache for efficiency.

Do you think it may be possible to fix the memory-access-violation, if I can provide the debugging output? Or should I assume it will be too difficult, and begin implementing the separate-cache for each CLI process ?

The fallback for all of this would be to disable opcache for CLI, but given that these processes execute 2000+ times/day, it seems woefully inefficient to do so.

cmb69 commented 2 years ago

Thanks for the detailed reply!

I'm not sure of the exact syntax for that

If you're starting the PHP scripts from batch files, you could pass the cache_id as command line option:

php -d opcache.cache_id=%ID% …

The %ID% can be hardcoded there. Or you could use some tool like uuidgen to generate a UUID on the fly. Or maybe some powershell script.

My intention was to have one large cache for efficiency.

Yes, that's reasonable. However, if the actual issue is really caused by different memory layout due to ASLR[1], that just won't work. The only way a single OPcache instance could reliably work for the CLI would be to have a single long-running CLI process (basically a service) which actually executes the different tasks; to be able to run multiple tasks in parallel, you would need to use an extension which allows to do so in a single process (e.g. parallel).

[1] I'm not sure whether this is actually the reason, but we faced the same issue when the PHP test runner was improved to run tests in parallel, and the solution we came up with was actually the opcache.cache_id which solved the issue nicely.

KapitanOczywisty commented 2 years ago

@cmb69 How about spawning fastcgi and using fast-cgi-client to run tasks? Could this solve OP problems?

cmb69 commented 2 years ago

@KapitanOczywisty, FastCGI has exactly the same issues. On POSIX systems this is solved by FPM, which forks its workers to enforce the same memory layout. There is nothing like that for Windows, though (because Windows does not support forking). Apache mpm_winnt comes close, but still may have that issue; PR #7865 might solve that.

jjdunn commented 2 years ago

ok - this is helpful. (BTW - so far it has not crashed again. I will post the opcache-debug output as/when it does).

Per the previous comment, I've reduced the opcache.memory_consumption from 512 to 80.

One more question: does it reserve this much memory? In other words, if I have a web-app, plus 22 CLI processes with separate opcache_id's, will I be reserving 23 * 80 = 1840MB (nearly 2GB) of memory? Or is the memory_consumption merely an upper-limit, after which it switches to the file-cache ?

jjdunn commented 2 years ago

one more thing:

The only way a single OPcache instance could reliably work for the CLI ....

In point of fact, since opcache.cache_id is not currently specified, the web-app + 22 CLI processes are sharing the cache: I can see the cache-hits from all the CLI processes, as they run, in the opcache-log. There are many files used by both web-app and CLI which are loaded from the memory cache.

So is the key word here "reliably" ? Because up till now (at least on our old production Win 2012 R2 server); and for the past few weeks on Windows Server 2019, it is in fact working....just not 100% reliably...hence this ticket.

cmb69 commented 2 years ago

OPcache instances are not supposed to shared between different SAPIs. Actually, each combination of user/cache_id/SAPI/system_id, where the system_id is unique for each PHP version plus some further details, has its own ID which is hashed. Since we're using MD5 hashes, there is a chance of collisions, but that should be pretty low. You can double check that when file_cache is enabled for Web and CLI; there should be at least two different sets of caches, even if you use the same basedir for the file_cache.

jjdunn commented 2 years ago

You can double check that when file_cache is enabled for Web and CLI; there should be at least two different sets of caches, even if you use the same basedir for the file_cache.

I'm not seeing that.

in php.ini on my non-production system:

zend_extension=php_opcache opcache.enable=1; opcache.enable_cli=1; opcache.error_log="c:\windows\temp\opcache.log"; opcache.file_cache="c:\temp\opcache"; opcache.file_cache_only=0;

Note: there is no opcache.cache_id defined in php.ini

the .bat file for CLI process #1: php -d opcache.cache_id=%cli-cache% <entryScript.php> <processName1> <entryPoint1> <arguments>

the .bat file for CLI process #2: php <entryScript.php> <processName2> <entryPoint2> <arguments>

I restarted apache, and then opened the web app; and executed both of the .bat files

EDIT: what I wrote originally was partly incorrect. I needed to actually STOP and then restart apache for the new cache configuration to take effect.

Now in c:\temp\opcache\ there are TWO subdirectories:

\43bf8cd1c5d522307bb298f22c2b7dc6\6017a1d78e41dff04681efbe1c05c242\C\web\app\...etc...\ contains the cached-files for the WEB application

\9ddbaf9b6188aa510f3b7eccc9d5c82c\6017a1d78e41dff04681efbe1c05c242\C\web\app\....etc....\ contains the cached-files for BOTH of the CLI applications. <<<=== this does not seem like the expected result.

cmb69 commented 2 years ago

Now in c:\temp\opcache\ there are TWO subdirectories:

\43bf8cd1c5d522307bb298f22c2b7dc6\6017a1d78e41dff04681efbe1c05c242\C\web\app...etc...\ contains the cached-files for the WEB application

\9ddbaf9b6188aa510f3b7eccc9d5c82c\6017a1d78e41dff04681efbe1c05c242\C\web\app....etc....\ contains the cached-files for BOTH of the CLI applications. <<<=== this does not seem like the expected result.

Hmm, I cannot reproduce that:

C:\php-sdk\phpdev\vs16\x64\php-src-8.0
$ type ..\gh7915.php
<?php

echo "hello world";

C:\php-sdk\phpdev\vs16\x64\php-src-8.0
$ dir x64\Debug\test_file_cache
 Datenträger in Laufwerk C: ist Boot
 Volumeseriennummer: A476-BF43

 Verzeichnis von C:\php-sdk\phpdev\vs16\x64\php-src-8.0\x64\Debug\test_file_cache

11.01.2022  14:27    <DIR>          .
11.01.2022  14:27    <DIR>          ..
               0 Datei(en),              0 Bytes
               2 Verzeichnis(se), 88.750.170.112 Bytes frei

C:\php-sdk\phpdev\vs16\x64\php-src-8.0
$ nmake run ARGS="-dopcache.cache_id=1 ..\gh7915.php"

Microsoft (R) Program Maintenance Utility, Version 14.29.30138.0
Copyright (C) Microsoft Corporation. Alle Rechte vorbehalten.

          "C:\php-sdk\phpdev\vs16\x64\php-src-8.0\x64\Debug\php.exe" -n -c "C:\php-sdk\phpdev\vs16\x64\php-src-8.0\x64\Debug\tmp-php.ini" -dopcache.cache_id=1 ..\gh7915.php
hello world
C:\php-sdk\phpdev\vs16\x64\php-src-8.0
$ nmake run ARGS="-dopcache.cache_id=2 ..\gh7915.php"

Microsoft (R) Program Maintenance Utility, Version 14.29.30138.0
Copyright (C) Microsoft Corporation. Alle Rechte vorbehalten.

          "C:\php-sdk\phpdev\vs16\x64\php-src-8.0\x64\Debug\php.exe" -n -c "C:\php-sdk\phpdev\vs16\x64\php-src-8.0\x64\Debug\tmp-php.ini" -dopcache.cache_id=2 ..\gh7915.php
hello world
C:\php-sdk\phpdev\vs16\x64\php-src-8.0
$ dir x64\Debug\test_file_cache
 Datenträger in Laufwerk C: ist Boot
 Volumeseriennummer: A476-BF43

 Verzeichnis von C:\php-sdk\phpdev\vs16\x64\php-src-8.0\x64\Debug\test_file_cache

11.01.2022  14:28    <DIR>          .
11.01.2022  14:28    <DIR>          ..
11.01.2022  14:28    <DIR>          d22e3edbd12ec49006b10131735197cc
11.01.2022  14:28    <DIR>          d3a3d426cf9d3efd3f581b1e261daf61
               0 Datei(en),              0 Bytes
               4 Verzeichnis(se), 88.750.133.248 Bytes frei
jjdunn commented 2 years ago

Hmm, I cannot reproduce that:

ah, yes. Sorry it was my mistake:

php -d opcache.cache_id=%cli-cache%

I used your example literally, but did not have a variable %cli-cache% defined.

using a string literal like "2" or "clicache", I get the same results as you. In my previous test, the cache_id variable was undefined, so functionally equivalent to null or empty-string, which explains the previous results.

Sorry for the wasted effort on your part, and for my ignorance of Windows shell syntax.

I think the easy solution for me is to use the -d opcache.cache_id=X option in all my CLI processes, and omit opcache.cache_id from php.ini, so the web app gets the default cache.

This still does not explain php_opcache crashing, as per the subject of this ticket. I've been monitoring for several days with opcache.log_verbosity_level=4; (huge log files); but so far the crash has not happened.

I'll be offline from Jan 12 to Feb 12, so I'll return to this topic in a month.

Thank you again very much for your time, attention, expertise, and patience. :-)

jjdunn commented 2 years ago

back online. problem occurred once on Jan 12; and again today Feb 14. Today's instance involved five CLI processes running concurrently. each process creates two dump files. In 10x dumps, the error is always at accel_new_interned_strings+66.

looks like adding the opcache.file_cache configuration helped some; but the failures continue. Possibly due to multiple concurrent CLI processes?

I have not yet tried adding the -d cache-id as suggested previously. Is that the next best thing to do ? I'm happy to continue trying work-around, but I would (very respectfully) point out that there is a bug somewhere which is causing a crash....anything I can do to help troubleshoot I'm happy to help with, keeping in mind I'm not a PHP core developer and have no ability to build or debug PHP locally. Nor can I reproduce this problem on demand - it only happens on a loaded production server.

p.s. I'm happy to post additional stack-traces (derived from using DebugDiag2 on the .dmp files). and/or any other artifacts, if that would be helpful.

dca00 commented 2 years ago

If it helps anyone, I had the same issue and solved it by cleaning up c:\windows\temp

dynasource commented 1 year ago

@cmb69, Has something been changed in 8.1 with respect to OPcache sharing in Windows CLI?

My concurrent CLI processes crash with the release of 8.1. Before, in 8.0, up to 8.0.27 we never had this problem. Windows event viewer shows this:

Faulting application name: php.exe, version: 8.1.14.0, time stamp: 0x63b570c0 Faulting module name: php8.dll, version: 8.1.14.0, time stamp: 0x63b5798e Exception code: 0xc0000005 Fault offset: 0x0000000000192531 Faulting process id: 0x0x6A68 Faulting application start time: 0x0x1D9323CF3D73A68 Faulting application path: D:\bin\php\x64\nts\php.exe Faulting module path: D:\bin\php\x64\nts\php8.dll

The issues are gone when using opcache.enable_cli=0 or by using pcache.cache_id However, this is not favorable for the reason of performance you likely mentioned in https://github.com/php/php-src/issues/7915#issuecomment-1008167635 We are losing 30% performance doing phpunit tests in parallel, which to us is kind of a big deal (even delayed adopting 8.1 for this reason).

Is there something that can be done to revert to the old behavior?

cmb69 commented 1 year ago

Has something been changed in 8.1 with respect to OPcache sharing in Windows CLI?

I don't think there has been anything specific wrt. Windows (CLI). Maybe your problems are caused by the inheritance cache which has been introduced. There might be bugs to fix, but without some way to reproduce these issues, there's probably not much we can do.

The issues are gone when using opcache.enable_cli=0 or by using pcache.cache_id However, this is not favorable for the reason of performance you likely mentioned in #7915 (comment)

Disabling OPcache (or only enabling file cache) can cause serious performance degradation; opcache_cache_id should not (unless there is not sufficient memory available; each cache_id get's its own SHM). So I suggest you try with different cache_ids.

Generally, I'm not happy with the OPcache re-attaching behavior (which is only implemented on Windows). On other systems, unrelated CLI processes always have their own OPcache instances. The re-attaching on Windows was implemented to be able to share a single cache for mutliple (F)CGI processes, but that has issues (i.e. possibly failing to work, so file_cache_fallback is a thing). And in the case of simulating fork(2) on Windows, sharing the same cache also makes sense. Anyhow, the way OPcache is designed, it can't really work great on Windows due to the lack of fork(2) combined with strong ASLR. I'm afraid there is no general solution on Windows; avoiding re-attaching should solve a couple of issues, but would impose limitations which probably could only be resolved by using multiple threads on Windows (instead of fork(2) on other systems), but that is not necessarily good wrt. stability (one crashing thread will kill the whole process including all other running threads).

dynasource commented 1 year ago

@cmb69 thanks for your fast reply.

The reproduction of the crash over here is quite easy in PHPstorm:

This process can be repeated with the same result every time. So it doesn't crash during the first run but always at the second run. It also always seems to crash at the same test so the reproduction seems to be consistent.

Update: Further investigation shows that the crash originates in calls to PDO's sqlite::memory: Any ideas what has changed between 8.0 and 8.1 on this matter?

cmb69 commented 1 year ago

The reproduction of the crash over here is quite easy in PHPstorm:

If it would generally happen, there certainly have been a lot of respective reports. I assume that this is specific to your production and/or test code, or possibly your machine or PHP configuration.

Further investigation shows that the crash originates in calls to PDO's sqlite::memory:

Do you call that directly (new PDO(":sqlite::memory")), or do you use a wrapper which inherits from PDO? If the latter, that might indeed be caused by the inheritance cache.

Also condider to clear the file cache, or even to disable it via configuration; at least for testing purposes.

dynasource commented 1 year ago

I digged it bit further, it seems the PDO was not called yet (sorry), but a file prior to that (a query building file named SelectQuery.php).

The process crashes when I call one of its methods. When I comment that out, the tests run fine (having that other php instance open). I can rerun this without problems.

However, when I make a change in that particular SelectQuery.php file, like adding '//' somewhere, everything crashes again. I have to kill that other PHP process to make it all running a again succesfully. Also when a new php instance is started, everything works repeatly fine.

So during runtime something is going on with that particular file and mutating it does have consequences, perhaps because of the new Inheritance cache.

dynasource commented 1 year ago

interestingly, when I mutate the ancestor root file (adding // somewhere), the 0xC0000005 disappears and everything runs smooth again

So we have an inheritance tree like this: SelectQuery > ClausesQuery > Query;

Mutate behavior by adding //: SelectQuery : -> tests and reruns crash with 0xC0000005 ClausesQuery : -> tests and reruns crash with 0xC0000005 Query: -> tests suddenly run OK again (some tree reset?)

dynasource commented 1 year ago

good news, we have found a workaround to this bug, thanks to @cmb69 pointing to looking into the inheritance cache. Highly appreciated.

Before running multiple processes in parallel, we run a single php file in which we loop the composer classmap (composer/autoload_classmap.php) and execute a class_exists on the classes.

With this php instance running in the background, we are able to run our tests in parallel again. Jeej.

gasperj commented 1 year ago

Regarding the opcache.cache_id property - I run all our Windows IIS sites under different user accounts for the apppools. So, would it still apply in this configuration to specify a different cache_id for each site? Thank you.

; Facilitates multiple OPcache instances per user (for Windows only). All PHP ; processes with the same cache ID and user share an OPcache instance. ;opcache.cache_id=

; Enables and sets the second level cache directory. ; It should improve performance when SHM memory is full, at server restart or ; SHM reset. The default "" disables file based caching. opcache.file_cache="D:\PHPopcache\${APP_POOL_ID}\"

; Implies opcache.file_cache_only=1 for a certain process that failed to ; reattach to the shared memory (for Windows only). Explicitly enabled file ; cache is required. opcache.file_cache_fallback=1

bertasoft commented 1 month ago

I've detected the same problem when PHP use a symlink and when PHP use the same file between different sites. eg.: site1, site2, siteN, use the c:\common\shared.php