php / php-src

The PHP Interpreter
https://www.php.net
Other
37.78k stars 7.72k forks source link

PHP checks the filesystem for file existing even if available in opcache and all opcache validations are disabled #13989

Open dkarlovi opened 4 months ago

dkarlovi commented 4 months ago

Description

The following code:

opcache.enable = 1
opcache.enable_cli = 1

opcache.file_update_protection = 0
opcache.validate_timestamps = 0
opcache.validate_permission = 0
opcache.validate_root = 0

opcache.file_cache = "/var/cache/opcache"
opcache.file_cache_only = 1
opcache.file_cache_consistency_checks = 0

will still trigger "File doesn't exist" even though opcache is correctly prepopulated. If I create an empty folder with an empty file, it works as expected.

See (reproduced in Docker, but this is not Docker specific, it's only convenient to reproduce there): https://github.com/dkarlovi/docker-php-opcache-only/blob/main/Dockerfile#L15

Try it like here: https://github.com/dkarlovi/docker-php-opcache-only/blob/main/README.md

PHP Version

PHP 8.2, 8.3

Operating System

Docker PHP image

SakiTakamachi commented 4 months ago

Hi, docker images are not under our manage.

~~Please report it here. https://github.com/docker-library/php~~

SakiTakamachi commented 4 months ago

Sorry, I misread the description. So you just used Docker to reproduce the issue, right?

Has that happened outside of a docker environment? Or haven't you tried that yet?

dkarlovi commented 4 months ago

@SakiTakamachi NP:

So you just used Docker to reproduce the issue, right?

Correct! It's a bit convoluted, but easiest to reproduce in Docker.

I haven't tried it yet outside Docker, no. It should be quite doable, but since you need to tweak opcache settings and have it use the filesystem for storage, it's a bit more of a PITA to do.

For one thing: you'd need to prepopulate the build the opcache using the file on the filesystem, then remove the file and ensure it's not in some cache still, then rerun PHP which shouldn't do a filesystem check. Docker allows us to make this much simpler since it's restarting from scratch in step 2 with artifacts generated in step 1, we know there's nothing on the filesystem.

dkarlovi commented 4 months ago

@SakiTakamachi I did managed to reproduce this without Docker, like so

echo -e "<?php\necho 'hello world';" > test.php

php -dopcache.enable=1 -dopcache.enable_cli=1 -dopcache.file_update_protection=0 -dopcache.validate_timestamps=0 -dopcache.validate_permission=0 -dopcache.validate_root=0 -dopcache.file_cache_only=1 -dopcache.file_cache_consistency_checks=0 -dopcache.file_cache=/tmp test.php
> hello world

rm test.php

php -dopcache.enable=1 -dopcache.enable_cli=1 -dopcache.file_update_protection=0 -dopcache.validate_timestamps=0 -dopcache.validate_permission=0 -dopcache.validate_root=0 -dopcache.file_cache_only=1 -dopcache.file_cache_consistency_checks=0 -dopcache.file_cache=/tmp test.php
> Could not open input file: test.php

# note this only creates the file placeholder, it's empty
touch test.php

# note output here even though the file is now empty, it's (correctly) not being used since opcache is warm - why is the filesystem entry even needed then, it shouldn't be since it's required to use opcache as-is, without validating it
php -dopcache.enable=1 -dopcache.enable_cli=1 -dopcache.file_update_protection=0 -dopcache.validate_timestamps=0 -dopcache.validate_permission=0 -dopcache.validate_root=0 -dopcache.file_cache_only=1 -dopcache.file_cache_consistency_checks=0 -dopcache.file_cache=/tmp test.php
> hello world
iluuu1994 commented 4 months ago

I had a quick look. From what I understand, PHP still tries to open the file because it needs to resolve its path. There might be benefits in trying to resolve the path without looking up the file on the file system.

Also note that include / cli scripts are actually two different code-paths, so this assumption would likely need to be adjusted in multiple places.

dkarlovi commented 4 months ago

Ah, so you're saying it's doing realpath() just to figure out the absolute path and only then checking opcache?

Basically, the file needs to exist just to figure out the absolute path?

iluuu1994 commented 4 months ago

Yes. We might try to normalize the path without using the file system, and only revert to it when the lookup fails. But this might require quite a few changes.

dkarlovi commented 4 months ago

Wouldn't that have a good probability of improving performance in the hot path quite a bit since you'd eliminate expensive I/O when using opcache? 🤔

iluuu1994 commented 3 months ago

Not normally. Once the file is compiled to shm the file is not touched anymore. Repeated opening of the file only happens when using file cache without shared memory (opcache.file_cache_only=1). That is only a recommended configuration for Windows AFAIK (because it's impossible to reliably map the shm region to the same memory address from multiple processes without fork).

dkarlovi commented 3 months ago

@iluuu1994 but the file based cache is technically "shm" too, no?

iluuu1994 commented 3 months ago

Not quite. "Shared memory" or shm commonly refers to memory a block of memory, allocated by your operating system, that multiple processes can access. The file cache without shm simply reads and loads the cache file for every request. This is still very much non-optimal for performance (but definitely better than no opcache).

dkarlovi commented 3 months ago

I know what SHM is, my argument is the file system based shared memory (hence the quotes, "SHM") is serving the same purpose as the real SHM so if the filesystem path is not checked when using the real memory, it shouldn't be using the fake one either.

iluuu1994 commented 3 months ago

I did not say that this is not worth fixing, just that it doesn't normally occur with the normal opcache workflow. When using file cache without shm, you're already opting out of quite a bit of performance.

dkarlovi commented 3 months ago

Maybe you've misunderstood my tone. 😃

It was less "WHY DOESN'T THIS WORK OMG" and more "What's the technical reason the file is required to exist with file based opcache while it's not required with SHM based opcache, assuming both caches are serving the same purpose?" It feels like it should work for both or neither in sync to me, judging from an outsider.

iluuu1994 commented 3 months ago

What's the technical reason the file is required to exist with file based opcache while it's not required with SHM based opcache, assuming both caches are serving the same purpose?

No, you're right, there's none. I'm not disagreeing. :smile: My response was targeted at:

Wouldn't that have a good probability of improving performance in the hot path quite a bit since you'd eliminate expensive I/O when using opcache? 🤔

I answered no, because that's not usually happening. Each file is only touched once for the first request after the server has started, or when using opcache.file_cache_only=1 (which you shouldn't usually enable).

dkarlovi commented 3 months ago

But, if you've said the real path is resolved (and the path needs to exist) before looking up opcache, the same thing is not happening when opcache is in SHM?

How does it resolve the real path on subsequent requests then when in SHM? 🤔

iluuu1994 commented 3 months ago

the same thing is not happening when opcache is in SHM?

It is happening without file cache too, but for more obvious reasons: The file needs to be read and compiled. Once it's in shm, it no longer happens. We use the unresolved filename passed to require/include as a key in a map to the persisted script. So, when you do require 'test.php'; the second time, it is found in the map, and the file doesn't need to be looked up anymore.

How does it resolve the real path on subsequent requests then when in SHM? 🤔

For file cache, we need the whole path to resolve the fully qualified file within the cache folder. If /tmp is used as a file cache folder, we can't assume require 'test.php'; refers to the same file in all instances, as you may be using the file cache for many projects. At least that's my understanding, I have not written or really edited this code. :slightly_smiling_face:

dkarlovi commented 3 months ago

OK, let me recap my understanding of this problem and what we wrote here and you hopefully point out where I got it wrong:

Let's say there's three scenarios:

  1. no opcache
  2. SHM opcache, let's call this shm
  3. filesystem opcache, let's call this fs

No opcache is boring "always redo everything from scratch", we can skip it.

shm

  1. empty starting point
  2. execution for index.php comes in
  3. it gets resolved into /app/index.php
  4. looked up in shm, not there
  5. read from file system, parsed and put into shm
  6. executed
  7. execution for index.php comes in
  8. it gets resolved into /app/index.php
  9. looked up in shm, there
  10. executed

fs

  1. empty starting point
  2. execution for index.php comes in
  3. it gets resolved into /app/index.php
  4. looked up in fs, not there
  5. read from file system, parsed and put into fs
  6. executed
  7. execution for index.php comes in
  8. it gets resolved into /app/index.php
  9. looked up in fs, there
  10. executed

From my understanding, this should be correct? shm should be functionally identical to fs, the only difference is the storage mechanism which shouldn't influence the code path (except where it reads/writes from, of course).

Now, if we remove the file /app/index.php from the filesystem (after either opcache is warmed), you're saying that

  1. shm will still work (because it doesn't need to resolve it into /app/index.php or it doesn't rely on realpath)
  2. fs will not work (because it does need to resolve it into /app/index.php using realpath)

We use the unresolved filename passed to require/include as a key in a map to the persisted script

Would that mean shm has this additional lookup map, but fs does not? Would the solution here be to create the lookup map as part of fs?

If /tmp is used as a file cache folder, we can't assume require 'test.php'; refers to the same file in all instances, as you may be using the file cache for many projects.

But that applies to SHM too, since it's shared (it's in the name)? :thinking: Just as multiple projects can be using /tmp, they can also be using the same SHM.

iluuu1994 commented 3 months ago
  • it gets resolved into /app/index.php
  • looked up in shm, not there

This part is not accurate. On require './foo.php';, PHP looks for the script in ZCSG(hash) using a key constructed from ./foo.php, the include path, and some other variables (see accel_make_persistent_key). Only if you refer to an existing cached file through some other string, e.g. its absolute path, will the file path be resolved again (see persistent_zend_resolve_path).

  • looked up in fs, not there
  • read from file system, parsed and put into fs

Your file cache description is an oversimplification, because without opcache.file_cache_only=1 (which you aren't using), you're also using shm. I.e. when a file gets loaded from file cache, it is then put into shm, and the next request will not read it from file cache again. As mentioned previously, file cache is certainly faster than compiling a file, but still much slower than shm. Hence, file cache essentially only helps with the initial compilation, somewhat mitigating the cache stampede problem when restarting the server.

From my understanding, this should be correct? shm should be functionally identical to fs, the only difference is the storage mechanism which shouldn't influence the code path

Yeah, thsi is not correct. See above. File cache is usually used in addition to shm, not in instead of shm.

Would that mean shm has this additional lookup map, but fs does not? Would the solution here be to create the lookup map as part of fs?

ZCSG(hash) is part of shm, which means you are using it when using file cache. However, when ZCSG(hash) is unprimed, the first lookup of a file is resolving its full path name. Likewise, when not using shm with file cache at all (opcache.file_cache_only=1), looking for the cache file currently involves opening the file to resolve the full path.

Both of these could be improved by trying to resolve relative paths and checking for the cached file first, and only then resorting to opening the file.

Just as multiple projects can be using /tmp, they can also be using the same SHM.

Shm can never be shared in unrelated processes. It's an mmapped, anonymous memory region that is only visible to the current process, and its forks. fork only usually occurs in the SAPIs (e.g. in php-fpm).

dkarlovi commented 3 months ago

without opcache.file_cache_only=1 (which you aren't using), you're also using shm

Sorry, just noticed the original issue had the old INI paste, the reproducer was always correct. I am using opcache.file_cache_only=1 actually.

The idea is being able to put opcache in a PHAR with a static executable shim (or into a Docker image) and ship just opcache, without the original source code, the fs opcache could get promoted into shm, but doesn't need to be in say CLI scenarios where you're only running it once.

dkarlovi commented 3 months ago

when ZCSG(hash) is unprimed, the first lookup of a file is resolving its full path name

This means you could get a stampede to warm this if you're moving from fs to shm after a restart?

Both of these could be improved by trying to resolve relative paths and checking for the cached file first, and only then resorting to opening the file.

OK, then I think I finally understand now, thanks for your patience!

cmb69 commented 1 month ago

[…] Repeated opening of the file only happens when using file cache without shared memory (opcache.file_cache_only=1). That is only a recommended configuration for Windows AFAIK (because it's impossible to reliably map the shm region to the same memory address from multiple processes without fork).

opcache.file_cache_only=1 is a workaround for shared hosting (and maybe other cases). You likely meant opcache.file_cache_fallback=1 which is a Windows only directive, and may help with processes where SHM won't work for the mentioned reasons.