krakjoe / apcu

APCu - APC User Cache

apc_fetch returns copy of cached values, could not another behaviour be added ? #175

Open archibaldhaddock opened 8 years ago

archibaldhaddock commented 8 years ago

Hello everybody !

I use apcu to store big arrays that cost a lot to produce from xml config files. These arrays are pure configuration data: my source only uses them in a read-only way. I suppose that this is a very common use of the apcu extension.

Moreover, the content of these big arrays is the same for a lot of sites served by the same php-fpm process. This means that where I could have only one instance for each configuration array, I've got hundreds...

My questions are :

1) Would it not be great to implement an additional kind of apc_fetch that would not return a copy of the cached values but a reference to them? (Strict read-only use of the returned references would be the coder's responsibility.) Looking into the apcu code, would it be enough to duplicate apcu_fetch as apcu_fetch_read_only and return the zval from apc_cache_find instead of the deep-copied zval?

2) Another way to avoid the current useless copy ( and allocation ) of tons of data would be to return something (whatever the inner zend implementation) that would follow the usual copy-on-write behaviour.

$a and $b being arrays, $a = $b won't deep-copy $b into $a. Why should $a = apcu_fetch($key); ?
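The copy-on-write behaviour referred to here can be observed from userland. The following is only a rough sketch: the array size is arbitrary, and the exact byte counts depend on the PHP version and allocator, so no specific numbers are claimed.

```php
<?php
// Minimal copy-on-write demonstration; the array size is arbitrary.
$b = range(1, 100000);

$before = memory_get_usage();
$a = $b;                       // no deep copy: $a and $b share one array
$afterAssign = memory_get_usage();
$a[] = 0;                      // first write to $a triggers the real copy
$afterWrite = memory_get_usage();

printf(
    "assignment: %d bytes, first write: %d bytes\n",
    $afterAssign - $before,
    $afterWrite - $afterAssign
);
```

The assignment should cost close to nothing, while the first write pays for the duplication. With apcu_fetch, that duplication is paid on every call, whether or not the result is ever modified.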

I posted a similar question on the HHVM GitHub: https://github.com/facebook/hhvm/issues/5167 and I have verified (with xhgui) that their answer is correct: HHVM actually implements option 2).

Regards

krakjoe commented 8 years ago

We can't change the normal APC(u) API, or Zend ... but for PHP 7 there seems to be a way to avoid copying (some) arrays out of SHM ...

I don't know when/if I'll have time to work on it ... if anyone else wants to have a go, feel free, and comment here ...

Some thoughts ...

It will need to have a new API, suggest ?

void apcu_persist(array $thing);
void apcu_fetch_persistent(array $thing);

It will need to make sure that there are no serialized objects in the array also, an exception should be thrown if the array contains data that cannot be persistent ...
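The check described above could be prototyped in userland along these lines. This is a sketch only: `assert_persistable` is a hypothetical name, not part of APCu, and a pure PHP check like this cannot detect PHP references stored inside the array.

```php
<?php
/**
 * Hypothetical helper (not an APCu function): throws if $value contains
 * anything other than null, scalars and nested arrays of the same.
 */
function assert_persistable($value, string $path = '$value'): void
{
    if ($value === null || is_scalar($value)) {
        return; // null, int, float, string and bool are fine
    }
    if (is_array($value)) {
        foreach ($value as $key => $child) {
            assert_persistable($child, $path . '[' . var_export($key, true) . ']');
        }
        return;
    }
    // Objects and resources end up here.
    throw new InvalidArgumentException(
        sprintf('%s has non-persistable type %s', $path, gettype($value))
    );
}
```

Reporting the path of the offending element makes it easier to find the one object buried in a large config array.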

staabm commented 8 years ago

The copy-on-write semantic fetch is not an option (like suggested in the initial post)?

archibaldhaddock commented 8 years ago

For a new API name, I'd rather choose something coherent with the existing functions, like, say: apcu_store_persistent & apcu_fetch_persistent. "Persistent" may or may not be the proper qualifier: the main distinction from the original API is that the stored arrays won't be copied. Cached values are already persistent, aren't they? Maybe "apcu_store_nocopy/apcu_fetch_nocopy"?

As we are talking about new features, I was wondering if some kind of control over the uniqueness of contents could be added (extra parameter, new API?). If I selfishly get back to my own setup: I have, for example, 10 sites served by the same php-fpm. Each one of them uses 10 different huge config arrays, from $array_1 to $array_10. These arrays are the same for every site ($array_x != $array_y but $array_x for site 1 = $array_x for site 2).

This situation is likely to happen on a lot of shared-hosting setups.

What if, for the same content, only one instance were maintained in apc shared memory? Of course this could be obtained in PHP code with something like this (please see nxtweb_cache_get_by_content and nxtweb_cache_set_by_content)


<?php 

define ("NXTWEB_CACHE_APC", 0);
define ("NXTWEB_CACHE_APCU", 1);
define ("NXTWEB_CACHE_NONE", 2);
define ("NXTWEB_OPTIMISATION", 1);

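// defined() expects the constant's name as a string
$nxtweb_cache_backend = ( !defined("NXTWEB_OPTIMISATION") ? NXTWEB_CACHE_NONE : ( function_exists("apc_fetch") ? NXTWEB_CACHE_APC : ( function_exists("apcu_fetch") ? NXTWEB_CACHE_APCU : NXTWEB_CACHE_NONE ) ) );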

function nxtweb_cache_set($key, $value)
{
    global $nxtweb_cache_backend;

    switch ($nxtweb_cache_backend) {

        case NXTWEB_CACHE_APC :
            apc_store($key, $value);
            break;
        case NXTWEB_CACHE_APCU :
            apcu_store($key, $value);
            break;
        default :
            return false;
    }

    return true; 
}

function nxtweb_cache_set_by_content($key, $value)
{
    // derive a content key from the md5 of the JSON-encoded $value
    $key_from_content = md5(json_encode($value));
    $ret = nxtweb_cache_get($key_from_content);
    if (!$ret)
    {
        if ( !nxtweb_cache_set($key_from_content, $value))
            return false;

    }
    if (!nxtweb_cache_set($key, $key_from_content))
        return false;

    return true;
}

function nxtweb_cache_get($key)
{
    global $nxtweb_cache_backend;

    switch ($nxtweb_cache_backend) {

        case NXTWEB_CACHE_APC : 
            return apc_fetch($key);
        case NXTWEB_CACHE_APCU :
            return apcu_fetch($key);
        default :
            return false;
    }
}

function nxtweb_cache_get_by_content($key) 
{
    $key_from_content = nxtweb_cache_get($key);

    if (!$key_from_content)
        return false;

    return nxtweb_cache_get($key_from_content);    
}

?>

But of course, the control of content could be achieved much faster by the C code of the extension itself.

TysonAndre commented 8 years ago

For PHP7 (with opcache enabled) or HHVM, @archibaldhaddock might be interested in the below link to solve their problem, if writes are infrequent, and all of the classes used are infrequent/ have __set_state implemented:

https://web.archive.org/web/20170608185145/https://blog.graphiq.com/500x-faster-caching-than-redis-memcache-apc-in-php-hhvm-dcd26e8447ad?gi=1c30f8eb7302
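As a rough illustration of the approach in that article: the value is exported as PHP source and read back with include, so that opcache (when enabled) can serve it from shared memory. The function names and the md5-based file naming below are assumptions made for this sketch, not taken from the article.

```php
<?php
// Hypothetical helpers: store a value as a PHP file and read it back.

function file_cache_path(string $dir, string $key): string
{
    return $dir . '/' . md5($key) . '.php';
}

function file_cache_set(string $dir, string $key, $value): bool
{
    // var_export() produces valid PHP source for scalars and arrays;
    // objects additionally need __set_state() on their class.
    $code = '<?php return ' . var_export($value, true) . ';' . PHP_EOL;
    return file_put_contents(file_cache_path($dir, $key), $code) !== false;
}

function file_cache_get(string $dir, string $key)
{
    $file = file_cache_path($dir, $key);
    if (!is_file($file)) {
        return false; // same miss convention as apcu_fetch
    }
    return include $file;
}
```

The code works without opcache too, but then every include re-parses the file, which is the situation archibaldhaddock may have measured earlier.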

TysonAndre commented 8 years ago

I'm sort of interested in this, since I have a similar use case of fetching a lot of arrays and scalars from a large config. Currently, it's dealt with by splitting the config up by array key.

If this code was for arrays and scalars only, it might be possible with adding IS_TYPE_IMMUTABLE and removing IS_TYPE_REFCOUNTED. I'm not sure if there are any other type flags to set/unset.

On apcu_store_persistent:

  1. Check if everything is arrays and scalar types (long, double, null, boolean, string)
  2. From the original array, create an array where IS_TYPE_IMMUTABLE is set on that array and on all of the child nodes (I forget whether that includes strings). For any array or string allocations, use allocation in shared memory instead of emalloc to ensure that this permanent array outlives the request.
  3. When the array is requested, return the permanent array. The IS_TYPE_IMMUTABLE bit should prevent it from being garbage collected.

There are going to be race conditions when replacing/removing those values. The permanent array would have to be kept around until the last request to use them was finished.

And this would require writing equivalent functionality to destroy the arrays and strings, using the free method for shared memory apc pointers for all of those.

Edit: Shared memory allocation/deallocation routines for persistence, not malloc()

https://wiki.php.net/phpng-int#cell_format_zval

Except for types itself, the engine defines few type flags to uniform handling of different data types with similar behavior.

  • IS_TYPE_CONSTANT – the type is a constant (IS_CONSTANT, IS_CONSTANT_AST)
  • IS_TYPE_REFCOUNTED – the type is a subject for reference counting (IS_STRING excluding interned strings, IS_ARRAY except for immutable arrays, IS_OBJECT, IS_RESOURCE, IS_REFERENCE). Values for all refcounted types are pointers to corresponding structures having common part (zend_refcounted). It's possible to get this structure using Z_COUNTED() macro or some data from that structure using Z_GC_TYPE(), Z_GC_FLAGS(), Z_GC_INFO() and Z_GC_TYPE_INFO(). It's also possible to access reference counter using Z_REFCOUNT(), Z_SET_REFCOUNT(), Z_ADDREF() and Z_DELREF() macros.
  • IS_TYPE_COLLECTABLE – the type may be a root of unreferenced cycle and it's a subject for Garbage Collection (IS_ARRAY, IS_OBJECT).
  • IS_TYPE_COPYABLE – the type has to be duplicated using zval_copy_ctor() on assignment or copy on write (IS_STRING excluding interned strings, IS_ARRAY)
  • IS_TYPE_IMMUTABLE - the type can't be changed directly, but may be copied on write. Used by immutable arrays to avoid unnecessary array duplication.

archibaldhaddock commented 8 years ago

Thank you TysonAndre for your comments !

I tried the "include way of the force" a few months ago because I thought that serializing/unserializing was a big part of the problem. apcu_fetch proved to be better...

But on which PHP version? With opcache enabled? (If not, what I eventually compared was unserializing and parsing.) I can't remember, so I'm going to test that again on php7 + opcache. If it's ok, what about generalizing this approach to all types of configuration? Along with the arrays produced from xml files, I'm thinking about tons of parameters that are stored in a mysql database and read on every request...

The new behaviour would be: 1) execute a mysql request to test whether the parameters table has changed (update_time from information_schema.tables); 2) if yes, produce the corresponding php config file, if no, do nothing; 3) always include the php config file. Of course, opcache must be configured so that it can detect changes in the php parameters file immediately...
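A rough sketch of that three-step flow, under stated assumptions: `load_config` is a hypothetical helper, `$sourceUpdatedAt` stands in for the update_time read from the database, and `$produce` stands in for whatever builds the parameter array. The write here is deliberately naive.

```php
<?php
// Hypothetical helper: regenerate the config file only when the source changed.
function load_config(string $file, int $sourceUpdatedAt, callable $produce): array
{
    clearstatcache(true, $file);

    // Steps 1 and 2: the file's mtime records which source version it holds.
    if (!is_file($file) || filemtime($file) !== $sourceUpdatedAt) {
        $code = '<?php return ' . var_export($produce(), true) . ';' . PHP_EOL;
        file_put_contents($file, $code, LOCK_EX); // naive, non-atomic write
        touch($file, $sourceUpdatedAt);

        // Ask opcache to drop its compiled copy, if the extension is loaded.
        if (function_exists('opcache_invalidate')) {
            opcache_invalidate($file, true);
        }
    }

    // Step 3: always include the generated file.
    return include $file;
}
```

Storing the source timestamp as the file's mtime avoids comparing two clocks; any mismatch simply means "regenerate".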

Thanks again !

TysonAndre commented 8 years ago

Also, if running multiple PHP processes(not threads) on the same server, file based opcache might be worth looking into, but not necessary. It's a configure flag for PHP (Not as only cache).

Opcache memory limit may need to be increased.

If the configs are expected to be small or scalars, APC would probably be better.

There are probably going to be a lot of race conditions. The final file name should be based on the db modification time. A temporary file with a globally unique name should be created, on the same disk partition, written to, and rename()d to the final file(file writes aren't atomic). (And the last modification date cached in APC) Even after that, there are still probably race conditions.

The modification time might change between 1 and 2, for example.

I was recommending it for unchanging (versioned?) configs(v1 never changes, v2 never changes, etc.). If the configs change frequently, the implemented opcache solution is probably going to have more bugs.
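One way the temporary-file-plus-rename() advice could look for versioned configs is sketched below. `publish_config` and the `name.vN.php` naming scheme are assumptions for illustration; as noted above, this narrows the race window but does not remove every race.

```php
<?php
// Hypothetical helper: write version N of a config exactly once, atomically.
function publish_config(string $dir, string $name, int $version, array $config): string
{
    $final = sprintf('%s/%s.v%d.php', $dir, $name, $version);
    if (is_file($final)) {
        return $final; // a published version never changes
    }

    // Create the temporary file in the target directory so that rename()
    // stays on the same partition.
    $tmp = tempnam($dir, 'cfg');
    if ($tmp === false) {
        throw new RuntimeException('could not create a temporary file in ' . $dir);
    }

    $code = '<?php return ' . var_export($config, true) . ';' . PHP_EOL;
    if (file_put_contents($tmp, $code) === false || !rename($tmp, $final)) {
        @unlink($tmp);
        throw new RuntimeException('could not publish ' . $final);
    }

    return $final;
}
```

Readers then include the returned path. Because the final name only appears once the file is complete, a concurrent request sees either no file or the whole file.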

nikic commented 5 years ago

Since PHP 7 we have in principle the possibility of directly accessing strings and arrays from SHM without copying them into request memory. The big problem is that in this case we have to ensure that the data remains in SHM until there is nobody left using it.

Opcache solves this problem by never removing anything from the cache until an opcache restart is triggered, in which case it waits until all processes have disconnected from opcache and then kills hanging processes. This is known to be problematic both because it causes an interval of time (while the restart is in progress) where PHP runs without caching, which can cause a large load spike, and because it causes instability in the cases where opcache ends up having to kill processes.

What would be relatively simple to implement is a mechanism that stores something into apcu forever and which can then be used without copying. But I don't know how to make this work with storage that is supposed to be freed again, while also being robust to abnormal conditions (such as crashed worker processes).

Krinkle commented 3 years ago

@nikic I noticed a recent release changed the serializer to default. I haven't dug into how much the serializer changed, but I'm hoping that maybe this means the fast-access SHM idea you described above became reality. Can you confirm this?

I'm looking forward to adopting this at Wikipedia (via MediaWiki) and reporting on the positive impact it would have on latencies. There are a number of large values we fetch from APCu on every request, and this area took a noticeable perf hit when we switched from HHVM back to Zend PHP. I understand why and we've mostly regained that in other areas, but you can imagine my excitement at the possibility of having both LRU GC and super fast access. ♥️

EDIT: I see the default has since switched back again to php, so I guess we'll hold off for now. But I'd still be interested to know whether or not the default serializer did indeed evolve in the above direction!

nikic commented 3 years ago

@Krinkle Nope, the default serializer still copies values. They're stored in a format that could be directly used if we figured out how to manage invalidation, but currently a copy is still required.

The recommendation for caching large, mostly static data remains opcache.

LionsAd commented 3 years ago

I mean - you could definitely pretty easily implement usage of opcache for the data you want, but yeah invalidation (eg restart every 24 hrs) is then needed.

That said you could easily add a two-tier caching system, where you implement your own LRU by using /dev/shm/cache/bin/cid.php.

The advantage of that is that you can remove items that have not been accessed, so they would not be reloaded via include_once into SHM after the 24 hrs reset, while those that are accessed are loaded super-fast from the shared memory file system into opcache, so your opcache rebuilds really fast.

In fact I would probably just add a counter SHM file next to the cache entry. And only load those via require_once into opcache that are needed.

TysonAndre commented 2 years ago

> What would be relatively simple to implement is a mechanism that stores something into apcu forever and which can then be used without copying. But I don't know how to make this work with storage that is supposed to be freed again, while also being robust to abnormal conditions (such as crashed worker processes).

Yeah, if that were to be done in APCu, having a separate memory area with separate memory limits for data with no expiry that couldn't be deleted would seem more appropriate, but would still have the issues you mentioned. Though at that point, it might make sense to instead

What kind of abnormal conditions/crashes have been seen with opcache? Worker processes that fail to exit while holding a lock?

Possible workarounds:

  1. Integrate into opcache directly as opcache_store_immutable_value(string $key, array|string|float|int|bool|null $value): int and opcache_get_immutable_value(string $key, bool &$success = false): array|string|float|int|bool|null $value - This seems like it'd have somewhat lower memory usage for arrays without serialized objects because it could reuse interned strings more effectively

    Though I don't think a majority of people would have this use case, it'd compete with memory used for compiled files, and the hackish workaround of file_put_contents(fileNameForKey, var_export(..., true)) exists and works with opcache's file cache as well (as mentioned in https://web.archive.org/web/20170608185145/https://blog.graphiq.com/500x-faster-caching-than-redis-memcache-apc-in-php-hhvm-dcd26e8447ad?gi=1c30f8eb7302 )

  2. Create a new PECL extension entirely
  3. I missed it before, but the maintainers had brought up the idea of that functionality to apcu

> It will need to have a new API, suggest ?
>
> void apcu_persist(array $thing);
> void apcu_fetch_persistent(array $thing);
>
> It will need to make sure that there are no serialized objects in the array also, an exception should be thrown if the array contains data that cannot be persistent ...

so objects, PHP references, and resources?


https://www.npopov.com/2021/10/13/How-opcache-works.html - your post cleared up what opcache is doing - I'd assume apcu is doing something similar.


I'm not actively working on alternatives right now, I'd just decided to research how apcu_fetch worked for string data, and whether it'd be constant time or take additional time/memory.

It turns out it'd take additional time/memory even for plain strings (a new copy is allocated for every call to apcu_fetch, which can be partly worked around by adding a pure php wrapper which calls array_key_exists and uses the original value if it exists before calling apcu_fetch)

static zend_string *apc_unpersist_zstr(apc_unpersist_context_t *ctxt, const zend_string *orig_str) {
    zend_string *str = zend_string_init(ZSTR_VAL(orig_str), ZSTR_LEN(orig_str), 0);
    ZSTR_H(str) = ZSTR_H(orig_str);
    apc_unpersist_add_already_copied(ctxt, orig_str, str);
    return str;
}
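The pure PHP wrapper mentioned above could be sketched as follows. `RequestLocalCache` is a hypothetical name, and the fetcher is injected as a callable (for example `'apcu_fetch'`) so that the sketch also runs without the extension. Note that cache misses are memoized as well, and that a value changed in APCu during the request will not be seen.

```php
<?php
// Hypothetical request-local memoization in front of apcu_fetch.
final class RequestLocalCache
{
    /** @var array<string, mixed> values already fetched in this request */
    private $values = [];

    /** @var callable */
    private $fetch;

    public function __construct(callable $fetch)
    {
        $this->fetch = $fetch;
    }

    public function get(string $key)
    {
        // array_key_exists() rather than isset(), so stored nulls count as hits.
        if (!array_key_exists($key, $this->values)) {
            $this->values[$key] = ($this->fetch)($key);
        }
        // Returned by value, but copy-on-write means no duplication happens
        // unless the caller modifies the result.
        return $this->values[$key];
    }
}
```

With APCu loaded, `$cache = new RequestLocalCache('apcu_fetch');` followed by repeated `$cache->get('config')` calls would copy the value out of shared memory only once per request.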

TysonAndre commented 1 year ago

The performance for unserializing large arrays with many refcounted (string/array/PHP reference) values may be improved somewhat by the optimizations mentioned in https://github.com/krakjoe/apcu/issues/323#issuecomment-1288117298 (especially for avoiding the excessive hash table collisions caused by the way pointers are used as indexes in the hash table) - I haven't benchmarked this.

TysonAndre commented 1 year ago

Also, are the maintainers still interested in PRs for adding support for persistent values? I'm looking into whether that's possible.

Another note: If persistent storage APIs that couldn't be cleared were added to APCu, then I believe values in temporary storage could safely use the strings in shared memory (on OSes where shared memory has the same address for all processes attached to shared storage) if those zend_string pointers (or equivalent strings, similar to the way opcache's interned strings storage currently works) were already in APCu's persistent shared storage for other reasons (e.g. keys/values in configs)

TysonAndre commented 1 year ago

> It will need to have a new API, suggest ?

> Also, are the maintainers still interested in PRs for adding support for persistent values? I'm looking into whether that's possible.

I'd put together a prototype of this. Unit tests pass, but there's no real-world users and there aren't many tests of php-fpm in apcu/immutable_cache, either. This stores immutable arrays and strings (for arrays containing no objects or references) in shared memory, so retrieval is constant time when the serializer/unserializer can be avoided

https://github.com/TysonAndre/immutable_cache-pecl https://github.com/TysonAndre/immutable_cache-pecl#benchmarks

TysonAndre commented 1 year ago

> Also, are the maintainers still interested in PRs for adding support for persistent values? I'm looking into whether that's possible.

> I'd put together a prototype of this. Unit tests pass, but there's no real-world users and there aren't many tests of php-fpm in apcu/immutable_cache, either. This stores immutable arrays and strings (for arrays containing no objects or references) in shared memory, so retrieval is constant time when the serializer/unserializer can be avoided

I'm planning on putting immutable_cache on PECL in 7 days if there are no objections. https://marc.info/?l=pecl-dev&m=166838234522037&w=2 - separately, if any APCu leads want to join as co-leads/maintainers I could add them.

There's multiple leads of APCu, so I'm not sure who to ask. https://pecl.php.net/package/APCu

Bug reporting/investigation would be easier if users could install/uninstall immutable caching separately - Modifying a large existing project with concurrency/serialization/unserialization/locking/allocators can be error prone - if users can reproduce issues they encounter even without immutable_cache/APCu then it'd be easier to debug.

pereorga commented 1 year ago

This looks awesome @TysonAndre. In one of my current projects, the use case is 99% immutable_cache and 1% apcu, so I'll be looking at this for sure.