matthiasmullie / scrapbook

PHP cache library, with adapters for e.g. Memcached, Redis, Couchbase, APC(u), SQL and additional capabilities (e.g. transactions, stampede protection) built on top.
https://www.scrapbook.cash
MIT License
315 stars 27 forks source link

Collection or server? #16

Closed mindplay-dk closed 7 years ago

mindplay-dk commented 7 years ago

Hi Matthias,

Please take a look at this comment I posted for the PSR-16 review.

I have the same issue/question regarding Scrapbook.

For example, Scrapbook's Memcached-adapter takes a Memcached instance via the constructor - and while I could use Memcached::OPT_PREFIX_KEY with setOption() prior to constructing the adapter, the implementation calls Memcached::flush(), which would flush the entire server, not just a single collection, so while the issue of key-collisions could be addressed that way, this approach does not effectively result in truly having multiple collections.

Of course there are cases where you wish to clear the entire cache (typically on deployment) but there are certainly cases where you wish to clear only a collection.

Thoughts?

matthiasmullie commented 7 years ago

In my opinion, prefixes/collections/... are misuses, to store different kinds of data in the same server, even though they really belong in different cache servers. Similar to how prefixing database tables was once widespread when most shared hosting providers only provided 1 DB. And running multiple cache servers is also becoming increasingly easy.

And if so desired, multiple different cache instances could be initialized (like how you suggested Memcached::OPT_PREFIX_KEY). Any shortcomings or failures to isolate multiple sources of data on 1 server are on the backend: Memcached doesn't support flushing only a subset of data, neither does APC, ...

But even if I disagree with its usage, I recognize that people do use/want it. I'm also confident that it must be possible to come up with clever approaches to overcome the shortcomings on the backend. I'll look into it (in a separate branch: https://github.com/matthiasmullie/scrapbook/tree/namespace) and I'll report back.

Currently, I'm exploring adding a "setNamespace($namespace)" method to all KeyValueStore implementations, and this would be the behavior:

// in main namespace
$cache->set('key', 'value');
$cache->get('key'); // returns "value"

// in other namespace
$cache->setNamespace('ns');
$cache->get('key'); // returns false
$cache->set('key', 'other-value');
$cache->get('key'); // returns "other-value"

// back to main namespace
$cache->setNamespace();
$cache->get('key'); // returns "value"

// $cache->flush() inside a namespace would clear out everything in that "namespace".
// $cache->flush() in global namespace would clear out everything, including inside other namespaces.

What do you think? Alternative ideas are also welcome.

mindplay-dk commented 7 years ago

Well, first off, it sounds like Larry Garfield and Jordi Boggiano agreed with me, that the PSR-16 abstraction needs to be defined as representing a collection, not a server.

I agree that prefixes are misuse, but that does not imply that collections are necessarily misuse - with a file-based cache, for example, you can simply point to different root-folders.

So I'm not sure if your KeyValueStore represents a collection or a server - but currently, it appears to represent a collection, even if for e.g. Memcached, that collection also happens to be a server, but you seem to agree that prefixes are misused, so all in all, what you have currently seems to align well with PSR-16.

Adding support for namespaces, on the other hand, won't align well with PSR-16, but may align better with PSR-6, not sure. PSR-16 at least is a collection-abstraction, so I would prefer not introducing the concept of namespaces, which does seem like misuse, if it's built-in as a feature that somehow applies to all types of back-ends; because, in reality, it doesn't.

In my opinion, mutable namespaces is a particularly bad idea, because that literally means you can change collections mid-stream, which seems really wrong. I don't think the namespace, if introduced, should be mutable state, but I'd prefer it didn't exist at all.

Instead, I'd prefer to have, for example, an extended adapter for each applicable back-end, since multi-collection support can be optimized in various different ways for different back-ends, and since some back-ends (file system, at least) have multi-collection support "naturally" without having to implement anything.

So for, say, Memcached, I could picture having a MemcachedCollection adapter that extends Memcached with a multi-collection key-strategy, e.g.:

$connection = new \Memcached();
$connection->addServer('localhost', 11211);

$article_cache = new MemcachedCollection($connection, "articles");
$session_cache = new MemcachedCollection($connection, "sessions");

The second argument here is a collection ID, which might be used internally as a prefix, or a seed for hashing, whatever makes sense for the back-end.

So this adapter would implement collection-support in the optimal way for Memcached - other adapters might use different strategies, some might not support multiple collections at all if it doesn't make sense for that back-end. Other adapters might need to do anything special to support multiple collections at all, such as a file-system adapter, which might just have a constructor signature like (string $root_path, string $collection_name = null) allowing an optional collection-name to designate a sub-folder under the given $root_path.

I think that solving this individually for each back-end is a better approach than trying to solve it systemically for all types of back-ends, because, even if that worked, a common strategy to mitigate key-collissions would have horrible performance-characteristics if applied to, say, a file-system adapter, which can use the much faster approach of simply using multiple sub-folders.

matthiasmullie commented 7 years ago

I also didn't really like the mutable namespace, but was the best I could come up with at that point.

I'm also not a fan of constructing yet another brand new set of classes for these collection. In part because it would make it impossible to reliably do server-wide flushing for all adapters (e.g. a MemoryStore "server" would not know about other "MemoryStore" collections that it would have to wipe out - or would it not have to - confusing...)

What I'm working on now is very similar. It also created a new KeyValueStore object for a specific namespace/collection/prefix/whatever (not mutable), but it's created by the server object (which can keep track of it's collections this way) It looks like this:

$connection = new \Memcached();
$connection->addServer('localhost', 11211);

$server = new MatthiasMullie\Scrapbook\Adapters\Memcached($connection);
$article_cache = $server->collection("articles");
$session_cache = $server->collection("sessions");

// all of these are different keys
$server->set('key', 'value one');
$article_cache->set('key', 'value two');
$session_cache->set('key', 'value three');

// this removes 'value two'
$article_cache->flush();

// this removes everything ('value one' and 'value three')
$server->flush();

Does an implementation like this make sense?

mindplay-dk commented 7 years ago

Looks a lot like what I had in mind.

What comes out of $server->collection("articles") is the actual PSR-16 instance then, right? (or the instance you can decorate with a PSR-16 adapter more likely?)

matthiasmullie commented 7 years ago

What comes out of there would be a KeyValueStore instance, which you can then wrap inside a PSR-6 or PSR-16:

$connection = new \Memcached();
$connection->addServer('localhost', 11211);
$server = new MatthiasMullie\Scrapbook\Adapters\Memcached($connection);
$collection = $server->collection('articles');
$simplecache = new MatthiasMullie\Scrapbook\Psr16\SimpleCache($collection);
mindplay-dk commented 7 years ago

Appears to make sense :-)

mindplay-dk commented 7 years ago

One note, the $server->collection() call is actually a call to a factory-method, correct? You would get a new instance every time? So the name should likely include a verb, e.g. createCollection().

matthiasmullie commented 7 years ago

It will not necessarily create a new instance every time. MemoryStore (probably the only one) can return the same instance if the same "collection" is requested again, or it wouldn't be able to access what's in that "collection" already (since it's inside the object itself). So I'm thinking getCollection() looks best?

I'm also not 100% sure whether to call these things "collections" or something else. "Collections" seems best at this point though, since some are prefixed (APC, Memcached), others are a different database (Redis), ... "Collections" seems like a pretty good description that captures all methods of having a subset of data.

mindplay-dk commented 7 years ago

I'm not sure why a memory-based store would even provide collection support? Every instance presumably is automatically a new collection - there's no reason to name them.

There's never (for any store) any reason to create multiple instances representing the same collection. Doing so could lead to problems if a collection isn't stateless, if for example it has some internal (accidental) state like a buffer.

Creating multiple instances of the same collection would seem to indicate a problem with your dependency management - like not using dependency injection. If everything is injected top-down, nothing should ever depend on anything other than the (PSR-6, PSR-16, built-in) collection-abstractions.

matthiasmullie commented 7 years ago

Collections in MemoryStore are pretty useless in theory. But I also want to support them there, mostly because MemoryStore makes for a great cache backend to run tests against: there are no dependencies & response is immediate. While you would never create collections in MemoryStore, you may want to do it in another store (for production), but test that code with a MemoryStore.

I agree with the other statement though. There's probably no good reason to make reusing a collection easy. I did it to have the same experience for MemoryStore as you would when you create a collection for another server-backed store: even though it's a new object, it represents the same data on the server. But there's probably no valid reason to support that; it definitely shouldn't be encouraged to just keep fetching the same collection over and over again.

mindplay-dk commented 7 years ago

it definitely shouldn't be encouraged to just keep fetching the same collection over and over again.

Exactly.

createCollection() is explicit about the fact that you're getting a new collection-instance.

getCollection() might imply a level-1 cache inside the object, providing the same instance for the same collection, which could be problematic, since this would mean all collections live forever inside that object and can never fall out of scope and get released.

(it's not something PHP devs are used to thinking much about, but long-running apps under php-pm or socket-servers etc. are likely to become more common in the near-future, as the importance of real-time features on websites increases...)

mindplay-dk commented 7 years ago

If it's useful, we released this today, specifically for integration-testing:

https://github.com/kodus/mock-cache

matthiasmullie commented 7 years ago

You've won me over on how getCollection should behave, but I'm still not sold on the name createCollection. I feel like "create" can also imply that it's "creating" something new on the server, like a sandbox. But the values you get from there are the one's you may already have stored earlier. I think "create" may set the wrong expectation there.

Naming things is always hard. I fail to come up with a better name than get or create, one which clarifies that we're creating a new object, without implying anything about what it does on the server (implementation can vary between adapters anyway) and still being memorable (not something crazy like getNewInstanceForCollection or so)

I'm going to update the docs to make it very clear what is happening:

    /**
     * Returns an isolated subset (collection) in which to store or fetch data
     * from.
     *
     * A new KeyValueStore object will be returned, one that will only have
     * access to this particular subset of data. Exact implementation can vary
     * between adapters (e.g. separate database, prefixed keys, ...), but it
     * will only ever provide access to data within this collection.
     *
     * It is not possible to set/fetch data across collections.
     * Setting the same key in 2 different collections will store 2 different
     * values, that can only be retrieved from their respective collections.
     * Flushing a collection will only flush those specific keys and will leave
     * keys in other collections untouched.
     * Flushing the server, however, will wipe out everything, including data in
     * any of the collections on that server.
     *
     * @param string $name
     *
     * @return KeyValueStore A new KeyValueStore instance representing only a
     *                       subset of data on this server
     */

I think this makes it reasonably clear what's happening?

mindplay-dk commented 7 years ago

I think this makes it reasonably clear what's happening?

Yes :-)

On the terminology, I think create is pretty much the de-facto standard when it comes to factory-methods. Yes, it's a bit ambiguous in this case, but I wouldn't worry too much about it. Even "create new instance" could imply "on the server", so I don't think there's any practical way to deal with this ambiguity.

But you only need to read the doc-block once to settle any confusion and realize that this is just a factory-method, like most methods named "create", so I don't think it's too bad. Trying to come up with a different name might just make it even more confusing...

matthiasmullie commented 7 years ago

Collections have been merged & released (in 1.4.0)

I've stuck with getCollection: there's less ambiguity about what's going on (all other methods are about what goes on on the cache server), and getCollection doesn't exactly act like a standard factory method anyway.

Thanks for this report!

mindplay-dk commented 7 years ago

I personally would not have modeled the namespace as a mutable property of adapters, as this enables you to effectively change the collection an instance represents.

Also, I wonder if you benchmarked this? The impact on the APC adapter, for example, could be pretty high - clearing a collection is going to take much longer, now that it has to search through keys.

I'm not sure if you misunderstood what I was suggesting, or if this was just your preferred approach - what I was suggesting was not a common factory-interface for the creation of cache-collections, I was actually trying to suggest it be implementation-specific, since some back-ends are going to support it directly, some not at all, and some are going to support it naturally (file-based) without adding a collection-name concept, e.g. root-path would suffice for a file-based cache, and also changing the namespace would effectively change the root-folder, which could be risky or error-prone.

Some people might well want a single, dedicated server for a highly-optimized single APC collection. I would expect to provide a collection ID to an adapter implementation that supports the concept of multiple collections.

Well, that's how I was thinking about the problem, but I'm not sure how that idea would fit with your framework. It's probably fine either way, I just wanted to clarify this is not actually what I was suggesting - I can see your approach having some advantages in terms of consistency and bootstrapping (ease of use) but some potential drawbacks in terms of complexity and overhead. (I tend to favor simplicity over convenience, but that's a personal preference.)

matthiasmullie commented 7 years ago

The namespace is not a mutable property of adapters. You can get an object representing a collection from the object representing the server, but both will be distinct objects. You can get objects for multiple collections, but they'll all be distinct objects, and the namespace is not mutable.

// $cache is a KeyValueStore implementation
$cache = new MatthiasMullie\Scrapbook\Adapters\Apc();
// flush on the "server" still does what it used to;
// in the case of APCu: apcu_clear_cache
$cache->flush();

// from the object representing the cache server, we can
// fetch an object representing a subset (=collection)
// $collection is also a KeyValueStore implementation
$collection1 = $cache->getCollection('articles');
// and another one
$collection2 = $cache->getCollection('users');

// flushing a collection will only affect what's inside it;
// in this case: iterate all "articles" keys and remove them
$collection1->flush();

Flushing an APC collection will indeed take a little longer because there is no way to clear a subset of keys in 1 operation. But flushing a "server" still works exactly the same.

Both a "server" and "collection" are KeyValueStore implementations, and can be wrapped inside a PSR-6 or PSR-16 implementation.

$simplecache = new MatthiasMullie\Scrapbook\Psr16\SimpleCache($cache);
$simplecollection = new MatthiasMullie\Scrapbook\Psr16\SimpleCache($collection1);

The implementation of "collections" is specific to each adapter. And it's certainly still possible to use multiple "servers" instead of having 2 "collections". Both are KeyValueStore implementations, so they can be used interchangeably. Only the bootstrapping will be different:

$first = new MatthiasMullie\Scrapbook\Adapters\Memcached($collectionOne);
$second = new MatthiasMullie\Scrapbook\Adapters\Memcached($collectionTwo);

vs.

$server = new MatthiasMullie\Scrapbook\Adapters\Memcached($collection);
$first = $server->getCollection('one');
$second = $server->getCollection('two');

But I believe I understand what you mean. You were thinking about something along these lines:

$someServer = new MatthiasMullie\Scrapbook\Adapters\Flysystem('/some/path');
$someCollections = new MatthiasMullie\Scrapbook\Adapters\Flysystem('/some/other/path');

For filesystem, this happens to work out of the box already. For other adapters, there have to be some differences in functionality (e.g. flush on APCu, Memcached & Couchbase can only be done on server level), so I'd need different classes to differentiate between a "collection" and a "server" for those adapters, because they have to work differently.

Setting it up seemed quite confusing & prone to errors, then, IMO. Having a single method "getCollection" that does it for you seemed to a good compromise :) And the way they work (e.g. APCu having to iterate keys on a collection) wouldn't be different anyway.

mindplay-dk commented 7 years ago

The namespace is not a mutable property of adapters.

I was looking at #17 which at some point had mutable namespaces, looks like that changed? Reviewing what's in master, it makes a lot more sense now - looks like everything I hoped for! :-)

matthiasmullie commented 7 years ago

Yeah I experimented with that for a bit, but you rightfully convinced me that was not the best approach ;)