linsomniac / python-memcached

A python memcached client library.

let's please revisit CAS support #16

Open belm0 opened 11 years ago

belm0 commented 11 years ago

I realize this will be controversial given how long the Python memcache API has been around. Relative to the memcached protocol, Python's API doesn't support CAS beyond very trivial use cases.

The Python API forces implementations to manage CAS ID's implicitly. This results in several problems:

CAS ID lifetime management

Since an implementation doesn't know whether a CAS ID will be needed later, it must hold on to all ID's indefinitely. Granted, this is not much of a problem for applications with a fixed number of items involved in CAS. However, there are use cases which can yield an unbounded number of such items, for example when implementing concurrent collections on memcache using an item per node. (I have https://github.com/google/memcache-collections which implements a deque this way.)

While the API does provide reset_cas(), it is of limited use given that the retained CAS ID's are global state. This makes it very hard to combine different libraries employing CAS into one application, since they will inevitably have different CAS ID lifetime requirements. Furthermore, the Python memcache implementation makes bugs related to CAS ID lifetime exceedingly hard to detect, since a cas() call for which no CAS ID can be found silently succeeds.
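To make the failure mode concrete, here is a toy in-memory model of implicit CAS ID handling. It is only a sketch of the behavior described above, not the library's code: `ImplicitCasClient`, its internal `_store`, and the way the "server" lives inside the client are all assumptions made for illustration.

```python
class ImplicitCasClient:
    """Toy in-memory model of implicit CAS ID handling (not the real client)."""

    def __init__(self):
        self._store = {}     # key -> (value, cas_id); stands in for the server
        self._next_cas = 1
        self.cas_ids = {}    # key -> cas_id; client-side cache that only grows

    def set(self, key, value):
        # Every write gives the item a fresh CAS ID, as a server would.
        self._store[key] = (value, self._next_cas)
        self._next_cas += 1
        return True

    def gets(self, key):
        if key not in self._store:
            return None
        value, cas_id = self._store[key]
        self.cas_ids[key] = cas_id   # remembered implicitly, never expired
        return value

    def cas(self, key, value):
        if key not in self.cas_ids:
            # No remembered ID: quietly degrade to an unconditional set.
            return self.set(key, value)
        if key not in self._store or self._store[key][1] != self.cas_ids[key]:
            return False
        return self.set(key, value)

    def reset_cas(self):
        # Drops every remembered ID, including ones other code still needs.
        self.cas_ids = {}


client = ImplicitCasClient()
client.set("counter", 1)
client.gets("counter")
client.reset_cas()                  # e.g. another library clearing the shared cache
client.set("counter", 99)           # stands in for a concurrent writer
result = client.cas("counter", 2)   # overwrites 99 instead of reporting a conflict
```

In this model the final `cas()` reports success even though the item changed after it was read, which is the kind of bug that is hard to notice in practice.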

Aside: The current API limits CAS use to single reads via gets(); it isn't possible to use get_multi() with CAS. For this, App Engine extended get_multi() with a for_cas parameter.

Inability to share CAS ID's

A very useful building block for distributed data structures is the MCAS operation, i.e. atomic CAS across multiple, independent addresses. It's possible in general to implement MCAS on top of the single CAS primitive, as described in "Practical lock-freedom", Keir Fraser, 2004 (http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-579.pdf). This would be interesting to implement on top of memcache, as it would allow atomic mutation of multiple items, even when they exist on different memcached servers in a cluster. However, the lock-free algorithm requires other processes to "help" when they find transactions in an intermediate state. To implement this on memcache, the CAS ID's would need to be stored, shared, and used among clients (which may be on different threads, processes, or machines). The memcached protocol supports this, but the Python API does not.
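As a rough illustration of what sharing a CAS ID between clients could look like, here is a minimal in-memory sketch. It is not a proposal for the actual API: `TokenStore`, the `(value, cas_id)` return shape of `gets()`, and the `shared_record` dict are hypothetical names chosen for this example, and the store merely stands in for a memcached server.

```python
class TokenStore:
    """In-memory stand-in for a memcached server (illustration only)."""

    def __init__(self):
        self._items = {}      # key -> (value, cas_id)
        self._next_cas = 1

    def set(self, key, value):
        # Every write gives the item a fresh CAS ID.
        self._items[key] = (value, self._next_cas)
        self._next_cas += 1

    def gets(self, key):
        """Return (value, cas_id), or (None, None) if the key is missing."""
        return self._items.get(key, (None, None))

    def cas(self, key, value, cas_id):
        """Store value only if the item's CAS ID still equals cas_id."""
        current = self._items.get(key)
        if current is None or current[1] != cas_id:
            return False
        self.set(key, value)
        return True


store = TokenStore()
store.set("node:1", "pending")

# Process A reads the item and publishes the CAS ID it observed,
# here via a plain dict standing in for a shared transaction record.
value, token = store.gets("node:1")
shared_record = {"key": "node:1", "cas_id": token}

# Process B, which never called gets() itself, "helps" finish the update.
helped = store.cas(shared_record["key"], "committed", shared_record["cas_id"])

# Reusing the same token afterwards is rejected: the item has changed.
stale = store.cas(shared_record["key"], "other", shared_record["cas_id"])
```

The point of the sketch is only that the token is an ordinary value the caller holds, so it can be stored and handed to another client; a full MCAS implementation would need considerably more than this.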

I don't have a proposal yet, but I'd like to investigate extending the Python memcache API to support explicit management of CAS ID's.

linsomniac commented 11 years ago

Does the python-memcached2 interface suit this need better? I'm open to a fix for python-memcached, but I wonder if the new API resolves this issue. Be sure to pick up a release tag, because the code is currently broken as I do some restructuring.

belm0 commented 11 years ago

I acknowledge that python-memcached2 does the right thing, but what I'm needing is for the ubiquitous Python memcache API (implemented by python-memcached, pylibmc, App Engine, etc.) to support explicit CAS ID's. Starting with python-memcached is the logical choice.

To be clear, if I only needed improved CAS support for some one-off Python application I was authoring, python-memcached2 might be fine. Rather, I'm trying to provide libraries making advanced use of CAS, and the intended audience would be projects already using the standard memcache API.

linsomniac commented 11 years ago

It sounds to me like applications will need to change in any case to be able to achieve what you are proposing, so targeting projects already using the standard memcache API probably doesn't entirely make sense. Unless you can come up with a way of supporting the CAS changes you'd like while leaving the API unchanged. Consumers would still have to enable CAS usage though, right?

However, I will say that it's my target that python-memcached2 will become the new standard API, because the code is better tested and documented, it's much cleaner, it targets Python 2.x and 3.x, etc... One item I have in the bug list for a 1.0 release is to provide a backwards-compatible shim.

That said, I'm not at all opposed to your idea.

Sean

belm0 commented 11 years ago

Let me give an example: application X is an application using the existing memcache API. For whatever reason (original developers departed, small team, codebase is large, poor encapsulation of memcache usage, poor test coverage, etc.) switching the application to python-memcached2 would take 2 months and is not a good value. That shouldn't preclude the app from implementing some new feature using a client library which provides distributed collections or MCAS on top of memcache.

I haven't thought about how backwards-compatible explicit ID's for python-memcached would look, but I'm guessing it may not be elegant. I'll put together some options to consider.

linsomniac commented 11 years ago

Isn't this a good example of a problem that the backwards-compatible API I mentioned previously for python-memcached2 would solve?

belm0 commented 11 years ago

I think we're just looking at this with different lenses. I'd like the libraries I write to work on different platforms (memcached, App Engine) and with different client implementations (pure Python, pylibmc). The memcache-collections code I mentioned works with all of these and anything else that implements the current memcache interface. It would be a much taller order to have App Engine and pylibmc adopt your completely new API than a backwards compatible enhancement to make CAS more usable.

I would probably take a different approach than you are with python-memcached2 in order to leverage the userbase of the existing API (and not limited to python-memcached users). Rather than put it here, I'll assemble some feedback into a python-memcached2 issue.

linsomniac commented 11 years ago

Understood. I know that a migration to python-memcached2 will be a longer-term path, but I decided to give it a shot, starting from a completely different direction and making something more pythonic. As I've mentioned, before releasing it for final consumption I'm planning a compatibility layer for the current API, so both options will be available, just in much cleaner, better tested and documented code.

But I expect it'll also have to take advantage of the connection pooling that I'm currently working on, as that's one of the biggest reasons that the OpenStack Swift project implemented their own memcache library.

I'm definitely open to your proposal for the current memcache API, but to provide feedback on it I need to understand the use cases. Thanks for working through this with me.

Sean

fakemumbler commented 10 years ago

"Furthermore, the Python memcache implementation makes bugs related to CAS ID lifetime exceedingly hard to detect since it silently succeeds cas() calls for which a CAS ID couldn't be found."

As a memcache newbie, I was bitten badly by this when using vpelletier/python-memcachelock (which relies entirely on the cas functionality). Its example doesn't mention cache_cas (a backwards-incompatible change in 2011?), and without it the .cas function is equivalent to a set.

Why does cas fall back to a set rather than scream loudly?

Moreover, wouldn't it cover 99% of cases if memcache simply cached the (key, cas_id) from only the last 'gets' call (no cache blow-out, and basic cas operations work)?
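A minimal sketch of both ideas (fail loudly, and remember only the most recent gets) might look like the following. Everything here is hypothetical: `LastGetsClient` and `MissingCasIdError` are made-up names, and the in-memory `_items` dict stands in for a real memcached server.

```python
class MissingCasIdError(Exception):
    """Raised when cas() is called without a matching prior gets()."""


class LastGetsClient:
    """Toy client that remembers only the CAS ID from the last gets()."""

    def __init__(self):
        self._items = {}      # key -> (value, cas_id); stands in for the server
        self._next_cas = 1
        self._last = None     # (key, cas_id) from the most recent gets()

    def set(self, key, value):
        self._items[key] = (value, self._next_cas)
        self._next_cas += 1
        return True

    def gets(self, key):
        item = self._items.get(key)
        if item is None:
            self._last = None
            return None
        value, cas_id = item
        self._last = (key, cas_id)   # replaces any earlier entry: no growth
        return value

    def cas(self, key, value):
        if self._last is None or self._last[0] != key:
            # Scream loudly instead of degrading to a plain set.
            raise MissingCasIdError("no CAS ID cached for %r" % (key,))
        cas_id = self._last[1]
        self._last = None            # each remembered ID is used at most once
        current = self._items.get(key)
        if current is None or current[1] != cas_id:
            return False
        return self.set(key, value)


client = LastGetsClient()
client.set("lock", "free")
client.gets("lock")
first = client.cas("lock", "held")   # succeeds: the ID from gets() still matches

try:
    client.cas("lock", "free")       # no gets() since the last cas()
    raised = False
except MissingCasIdError:
    raised = True
```

The trade-off in this sketch is that interleaved gets() calls on different keys invalidate each other, so it covers the simple read-then-write pattern only.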

sebhaase commented 3 years ago

Is this a "wontfix"? And also, please raise an exception rather than silently falling back to an unchecked "set"!!