memcachier / memjs

A memcache client for node using the binary protocol and SASL authentication
MIT License

Support SpreadSASLMemcachedCache: sharding values > 1MB #32

Open reubano opened 11 years ago

reubano commented 11 years ago

Does this library support SpreadSASLMemcachedCache? See http://pythonhosted.org/Flask-Cache/.

SpreadSASLMemcachedCache

Same as SASLMemcachedCache, however it has the ability to spread a value across multiple keys if it is bigger than the memcached threshold, which by default is 1MB. Uses pickle.

alevy commented 11 years ago

I mean... no. This would be very difficult (and probably not desirable) to do exactly, since pickle is very Python-specific. However, doing something similar, where values are broken up, would be relatively straightforward (I've done it for Dalli). Not sure if this makes sense as a separate library/code snippet. If you get around to this I'll definitely appreciate a pull request. Otherwise, I might get around to it.

In general, there are some tricky tradeoffs among the strategies you might use to spread values across multiple keys, and I'm not sure how best to generalize them (as opposed to giving code snippets for how to write a custom thing for each app, or a higher-level library with multiple options, or something). But I'm very interested in seeing what can be done.

reubano commented 11 years ago

Can you post the code you used? The implementation doesn't matter (pickle vs. something else) as long as I can replicate the result.

reubano commented 11 years ago

Bump

reubano commented 10 years ago

Just curious why you closed this since the issue isn't solved.

alevy commented 10 years ago

Hrm... mostly I was just doing some cleanup, and hadn't touched this issue in almost a year... but...

I wasn't convinced this is a good feature to have at this level of library, since it would need to pick one choice among a set of imperfect semantics: what happens when one of the keys is evicted? Do you store keys sequentially (i.e. mykey_1, mykey_2, etc.) and read until you get a miss, or have a base key that contains a list of subkeys? Do you split up values automatically when one is too large to fit, or do you have a separate command?

My hunch is that all of the answers are correct for some people and I'm not sure there is a "good" one for most.

However, I can describe what I've done in a certain app algorithmically (a rough sketch in JS follows the steps below):

set:

  1. Split the data into chunks < 1MB.
  2. Set the base key to the number of chunks, retrieving the new version of the base key (exposing versions is not supported in memjs, but could and should be).
  3. Do a compare-and-swap with the base key's version number on each of the subsequent keys. If it fails with an earlier version number, redo using the new version number. If it fails with a later version number, abort -- another thread has already overwritten your work.

get:

  1. Get the base key along with its version number.
  2. Get all referenced subkeys 0-n, checking that each one's version matches the base key's. If a mismatched version is higher, start over -- another writer has overridden the value.
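
To make those steps concrete, here is a rough sketch in JS. None of it is memjs API today: getv, setv, and cas are hypothetical promise-returning methods standing in for the version and compare-and-swap support discussed in step 2, and split is the chunking helper shown later in this thread.

var MAX_PART = 1024 * 1024; // memcached's default value size limit

function spreadSet(client, key, value) {
  // 1. split the data into chunks < 1MB
  var parts = split(value, MAX_PART);
  // 2. set the base key to the chunk count, grabbing its new version
  return client.setv(key, String(parts.length)).then(function (version) {
    // 3. compare-and-swap each sub-key against the base key's version
    // (the redo/abort policy from step 3 is omitted for brevity)
    return Promise.all(parts.map(function (part, i) {
      return client.cas(key + '_' + i, part, version);
    }));
  });
}

function spreadGet(client, key) {
  // 1. get the base key along with its version number
  return client.getv(key).then(function (base) {
    if (base === null) return null;
    var n = parseInt(base.value.toString(), 10);
    var reads = [];
    for (var i = 0; i < n; i++) reads.push(client.getv(key + '_' + i));
    // 2. every sub-key must carry the base key's version; a mismatch means
    // an eviction or a concurrent writer, so the caller starts over
    return Promise.all(reads).then(function (results) {
      var consistent = results.every(function (r) {
        return r !== null && r.version === base.version;
      });
      if (!consistent) return null; // treat as a miss and retry
      return Buffer.concat(results.map(function (r) { return r.value; }));
    });
  });
}
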
reubano commented 10 years ago

Well, I'm far from an expert on this topic, but why not just do it the way the Python version does? There has to be a Node.js equivalent of pickle, right? At the least, can you post code showing how you worked around this issue before?

reubano commented 10 years ago

And to answer one of the questions: you would select a spread mode upon instantiating memjs. In that mode, any value > the max size would be split up automatically. Also, what's wrong with reading until you get a miss? Couldn't you just put it in a loop and place that in a try/except block?

alevy commented 10 years ago

pickle doesn't help much here; it's just a serialization format, and in any case memjs only deals with values of type String or Buffer, which are already serialized. So as far as splitting things up goes, that's just:

// largeValue is a `Buffer` to split; maxSize is the maximum size of any part
function split(largeValue, maxSize) {
  var parts = [];
  while (largeValue.length > 0) {
    parts.push(largeValue.slice(0, maxSize));
    largeValue = largeValue.slice(maxSize);
  }
  return parts;
}
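
For example, a 2.5MB buffer splits into two 1MB parts and one 0.5MB part:

var big = Buffer.alloc(2.5 * 1024 * 1024); // 2,621,440 bytes of zeroes
var parts = split(big, 1024 * 1024);
console.log(parts.length);    // 3
console.log(parts[2].length); // 524288 (the 0.5MB remainder)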

Here is a gist of some Ruby skeleton code for spreading values across the cache like this with Dalli: https://gist.github.com/alevy/5512907

alevy commented 10 years ago

The problem with just iterating until you get a miss is two-fold:

  1. When you actually get a miss, how do you know whether the miss is because you finished reading all the parts or because that sub-key was evicted from the cache? So you need some sort of terminator sequence; but then what if that sequence happens to exist in the value itself? Then you need to escape it in the value, etc...
  2. It's very slow to retrieve a long value -- you have to wait for a response from each subkey before proceeding to the next one (you could mitigate this by batching, say, 5 at a time, but that sucks for values smaller than 5 sub-keys). Ideally, you'd like to run most of the requests "quietly", so that you're not waiting a full round trip on each request before sending the next one. This is the difference between retrieving a large value in <10ms vs. many hundreds of ms (a sketch of the difference follows this list).
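
To illustrate the round-trip cost, here is a small sketch assuming a hypothetical promise-returning get (not memjs's callback API): the sequential version pays roughly one network round trip per sub-key, while the parallel version puts every request on the wire at once.

async function getChunksSequential(client, key, n) {
  // each get waits for the previous reply: ~n round trips for n chunks
  var chunks = [];
  for (var i = 0; i < n; i++) {
    chunks.push(await client.get(key + '_' + i));
  }
  return chunks;
}

function getChunksParallel(client, key, n) {
  // all requests are in flight at once: roughly one round trip total
  var reads = [];
  for (var i = 0; i < n; i++) reads.push(client.get(key + '_' + i));
  return Promise.all(reads);
}

At a ~1ms round trip, 100 sub-keys cost on the order of 100ms sequentially but stay close to a single round trip in parallel, which is the gap described above.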

I think spreading values automatically is problematic because it breaks performance assumptions. I expect that when I call a memcache function on a low-level library, I get the performance of a single request (on the order of 1ms). If I suddenly take a 2x-or-more performance hit because my value is slightly larger, that's less intuitive behavior (for a performance-sensitive library) IMO than an error.

alevy commented 10 years ago

Having said all that, I think this discussion has helped highlight some missing lower-level features that would make doing this easier and would also be useful on their own (specifically, a way to get at the version on get and set responses, as well as supporting compare-and-swap). I'll open new issues for those (and /cc you). I don't mind at all including a utility function (although probably in a different namespace) to split up values in a particular way -- we've basically already pseudo-coded the algorithm in this thread :).

reubano commented 10 years ago

I think you've misunderstood me. Spreading values wouldn't happen automatically. It would be an option at setup, e.g., client = memjs.Client.create({spread: true}). By default, spread would be false.
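
Roughly what that opt-in could look like as a thin wrapper. This is only a sketch: spread is not a real memjs option, and spreadSet and MAX_PART are the hypothetical helpers from the sketch earlier in the thread.

var memjs = require('memjs');

function createSpreadClient(servers, options) {
  var client = memjs.Client.create(servers, options);
  if (options && options.spread) {
    var plainSet = client.set.bind(client);
    client.set = function (key, value, opts) {
      // small values keep the normal single-request path
      if (value.length <= MAX_PART) return plainSet(key, value, opts);
      // oversized values get sharded across sub-keys instead of erroring
      return spreadSet(client, key, value);
    };
  }
  return client;
}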

alevy commented 10 years ago

Oh interesting. OK, I don't mind that as much... reopening.