msgpack / msgpack-ruby

MessagePack implementation for Ruby / msgpack.org[Ruby]
http://msgpack.org/
Apache License 2.0
764 stars 117 forks source link

Idea: Specialized bufferless Unpacker for strings #255

Closed casperisfine closed 2 years ago

casperisfine commented 2 years ago

Context

As you may have seen in my recent performance PRs, I'm trying to improve msgpack performance for our use case.

We use it a lot for caches etc, so our main use case looks like:

factory.dump(object) # => String
factory.load(string) # => Object

And the payload are of very varying size, some are fairly small. Parsing wise msgpack is very fast, but there's a fairly high flat cost for instantiating a new Packer / Unpacker which I'm trying to reduce.

Idea

Based on profiling, a good part of the Unpacker instantiation cost is for the Buffer, and then we spend time copying the string inside the buffer, while in theory we could directly read from the Ruby String.

So I'm think we could have the existing Packer use for IO objects, and a new one specialized for strings that would skip the buffering.

However this may lead to quite a lot more code and probably some duplication, hence why I'm gauging interest first.

@tagomoris is this some contribution you'd be interested in?

Alternatively (or additionally) if this seems too much, I'd be interested in exploring how the Packer / Unpacker instances could be safely re-used, with maybe some kind of pooling.

tagomoris commented 2 years ago

@casperisfine I want to hear more details about your use case before checking your idea deeply. My (our) assumption of usage is to instantiate packer/unpacker via a factory, then to use those in following many serialization/deserialization operations. If you do so, instantiation costs should be ignorable.

What does prevents you from a such way? Inconvenient methods of Packer/Unpacker? In other words, can't we solve your problem by introducing a kind of proxy object to a packer and an unpacker like this?

packer_unpacker_proxy = factory.proxy # instantiate a packer and an unpacker here
packer_unpacker_proxy.dump(object) # => string
packer_unpacker_proxy.load(string) # => object
casperisfine commented 2 years ago

I want to hear more details about your use case before checking your idea deeply.

Sure. I recently open source a large part of our code around this: https://github.com/Shopify/paquito:

We have many use case, but I'll focus on one to make it simpler: we use it to serialize Rails.cache content

My (our) assumption of usage is to instantiate packer/unpacker via a factory, then to use those in following many serialization/deserialization operations.

That explains why packer/unpacker instantiation wasn't particularly optimized.

What does prevents you from a such way? Inconvenient methods of Packer/Unpacker?

We're in a multi-tenant environment, so we want to be 100% sure buffer won't leak data from one request cycle to another.

I did explore re-using packer/unpacker about a year ago, but ran into various cases were I wasn't confident enough data wouldn't leak. For instance if a type unpacker fails, it leaves data in the buffer

e.g.:

require 'msgpack'

$factory = MessagePack::Factory.new
$factory.register_type(0, Symbol, packer: ->(s) { s.name }, unpacker: ->(p) { raise TypeError })
$unpacker = $factory.unpacker

def roundtrip(object)
  $unpacker.feed($factory.dump(object))
  ret = $unpacker.full_unpack
  raise 'missmatch' unless ret == object
end

roundtrip([1, 2, 3])

begin
  roundtrip([1, :foo, 3])
rescue TypeError => e
  p [e.class, e.message]   # expected
end

roundtrip([1, 2, 3]) # Still raising TypeError

I remember looking into callingunpacker.reset before feed(), but somehow I still wasn't very confident (I don't quite remember why, I may need do dig into my old work).

So I'm sure it's possible, but maybe the API could be more convenient for it. Hence why I suggested to bake this in msgpack itself as an alternative.

I still think a bufferless codepath would be preferable both perf and safety wise for cases where you only deal with strings and not IO, but I also understand it would be much more code.

casperisfine commented 2 years ago

Ok, so I tried to implement pooling again: https://github.com/Shopify/paquito/pull/4/files#diff-4e8bf10e0385f27cde5a6ed34a1a9bad350bb00044fe1efadea6092d7986cc5c

It's indeed much faster (msgpack 1.4.4):

Calculating -------------------------------------
            baseline    976.168k (± 4.2%) i/s -      4.886M in   5.014233s
              pooled      2.304M (± 0.8%) i/s -     11.700M in   5.078133s

Comparison:
              pooled:  2304079.2 i/s
            baseline:   976168.5 i/s - 2.36x  (± 0.00) slower

A couple notes:

There is Packer.clear and Unpacker.reset, could be nice for both to have either clear or reset.

Also I think it would make sense to have this baked in MessagePack, but there is a few challenges, because things like symbolize_keys are passed to Factory#packer() making it more complicated to pool properly.

tagomoris commented 2 years ago

I prefer pooling because of its maintenance cost. There are not so many maintainers (as you may know 😛) so I want to keep the implementation size as small as possible. Completely different Unpacker implementation seems too big to me.

I don't have clear idea about handling Unpacker options for pooled objects, though.

casperisfine commented 2 years ago

I don't have clear idea about handling Unpacker options for pooled objects, though.

What if we had Unpacker#symbolize_keys= ? Or something like that?

(untested pseudo code)

class Factory
  def load(io_or_string, **opts)
    unpacker = checkout_unpacker
    begin
      unpacker.set_options(opts)
      unpacker.unpack(io_or_string)
    ensure
      checking_unpacker(unpacker)
    end
  end
end
casperisfine commented 2 years ago

I prefer pooling because of its maintenance cost. There are not so many maintainers (as you may know 😛) so I want to keep the implementation size as small as possible.

On a side note, not that I'm advocating for making the code more complicated than necessary, but know that we use msgpack a lot, so we're committed to help maintaining it.

tagomoris commented 2 years ago

What if we had Unpacker#symbolize_keys= ? Or something like that? (untested pseudo code)

That seems to not work well because opts of Unpacker are also passed to its internal Buffer when initialized. So, how about having multiple pools per opts.hash?

tagomoris commented 2 years ago

On a side note, not that I'm advocating for making the code more complicated than necessary, but know that we use msgpack a lot, so we're committed to help maintaining it.

That's great to hear! Currently, I'm ok, but I'll keep it in my mind ❤️

casperisfine commented 2 years ago

So, how about having multiple pools per opts.hash?

My worry about this is is that Unpacker has 3 boolean options right now, so that's a lot of pools.

I think instead we could accept the options on the factory initializer, and raise if you pass options to a factory that has pooling enabled. e.g.:

f = Factory.new(pool: 2, symbolize_keys: true)
f.load(payload) # => Object
f.load(payload, symbolize_keys: false) # => raise SomeError
tagomoris commented 2 years ago

My worry about this is is that Unpacker has 3 boolean options right now, so that's a lot of pools.

My guess is, many users don't use so different combinations of options (many people just use 1 or 2 combinations of those options in a Ruby process). It may wrong.

I think instead we could accept the options on the factory initializer, and raise if you pass options to a factory that has pooling enabled. e.g.:

That looks good to me. Does f.packer(symbolize_keys: false) also raise error?

casperisfine commented 2 years ago

Does f.packer(symbolize_keys: false) also raise error?

Yes.

casperisfine commented 2 years ago

Actually, once the pool is created, we don't really want to allow register_type etc anymore. So maybe a simpler interface would be something like:

f = MessagePack::Factory.new
f.register_type # and other init ..
pool = f.pool(size, options = {})
pool.load(str)
pool.dump(obj)

This can be done purely in Ruby I think.

tagomoris commented 2 years ago

Having a dedicated object pool sounds nice. It seems almost equal to the proxy object I suggested here. (Or inequal... I don't take care of it).

casperisfine commented 2 years ago

Implemented as https://github.com/msgpack/msgpack-ruby/pull/266