Closed casperisfine closed 2 years ago
@casperisfine I want to hear more details about your use case before checking your idea deeply. My (our) assumption of usage is to instantiate packer/unpacker via a factory, then to use those in following many serialization/deserialization operations. If you do so, instantiation costs should be ignorable.
What does prevents you from a such way? Inconvenient methods of Packer/Unpacker? In other words, can't we solve your problem by introducing a kind of proxy object to a packer and an unpacker like this?
packer_unpacker_proxy = factory.proxy # instantiate a packer and an unpacker here
packer_unpacker_proxy.dump(object) # => string
packer_unpacker_proxy.load(string) # => object
I want to hear more details about your use case before checking your idea deeply.
Sure. I recently open source a large part of our code around this: https://github.com/Shopify/paquito:
We have many use case, but I'll focus on one to make it simpler: we use it to serialize Rails.cache
content
My (our) assumption of usage is to instantiate packer/unpacker via a factory, then to use those in following many serialization/deserialization operations.
That explains why packer/unpacker instantiation wasn't particularly optimized.
What does prevents you from a such way? Inconvenient methods of Packer/Unpacker?
We're in a multi-tenant environment, so we want to be 100% sure buffer won't leak data from one request cycle to another.
I did explore re-using packer/unpacker about a year ago, but ran into various cases were I wasn't confident enough data wouldn't leak. For instance if a type unpacker fails, it leaves data in the buffer
e.g.:
require 'msgpack'
$factory = MessagePack::Factory.new
$factory.register_type(0, Symbol, packer: ->(s) { s.name }, unpacker: ->(p) { raise TypeError })
$unpacker = $factory.unpacker
def roundtrip(object)
$unpacker.feed($factory.dump(object))
ret = $unpacker.full_unpack
raise 'missmatch' unless ret == object
end
roundtrip([1, 2, 3])
begin
roundtrip([1, :foo, 3])
rescue TypeError => e
p [e.class, e.message] # expected
end
roundtrip([1, 2, 3]) # Still raising TypeError
I remember looking into callingunpacker.reset
before feed()
, but somehow I still wasn't very confident (I don't quite remember why, I may need do dig into my old work).
So I'm sure it's possible, but maybe the API could be more convenient for it. Hence why I suggested to bake this in msgpack itself as an alternative.
I still think a bufferless codepath would be preferable both perf and safety wise for cases where you only deal with strings and not IO, but I also understand it would be much more code.
Ok, so I tried to implement pooling again: https://github.com/Shopify/paquito/pull/4/files#diff-4e8bf10e0385f27cde5a6ed34a1a9bad350bb00044fe1efadea6092d7986cc5c
It's indeed much faster (msgpack 1.4.4
):
Calculating -------------------------------------
baseline 976.168k (± 4.2%) i/s - 4.886M in 5.014233s
pooled 2.304M (± 0.8%) i/s - 11.700M in 5.078133s
Comparison:
pooled: 2304079.2 i/s
baseline: 976168.5 i/s - 2.36x (± 0.00) slower
A couple notes:
There is Packer.clear
and Unpacker.reset
, could be nice for both to have either clear
or reset
.
Also I think it would make sense to have this baked in MessagePack
, but there is a few challenges, because things like symbolize_keys
are passed to Factory#packer()
making it more complicated to pool properly.
I prefer pooling because of its maintenance cost. There are not so many maintainers (as you may know 😛) so I want to keep the implementation size as small as possible. Completely different Unpacker implementation seems too big to me.
I don't have clear idea about handling Unpacker options for pooled objects, though.
I don't have clear idea about handling Unpacker options for pooled objects, though.
What if we had Unpacker#symbolize_keys=
? Or something like that?
(untested pseudo code)
class Factory
def load(io_or_string, **opts)
unpacker = checkout_unpacker
begin
unpacker.set_options(opts)
unpacker.unpack(io_or_string)
ensure
checking_unpacker(unpacker)
end
end
end
I prefer pooling because of its maintenance cost. There are not so many maintainers (as you may know 😛) so I want to keep the implementation size as small as possible.
On a side note, not that I'm advocating for making the code more complicated than necessary, but know that we use msgpack
a lot, so we're committed to help maintaining it.
What if we had Unpacker#symbolize_keys= ? Or something like that? (untested pseudo code)
That seems to not work well because opts
of Unpacker are also passed to its internal Buffer when initialized.
So, how about having multiple pools per opts.hash
?
On a side note, not that I'm advocating for making the code more complicated than necessary, but know that we use msgpack a lot, so we're committed to help maintaining it.
That's great to hear! Currently, I'm ok, but I'll keep it in my mind ❤️
So, how about having multiple pools per
opts.hash
?
My worry about this is is that Unpacker
has 3 boolean options right now, so that's a lot of pools.
I think instead we could accept the options on the factory initializer, and raise if you pass options to a factory that has pooling enabled. e.g.:
f = Factory.new(pool: 2, symbolize_keys: true)
f.load(payload) # => Object
f.load(payload, symbolize_keys: false) # => raise SomeError
My worry about this is is that Unpacker has 3 boolean options right now, so that's a lot of pools.
My guess is, many users don't use so different combinations of options (many people just use 1 or 2 combinations of those options in a Ruby process). It may wrong.
I think instead we could accept the options on the factory initializer, and raise if you pass options to a factory that has pooling enabled. e.g.:
That looks good to me. Does f.packer(symbolize_keys: false)
also raise error?
Does
f.packer(symbolize_keys: false)
also raise error?
Yes.
Actually, once the pool
is created, we don't really want to allow register_type
etc anymore. So maybe a simpler interface would be something like:
f = MessagePack::Factory.new
f.register_type # and other init ..
pool = f.pool(size, options = {})
pool.load(str)
pool.dump(obj)
This can be done purely in Ruby I think.
Having a dedicated object pool
sounds nice.
It seems almost equal to the proxy object I suggested here. (Or inequal... I don't take care of it).
Implemented as https://github.com/msgpack/msgpack-ruby/pull/266
Context
As you may have seen in my recent performance PRs, I'm trying to improve
msgpack
performance for our use case.We use it a lot for caches etc, so our main use case looks like:
And the payload are of very varying size, some are fairly small. Parsing wise
msgpack
is very fast, but there's a fairly high flat cost for instantiating a newPacker / Unpacker
which I'm trying to reduce.Idea
Based on profiling, a good part of the
Unpacker
instantiation cost is for theBuffer
, and then we spend time copying the string inside the buffer, while in theory we could directly read from the Ruby String.So I'm think we could have the existing
Packer
use forIO
objects, and a new one specialized for strings that would skip the buffering.However this may lead to quite a lot more code and probably some duplication, hence why I'm gauging interest first.
@tagomoris is this some contribution you'd be interested in?
Alternatively (or additionally) if this seems too much, I'd be interested in exploring how the
Packer / Unpacker
instances could be safely re-used, with maybe some kind of pooling.