madeintandem / hstore_accessor

Adds typed hstore-backed field support to ActiveRecord models.
MIT License

question: YAML serialization performance #55

Closed gingerlime closed 9 years ago

gingerlime commented 9 years ago

This gem looks great. Thanks for creating it and making it open-source!

I'm curious about the choice of YAML for serialization. If I'm not mistaken, it's probably not as fast as JSON or perhaps other serialization formats?

crismali commented 9 years ago

We used to use JSON for hash serialization but switched to YAML recently. The main reason for the change was that if you serialized something like this as JSON:

{ foo: Time.current }

deserialization would return a hash like this:

{ "foo" => "2015-01-26 15:10:54 UTC" } 

YAML keeps Ruby in mind so when you serialize and then deserialize with it you get the same thing back.
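The difference described above can be reproduced in plain Ruby. This is a minimal sketch, not code from the gem: `Time.utc(...)` stands in for `Time.current` (which needs ActiveSupport), and the `unsafe_load` fallback is an assumption to keep it working on newer Psych versions, where `YAML.load` restricts which classes it will build.

```ruby
require "json"
require "yaml"

# Plain-Ruby stand-in for Time.current, which requires ActiveSupport.
time = Time.utc(2015, 1, 26, 15, 10, 54)
original = { foo: time }

# JSON has no symbol or time type, so both the key and the value
# come back as plain strings.
from_json = JSON.parse(JSON.generate(original))

# YAML (Psych) writes symbols and timestamps in a form it can turn back
# into Ruby objects. Newer Psych versions restrict YAML.load, so fall
# back to unsafe_load when it exists. Only do this with trusted data.
yaml = YAML.dump(original)
from_yaml = YAML.respond_to?(:unsafe_load) ? YAML.unsafe_load(yaml) : YAML.load(yaml)

p from_json
p from_yaml
```

The JSON result has a string key and a string value, while the YAML result still has the `:foo` key and a `Time` value.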

gingerlime commented 9 years ago

As far as I see time/date doesn't use YAML though (EDIT: Ah, in case it's embedded in a hash... I see).

What about Ruby's Marshal as a potential serializer? It could also be faster.

From some limited tests, YAML does seem considerably slower, so I'm just trying to figure out the trade-offs.

crismali commented 9 years ago

Using Marshal seems reasonable. Have you run any benchmarks that you could share here?

gingerlime commented 9 years ago

I found this gist (from this post), which is a bit old, but it seems to produce similar results on my machine running Ruby 2.2:

                    user     system      total        real
array marshal   0.010000   0.000000   0.010000 (  0.003388)
array json      0.010000   0.000000   0.010000 (  0.011935)
array eval      0.020000   0.000000   0.020000 (  0.018352)
array yaml      0.420000   0.020000   0.440000 (  0.449878)
hash marshal    0.050000   0.000000   0.050000 (  0.052354)
hash json       0.040000   0.000000   0.040000 (  0.045483)
hash eval       0.090000   0.000000   0.090000 (  0.093732)
hash yaml       0.760000   0.000000   0.760000 (  0.779947)

I also ran some other tests on a larger nested hash, which also produced an order-of-magnitude difference between YAML and JSON / Marshal. This was based on a test harness I recently bumped into and seemed pretty solid. The original code tests something else, but it makes it easy to write your own tests. If you want I can try to create a fork with a few tests with hashes / arrays or one that you can tweak to run your own tests?
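For anyone who wants to run a comparison like this locally, here is a rough sketch using Ruby's standard `Benchmark` module. It is not the gist or the test harness mentioned above; the sample hash, the iteration count, and the variable names are all made up for illustration, and absolute numbers will vary by machine and Ruby version.

```ruby
require "benchmark"
require "json"
require "yaml"

# Hypothetical sample data: string keys and simple values, so all three
# formats return an equal hash after a round trip.
sample = { "name" => "widget", "tags" => %w[a b c], "dims" => { "w" => 10, "h" => 20 } }

# Newer Psych versions restrict YAML.load, so prefer unsafe_load if present.
yaml_load = ->(s) { YAML.respond_to?(:unsafe_load) ? YAML.unsafe_load(s) : YAML.load(s) }

# Each entry pairs a dump lambda with the matching load lambda.
serializers = {
  "marshal" => [->(o) { Marshal.dump(o) }, ->(s) { Marshal.load(s) }],
  "json"    => [->(o) { JSON.generate(o) }, ->(s) { JSON.parse(s) }],
  "yaml"    => [->(o) { YAML.dump(o) }, yaml_load]
}

iterations = 1_000 # arbitrary; raise it for steadier numbers

# Wall-clock seconds for `iterations` dump + load round trips per serializer.
timings = serializers.each_with_object({}) do |(name, (dumper, loader)), result|
  result[name] = Benchmark.realtime do
    iterations.times { loader.call(dumper.call(sample)) }
  end
end

timings.each { |name, seconds| puts format("%-8s %.6f", name, seconds) }
```

This measures a full dump-and-load round trip, which is closer to what a getter/setter pair would do than timing either direction alone.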

Perhaps this is premature optimization, and in the grand scheme of things, with all the Rails overhead, maybe this is just a drop in the ocean. It's just something that jumped out as a potential bottleneck.

I thought it best to check first before jumping to any conclusions. For example, I completely missed the issue with JSON that you mentioned. So thanks again for clarifying.

jhirn commented 9 years ago

Hi Ginger.

Thanks for using the gem and helping to contribute. We've had a lot of internal conversation here about removing the object type entirely from hstore accessor. As you've discovered, storing object types has more than its fair share of edge cases.

We could go to Marshal instead of YAML, but I'd prefer not to. It ties the data to the specific implementation of the object. It also destroys readability of the data through standard SQL tools. Binary is just weird all around.

There are two alternatives to storing an object as the value of an hstore column. One is to simply create a new ActiveRecord object and store its id as an integer value in your hstore column. This plays pretty well with standard AR relationship stuff. Another would be to flatten the values of your object into your hstore column, using a prefix as a namespace for the attributes that belong to the object you're trying to store (e.g. my_obj_attr1, my_obj_attr2, my_obj_attr3, etc.).
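The flattening alternative might look roughly like the following in plain Ruby. Both helper names are hypothetical and are not part of hstore_accessor; the sketch only shows the key-prefixing idea, and it assumes values are stored as strings.

```ruby
# Hypothetical helper: flatten an object's attributes into prefixed keys,
# e.g. { attr1: 1 } under "my_obj" becomes { "my_obj_attr1" => "1" }.
def flatten_for_hstore(prefix, attributes)
  attributes.each_with_object({}) do |(key, value), flat|
    flat["#{prefix}_#{key}"] = value.to_s # assumption: values end up as strings
  end
end

# Hypothetical inverse: collect the keys under one prefix back into a hash.
# Values stay strings; converting them back to types is a separate step.
def unflatten_from_hstore(prefix, row)
  marker = "#{prefix}_"
  row.each_with_object({}) do |(key, value), attrs|
    attrs[key[marker.length..-1]] = value if key.start_with?(marker)
  end
end

row = flatten_for_hstore("my_obj", attr1: 1, attr2: "two", attr3: 3.0)
p row
p unflatten_from_hstore("my_obj", row.merge("other_key" => "ignored"))
```

Each flattened key could then be declared with its own type, in the same way the gem's other fields are declared, which keeps the data readable from plain SQL.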

If neither of these alternatives is suitable, please provide more specifics of your use case to help us learn how to improve the gem.

-Joe

gingerlime commented 9 years ago

Hi Joe, Michael,

Thanks for taking the time to look into this and explain some of the challenges. I really appreciate it.

I can totally understand the rationale of picking YAML, but of course it has its price.

Having looked around a little, I'm wondering if you could adopt an approach similar to the one ActiveRecord::Store uses, namely letting the user provide a serializer (they call it a coder). From what I see, Rails uses IndifferentCoder, which would let you specify any object that implements load and dump.

ActiveRecord::Store seems to default to YAML too, but allows you to override the serializer with your own, if you want.

Not entirely sure how it changes your DSL/API and what it means in terms of implementing this, but it might be interesting to consider.
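To make the coder idea concrete: a coder is just an object that responds to `dump` and `load`. ActiveRecord::Store accepts one through its `coder:` option (for example `coder: JSON`). The sketch below is plain Ruby and does not touch either library; `JsonCoder` and `CodedField` are hypothetical names used only to show how a pluggable serializer could be injected.

```ruby
require "json"

# Hypothetical coder: any object that responds to dump and load will do.
module JsonCoder
  def self.dump(value)
    JSON.generate(value)
  end

  def self.load(string)
    string.nil? ? nil : JSON.parse(string)
  end
end

# Hypothetical stand-in for a serialized field whose coder is injected.
class CodedField
  attr_reader :raw # the serialized string that would be written to the column

  def initialize(coder)
    @coder = coder
    @raw = nil
  end

  # Serialize on write.
  def value=(value)
    @raw = @coder.dump(value)
  end

  # Deserialize on read.
  def value
    @coder.load(@raw)
  end
end

json_field = CodedField.new(JsonCoder)
json_field.value = { "plan" => "pro", "seats" => 3 }
p json_field.raw
p json_field.value

# Marshal already exposes dump and load, so it can be passed in unchanged.
marshal_field = CodedField.new(Marshal)
marshal_field.value = { plan: :pro, seats: 3 }
p marshal_field.value
```

The appeal of this design is that the default (YAML here) can stay as it is while individual fields opt into a different trade-off.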

Cheers Yoav

crismali commented 9 years ago

While there isn't a pluggable serialization system in the gem now, you could serialize whatever you want however you want by overriding the getters and setters for a string type. Something like:

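class Foo < ActiveRecord::Base
  # Declare bar as a plain string so the gem stores exactly what we hand it.
  hstore_accessor :data, bar: :string

  # Deserialize on read; return nil for unset values instead of raising.
  def bar
    raw = super
    raw.nil? ? nil : Marshal.load(raw)
  end

  # Serialize on write, then pass the string to the generated setter.
  def bar=(value)
    super(Marshal.dump(value))
  end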
end

gingerlime commented 9 years ago

Thanks @crismali - that's an interesting workaround. This loses the explicit type declarations though.

Is there a real difference between this and doing something like this without your gem using rails built-in store_accessor?

Also, when I try to store more complex data, the assignment works, but when I save I get this error: ERROR ActiveRecord::Base : ArgumentError: string contains null byte

maybe Marshal won't work even if it was set as a pluggable serializer then... as Joe said -- binary is blurgh (paraphrased)... didn't really dig too deep, just a very quick play-around in the console.
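One possible explanation for the null-byte error, offered as an assumption because the thread did not dig into it: Marshal's binary output can contain NUL bytes, which text columns do not accept. A common workaround is to Base64-encode the marshaled bytes before storing them. The helper names below are hypothetical, and the sketch uses only core `pack`/`unpack`.

```ruby
# Hypothetical helper: marshal a value, then Base64-encode it so the
# result is plain ASCII text with no NUL bytes.
def dump_text_safe(value)
  [Marshal.dump(value)].pack("m0") # "m0" is strict Base64 without newlines
end

# Hypothetical inverse: decode the Base64 text, then unmarshal it.
# Marshal.load should only ever be given trusted data.
def load_text_safe(text)
  Marshal.load(text.unpack("m0").first)
end

value = { count: 0, tags: %w[a b] }
raw = Marshal.dump(value)
encoded = dump_text_safe(value)

puts "raw contains NUL:     #{raw.include?("\x00")}"
puts "encoded contains NUL: #{encoded.include?("\x00")}"
p load_text_safe(encoded)
```

This avoids the error but adds roughly a third to the stored size and makes the column even less readable from SQL, which reinforces Joe's point about binary formats.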

In any case, I'll let this thing rest. I'm not sure how much of a concern this really is in terms of real performance hit. I was mostly curious and appreciate your openness to look into this.

gingerlime commented 9 years ago

Another 2 cents, and I guess this is what you guys meant about discussing the whole type thing: storing a hash inside an hstore is probably not the best idea, and the same goes for any nested structure with complex data types. So I definitely see the trouble here. This kinda gives people a chance to shoot themselves in the foot.

Maybe same applies to any clever serialization and so on... I appreciate having the chance to 'discuss' this with you guys and think about this a bit further. Thanks again!