long-term serialization format for commonly used structures

eigenvariable commented 9 years ago

It would be nice if there were a canonical, guaranteed-to-be-backwards-compatible way to serialize various algebird structures like hyperloglog, adaptive matrix, etc. Using thrift definitions--with maybe a little bit of scala wrapping--seems like a good way of doing things. Would it make sense to provide these as part of algebird and/or chill-algebird? My understanding is that the current implementation of the latter isn't meant for long-lived data.

Tweeps: I'm pretty sure that what I'm describing already exists in scalding-internal. I know that Tsar either depends on this or cargo cults it (unless a lot has changed in the last month, look in science/src/thrift/com/twitter/tsar/tsar.thrift).

ianoc commented 9 years ago

Tsar depends on an internal algebird thing which has the thrift structures you're mentioning. Having an algebird-thrift subproject that generates scrooge structures might be good. I don't think it really belongs in chill, since that's the Kryo-oriented lib.

eigenvariable commented 9 years ago

Re: not in chill: that makes sense. Since I don't remember very well what the internal thrift (or the corresponding injections) look like, I can't remember: would it make sense to just throw that internal library over the fence, or is there application-specific hardcoding going on?

vidma commented 8 years ago

Hey,

so do we have any conclusion regarding this?

I'd be very interested in a simple, reasonably fast, semi-long-term serializer for Algebird Monoids (we use just a few common data structures: count/size/sum, moments, RoaringBitmap and HLL; we'd like to write them to and read them from HBase, and rebuilding from scratch every few months is not a problem). I'd assume:

- Kryo is fast, but doesn't fit: it is not intended for long-term storage, as class registrations would get messed up with each code recompilation
  - is there an easy way to make it work?
- Java serialization should work, but it would be horribly slow; e.g. Apache Kylin uses hand-crafted serializers for their own version of Monoids that simply write to a java.nio.ByteBuffer

so what are the other options? what serializer do you Twitter guys use together with storehaus?
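For illustration, the hand-crafted `java.nio.ByteBuffer` approach mentioned above could look roughly like the sketch below. `CountSum`, `Version1`, `toBytes` and `fromBytes` are hypothetical names invented for this example (they are not Algebird or Kylin APIs), and the leading version byte is an assumption about how one might keep old data readable.

```scala
import java.nio.ByteBuffer

object HandRolledSerialization {
  // Hypothetical aggregate standing in for a simple count/sum monoid value.
  final case class CountSum(count: Long, sum: Double) {
    def plus(that: CountSum): CountSum = CountSum(count + that.count, sum + that.sum)
  }

  // Format version, written as the first byte so later layouts can be told apart.
  val Version1: Byte = 1

  def toBytes(v: CountSum): Array[Byte] = {
    val buf = ByteBuffer.allocate(1 + 8 + 8) // version + long + double
    buf.put(Version1).putLong(v.count).putDouble(v.sum)
    buf.array()
  }

  def fromBytes(bytes: Array[Byte]): Either[String, CountSum] =
    if (bytes.isEmpty) Left("empty input")
    else {
      val buf = ByteBuffer.wrap(bytes)
      buf.get() match {
        case Version1 if buf.remaining() == 16 =>
          Right(CountSum(buf.getLong(), buf.getDouble()))
        case Version1 =>
          Left(s"unexpected v1 payload size: ${bytes.length} bytes")
        case other =>
          Left(s"unknown format version: $other")
      }
    }
}
```

The layout does not depend on class names or registration order, so recompiling the code leaves previously written bytes readable.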

@avibryant @johnynek

johnynek commented 8 years ago

HLL has a method that is supported for this use case:

https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/HyperLogLog.scala#L125

At Twitter, we made Thrift versions of all these things and Bijections between the Thrift and native versions. A bit ugly, but we knew we were in control of the serialization then, and Thrift was a good interchange format.
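A minimal sketch of that pattern, using only the standard library: a native type, a flat "wire" type standing in for a Thrift-generated struct, and a conversion pair between them. All names here (`MinMax`, `MinMaxWire`, `toWire`, `fromWire`) are hypothetical; the internal Thrift definitions are not shown in this thread.

```scala
import scala.util.{Failure, Success, Try}

object WireConversionSketch {
  // Native in-memory type (hypothetical; not an actual Algebird class).
  final case class MinMax(min: Long, max: Long)

  // Stand-in for a Thrift-generated struct: flat, with optional fields.
  final case class MinMaxWire(min: Option[Long], max: Option[Long])

  // Native -> wire always succeeds.
  def toWire(m: MinMax): MinMaxWire = MinMaxWire(Some(m.min), Some(m.max))

  // Wire -> native can fail, since stored data may be incomplete or invalid.
  def fromWire(w: MinMaxWire): Try[MinMax] =
    (w.min, w.max) match {
      case (Some(lo), Some(hi)) if lo <= hi => Success(MinMax(lo, hi))
      case _ => Failure(new IllegalArgumentException(s"invalid wire value: $w"))
    }
}
```

Keeping the wire type separate means the native class can change freely as long as the two conversion functions are kept up to date.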

People have so many different serialization styles, I'm not sure there is an easy way to make everyone happy.

We could just make some typeclasses like: https://github.com/twitter/scalding/blob/develop/scalding-serialization/src/main/scala/com/twitter/scalding/serialization/Serialization.scala#L41

and include them here so as to keep dependencies light.
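As a rough idea of what such a dependency-free typeclass might look like, here is a sketch using only the standard library. The trait below is a simplified, hypothetical one written for this example; it is not a copy of the Scalding definition linked above, and `toBytes`/`fromBytes` are illustrative helpers.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream, InputStream, OutputStream}
import scala.util.Try

object SerializationSketch {
  // Hypothetical typeclass; no dependency beyond the standard library.
  trait Serialization[T] {
    def write(out: OutputStream, t: T): Try[Unit]
    def read(in: InputStream): Try[T]
  }

  object Serialization {
    implicit val longSerialization: Serialization[Long] = new Serialization[Long] {
      def write(out: OutputStream, t: Long): Try[Unit] =
        Try(new DataOutputStream(out).writeLong(t))
      def read(in: InputStream): Try[Long] =
        Try(new DataInputStream(in).readLong())
    }

    // Instances for pairs are derived from instances for the parts.
    implicit def pair[A, B](implicit a: Serialization[A], b: Serialization[B]): Serialization[(A, B)] =
      new Serialization[(A, B)] {
        def write(out: OutputStream, t: (A, B)): Try[Unit] =
          a.write(out, t._1).flatMap(_ => b.write(out, t._2))
        def read(in: InputStream): Try[(A, B)] =
          for { x <- a.read(in); y <- b.read(in) } yield (x, y)
      }
  }

  def toBytes[T](t: T)(implicit s: Serialization[T]): Try[Array[Byte]] = {
    val out = new ByteArrayOutputStream()
    s.write(out, t).map(_ => out.toByteArray)
  }

  def fromBytes[T](bytes: Array[Byte])(implicit s: Serialization[T]): Try[T] =
    s.read(new ByteArrayInputStream(bytes))
}
```

Users who prefer Thrift, Kryo or anything else could then supply their own instances without the core library depending on those formats.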

Not sure the right way to go. Thoughts?

/cc @ianoc

jnievelt commented 8 years ago

From an ease of use/development standpoint, I appreciate the value of the thrift we have for algebird. I don't think it would be too much trouble to push the definitions we have into algebird somewhere (algebird-thrift?).

From a performance standpoint, however, it leaves something to be desired. The reason relates to a conversation I had with Ian late last year, in that thrift is not a random-access storage format, and some of the objects (QTree and SketchMap in particular) have large object graphs (HLL being stored as one binary object already).

I quantified the impact of this recently, since we are considering applications where several of these objects get stored together. The methodology is something like:

- preprocessing: serialize a large number of objects together (50 in this case)
- materialize a single object from the group
- perform some combination (Semigroup#plus)
- re-serialize everything
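The steps above can be sketched as follows. This is one reading of the methodology, not the actual benchmark: `Counters`, the `ByteBuffer` layout and both `update*` functions are hypothetical stand-ins, and no timing is included. It contrasts a group whose members are kept as pre-serialized blobs with one where every member has to be materialized.

```scala
import java.nio.ByteBuffer

object GroupUpdateSketch {
  // Hypothetical stand-in for a sketch object with a non-trivial payload.
  type Counters = Vector[Long]

  def serialize(c: Counters): Array[Byte] = {
    val buf = ByteBuffer.allocate(4 + 8 * c.size)
    buf.putInt(c.size)
    c.foreach(x => buf.putLong(x))
    buf.array()
  }

  def deserialize(bytes: Array[Byte]): Counters = {
    val buf = ByteBuffer.wrap(bytes)
    Vector.fill(buf.getInt())(buf.getLong())
  }

  // Element-wise sum, playing the role of Semigroup#plus.
  def plus(a: Counters, b: Counters): Counters =
    a.zip(b).map { case (x, y) => x + y }

  // Members stay as opaque blobs; only member i is materialized and rewritten.
  def updateBinary(group: Vector[Array[Byte]], i: Int, delta: Counters): Vector[Array[Byte]] =
    group.updated(i, serialize(plus(deserialize(group(i)), delta)))

  // Every member is materialized and re-serialized, even though only one changes.
  def updateFull(group: Vector[Array[Byte]], i: Int, delta: Counters): Vector[Array[Byte]] = {
    val objects = group.map(deserialize)
    objects.updated(i, plus(objects(i), delta)).map(serialize)
  }
}
```

Both functions produce the same bytes; the difference is how much work is done on the members that were not touched.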

binaryMode indicates whether the thrift manifests as binary (i.e., pre-serialized) or as our internal thrift object. The lazy_binary codec is an internal deserializer that some of you will be familiar with. Some results:

QTree:

binaryMode clusterCount   codecName cutoff    ms
    binary           10      binary   0.99  3.70
    binary           10     compact   0.99  3.60
    binary           10 lazy_binary   0.99  3.57
non-binary           10      binary   0.99 10.09
non-binary           10     compact   0.99 12.51
non-binary           10 lazy_binary   0.99  5.16

SketchMap:

binaryMode clusterCount itemCount   codecName       us
    binary           10       500      binary   3478.3
    binary           10       500     compact   3601.7
    binary           10       500 lazy_binary   2975.4
non-binary           10       500      binary   6859.0
non-binary           10       500     compact  14941.1
non-binary           10       500 lazy_binary   3004.3

Of course, these effects are more pronounced as the object groups get larger.

johnynek commented 8 years ago

interesting data points. The compact slowing down is surprising. That lazy codec is pretty great. That @ianoc knows how to make things fast. :)

ianoc commented 8 years ago

I'm sympathetic to having a good serialization story out of the box, but I think it makes a lot of sense to leave it up to the user as the general stance. An algebird-thrift with bijections/injections as a path sounds good.

Most of my reluctance comes from a few things where stuff has been troublesome in the past:

1) Well-defined serialization formats/IDLs like thrift can be easily consumed in a multitude of places.
2) Backwards compatibility story (look at our injections and similar when used for serialization): it is pretty tough to reason about what it should be and to enforce it. If it costs perf, should you maintain indefinite backwards compatibility? That's tough to say, given it can have size/performance hits.
3) Testing: regression testing all formats, even for backwards compatibility alone, is an involved task. This is the biggest time sink of going this route, I think, but it would be critical if we ship more pre-rolled toBytes stuff. Even maintaining the toBytes stuff in HLL I've found a pain at times.
4) Migrations between formats for users: this was a pretty big PITA at Twitter around the heavy use in SB of injection-based or code-based serializers. Injection chains often perform woefully because they materialize too many things in memory. Unwinding those chains, though, proved to be untenable for a single job; making cross-cutting performance improvements for all scrooge users was much easier.
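To make the testing point concrete, a backwards-compatibility regression check usually means keeping bytes written by an old version and asserting that the current reader still decodes them. The sketch below assumes a made-up two-version format (`Stats`, version bytes 1 and 2); none of it comes from Algebird.

```scala
import java.nio.ByteBuffer

object GoldenBytesSketch {
  final case class Stats(count: Long, sum: Double)

  // Assumed history: v1 stored only the count; v2 (current) stores count and sum.
  def encode(s: Stats): Array[Byte] =
    ByteBuffer.allocate(17).put(2.toByte).putLong(s.count).putDouble(s.sum).array()

  def decode(bytes: Array[Byte]): Stats = {
    val buf = ByteBuffer.wrap(bytes)
    buf.get().toInt match {
      case 1 => Stats(buf.getLong(), 0.0) // old data: sum defaults to zero
      case 2 => Stats(buf.getLong(), buf.getDouble())
      case v => throw new IllegalArgumentException(s"unknown version $v")
    }
  }

  def fromHex(hex: String): Array[Byte] =
    hex.grouped(2).map(h => Integer.parseInt(h, 16).toByte).toArray

  // Bytes as the v1 writer would have produced them for count = 3,
  // checked in alongside the tests and never regenerated.
  val goldenV1: Array[Byte] = fromHex("010000000000000003")

  def check(): Unit = {
    assert(decode(goldenV1) == Stats(3L, 0.0)) // old bytes still readable
    val s = Stats(7L, 1.5)
    assert(decode(encode(s)) == s)             // current format round-trips
  }
}
```

One such fixture is needed per structure and per historical format, which is where the maintenance cost described above comes from.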

Not totally sold that algebird per se wants to be in the serialization business, as annoying as that might be for users.

sritchie commented 7 years ago

Closing in favor of #119 - I'm going to start on a plan for this. Will report back on that ticket.

twitter / algebird

long-term serialization format for commonly used structures #463