Closed: eigenvariable closed this issue 7 years ago
Tsar depends on an internal algebird thing which has the thrift structures you're mentioning. Having an algebird-thrift sub-project that generates scrooge structures might be good. It doesn't really belong in chill, I don't think, since that's the Kryo-oriented lib.
Re: not in chill: that makes sense. Since I don't remember very well what the internal thrift (or the corresponding injections) look like, I can't remember: would it make sense to just throw that internal library over the fence, or is there application-specific hardcoding going on?
Hey,
so do we have any conclusion regarding this? I'd be very interested in a simple, rather fast, semi-long-term serializer for Algebird Monoids (we use just a few common data structures: count/size/sum, moments, RoaringBitmap & HLL), and we'd like to write/read them into HBase; rebuilding from scratch every few months is not a problem. I'd assume:
Kryo is fast, but doesn't fit: it's not intended for long-term storage, as class registrations would be messed up with each code recompilation. Java Serializer should work, but would be horribly slow; Apache Kylin, for example, uses hand-crafted serializers for its own version of Monoids, simply writing to java.nio.ByteBuffer.
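For concreteness, the hand-crafted ByteBuffer approach might look something like the sketch below. This is in the spirit of the Kylin-style serializers mentioned above, not Kylin's actual code; `CountSizeSum` is a hypothetical aggregate, not an algebird type.

```scala
import java.nio.ByteBuffer

// Hypothetical aggregate holding count/size/sum (not an algebird type).
case class CountSizeSum(count: Long, size: Long, sum: Double)

// A hand-rolled fixed-width serializer: two longs plus one double.
object CountSizeSumSerializer {
  val RecordBytes: Int = 8 + 8 + 8 // count + size + sum

  def write(a: CountSizeSum): Array[Byte] = {
    val buf = ByteBuffer.allocate(RecordBytes)
    buf.putLong(a.count).putLong(a.size).putDouble(a.sum)
    buf.array()
  }

  def read(bytes: Array[Byte]): CountSizeSum = {
    val buf = ByteBuffer.wrap(bytes)
    CountSizeSum(buf.getLong(), buf.getLong(), buf.getDouble())
  }
}
```

The appeal is speed and a byte layout you fully control (so it stays readable across recompiles); the cost is that every schema change becomes your migration problem.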
So what are the other options? What serializer do you Twitter guys use together with storehaus?
@avibryant @johnynek
HLL has a method that is supported for this use case:
At Twitter, we made thrift versions of all these things and Bijections between the thrift and native versions. A bit ugly, but we knew we were in control of the serialization then, and thrift was a good interchange format.
People have so many different serialization styles, I'm not sure there is an easy way to make everyone happy.
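The thrift-plus-Bijection pattern described above might be sketched roughly as follows. This uses a simplified stand-in for com.twitter.bijection's `Injection` (the real one's `invert` returns a `Try`, not an `Option`), and `ThriftCountSum` here is a plain case class standing in for a scrooge-generated thrift struct.

```scala
// Simplified stand-in for com.twitter.bijection.Injection
// (the real trait's invert returns scala.util.Try).
trait Injection[A, B] {
  def apply(a: A): B
  def invert(b: B): Option[A]
}

// Hypothetical native aggregate.
case class CountSum(count: Long, sum: Double)

// Stand-in for a scrooge/thrift-generated struct; in practice this is
// what actually gets serialized with a thrift protocol.
case class ThriftCountSum(count: Long, sum: Double)

// The bijection/injection between native and thrift representations.
object CountSumInjection extends Injection[CountSum, ThriftCountSum] {
  def apply(a: CountSum): ThriftCountSum = ThriftCountSum(a.count, a.sum)
  def invert(b: ThriftCountSum): Option[CountSum] = Some(CountSum(b.count, b.sum))
}
```

The native type stays the single source of truth for algebra, while the thrift struct owns the wire format and its compatibility story.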
We could just make some typeclasses like: https://github.com/twitter/scalding/blob/develop/scalding-serialization/src/main/scala/com/twitter/scalding/serialization/Serialization.scala#L41
and include them here so as to keep dependencies light.
Not sure the right way to go. Thoughts?
/cc @ianoc
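A trimmed-down version of the typeclass idea linked above might look like the sketch below. This is a hedged simplification: scalding's actual `Serialization` trait also covers hashing, equality, and size hints, which are omitted here.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream,
  DataInputStream, DataOutputStream, InputStream, OutputStream}
import scala.util.Try

// Minimal typeclass in the spirit of scalding-serialization's
// Serialization trait (which additionally has hashing/equality/size hints).
trait Serialization[T] {
  def write(out: OutputStream, t: T): Try[Unit]
  def read(in: InputStream): Try[T]
}

// Example instance for Long.
object LongSerialization extends Serialization[Long] {
  def write(out: OutputStream, t: Long): Try[Unit] =
    Try(new DataOutputStream(out).writeLong(t))
  def read(in: InputStream): Try[Long] =
    Try(new DataInputStream(in).readLong())
}

object SerializationDemo {
  // Round-trip a value through an in-memory byte stream.
  def roundTripLong(t: Long): Long = {
    val bos = new ByteArrayOutputStream()
    LongSerialization.write(bos, t).get
    LongSerialization.read(new ByteArrayInputStream(bos.toByteArray)).get
  }
}
```

Shipping only the typeclass keeps algebird's dependencies light and leaves the concrete wire format (thrift, hand-rolled, whatever) up to the instance the user supplies.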
From an ease of use/development standpoint, I appreciate the value of the thrift we have for algebird. I don't think it would be too much trouble to push the definitions we have into algebird somewhere (algebird-thrift?).
From a performance standpoint, however, it leaves something to be desired. The reason, which came up in a conversation I had with Ian late last year, is that thrift is not a random-access storage format, and some of the objects (QTree and SketchMap in particular) have large object graphs (HLL is already stored as one binary object).
I quantified the impact of this recently, since we are considering applications where several of these objects get stored together. The methodology is something like: binaryMode indicates whether the thrift manifests as binary (i.e., pre-serialized) or as our internal thrift object. The lazy_binary codec is an internal deserializer that some of you will be familiar with. Some results:
QTree:
binaryMode  clusterCount  codecName    cutoff  time (ms)
binary      10            binary       0.99     3.70
binary      10            compact      0.99     3.60
binary      10            lazy_binary  0.99     3.57
non-binary  10            binary       0.99    10.09
non-binary  10            compact      0.99    12.51
non-binary  10            lazy_binary  0.99     5.16
SketchMap:
binaryMode  clusterCount  itemCount  codecName    time (us)
binary      10            500        binary        3478.3
binary      10            500        compact       3601.7
binary      10            500        lazy_binary   2975.4
non-binary  10            500        binary        6859.0
non-binary  10            500        compact      14941.1
non-binary  10            500        lazy_binary   3004.3
Of course, these effects are more pronounced as the object groups get larger.
Interesting data points. The compact slowdown is surprising. That lazy codec is pretty great. That @ianoc knows how to make things fast. :)
I'm sympathetic to having a good serialization story out of the box, but I think it makes a lot of sense to leave it up to the user as the general stance. An algebird-thrift with bijections/injections as a path sounds good.
Most of my reluctance comes from a few areas that have been troublesome in the past:
1) Well-defined serialization formats/IDLs like thrift can be easily consumed in a multitude of places.
2) The backwards-compatibility story (look at our injections and similar when used for serialization) is pretty tough: it's hard to reason about what it should be, or to enforce it. If it costs perf, should you maintain indefinite backwards compatibility? That's tough to say, given it can have size/performance hits.
3) Testing: regression-testing all formats, even just backwards compatibility, is an involved task. This is the biggest time sink of going this route, I think, but it would be critical if we ship more pre-rolled toBytes stuff. Even maintaining the toBytes stuff in HLL I've found a pain at times.
4) Migrations between formats for users: this was a pretty big PITA at Twitter around the heavy use in SB of injection-based or code-based serializers. Injection chains tend to be woefully poor, often materializing too many things in memory. Unwinding those chains, though, proved to be untenable job by job -- making cross-cutting performance improvements for all scrooge users was much easier.
Not totally sold that algebird per se wants to be in the serialization business, as annoying as that might be for users.
Closing in favor of #119 - I'm going to start on a plan for this. Will report back on that ticket.
It would be nice if there were a canonical, guaranteed-to-be-backwards-compatible way to serialize various algebird structures like HyperLogLog, adaptive matrix, etc. Using thrift definitions -- with maybe a little bit of Scala wrapping -- seems like a good way of doing things. Would it make sense to provide these as part of algebird and/or chill-algebird? My understanding is that the current implementation of the latter isn't meant for long-lived data.
Tweeps: I'm pretty sure that what I'm describing already exists in scalding-internal. I know that Tsar either depends on this or cargo cults it (unless a lot has changed in the last month, look in science/src/thrift/com/twitter/tsar/tsar.thrift).