julianpeeters / avrohugger

Generate Scala case class definitions from Avro schemas
Apache License 2.0
201 stars 120 forks source link

Support Enums #7

Closed mariussoutier closed 9 years ago

mariussoutier commented 9 years ago

Hi,

could you please review this first enum draft: https://github.com/mariussoutier/avrohugger/tree/enums

I'm not sure if it's the most elegant way to integrate parsing enumerations into your library. I'm also not sure how to integrate with SpecificRecord (but haven't really tried yet).

Cheers

julianpeeters commented 9 years ago

Looks like we're good. Mind giving version 0.2.3 a try when you have a moment?

mariussoutier commented 9 years ago

Enums working? Yay!

mariussoutier commented 9 years ago

Thanks for cleaning up my code :) I'll check it out tomorrow!

mariussoutier commented 9 years ago

Works perfectly, thanks!

julianpeeters commented 9 years ago

Awesome! Nice teamwork.

Pretty cool work you're doing with twitter bijection, too.

How does scalding.avro get away with inferring/omitting the schema? On Jul 20, 2015 1:24 AM, "Marius Soutier" notifications@github.com wrote:

Works perfectly, thanks!

— Reply to this email directly or view it on GitHub https://github.com/julianpeeters/avrohugger/issues/7#issuecomment-122810066 .

mariussoutier commented 9 years ago

High five! :hand:

I'd like to make this compatible with Bijection, but it doesn't seem as if they'll accept my PR. Problem is only the required public SCHEMA$ field. Avro feels quite hacky to me, a lot of stuff must be present at runtime.

Scalding has been a while, IIRC in the tuple API the main point was just to read the schema and access the fields via 'symbol notation. When I used the typed API, I used generated Java classes, and that worked ok.

What's your experience with all this?

julianpeeters commented 9 years ago

I love the Avro spec (particularly the evolvable schemas, allowing refinement of unstructured data), but yes, the Java Avro implementation seems hacky to me too. I'm usually in favor improving existing software, but IMO the project has gotten too large, and the problems too deep to simply patch it by adding more code. In my limited experience, many good issues and questions are ignored on the mailing list, so I've resigned myself to dealing with Avro as is (until I have the time/funding to write a small, limited-but-solid Scala implementation - wanna help?).

On one hand it's surprising that Twitter is resistant to adding such a simple method, especially given the existing toBinary, and the json method, below it, indicates that the method's signature is not outlandish.

On the other hand, I can imagine those devs being very fond of a concise code base. And not only are these Scala Avro projects new and not battle-tested, I'll be the first to admit that the public SCHEMA$ issue is evidence that using a Scala class with Avro is a hack.

But the wins are nice, aren't they? The only hickup in using Scala case classes as records has been the public SCHEMA$ field issue that you ran into, and so for me the pros outweigh the cons.

I mention Scalding.Avro because during my case class tests I noticed that it was the only framework that didn't exhibit that public SCHEMA$ field issue, so I'm a little surprised that Bijection doesn't use the same trick already. Looks like scalding.avro uses the same idea as your solution to the Bijection issue.

Bummer that it reflects at runtime (is there a type-class-based approach???), and just assumes it can get the schema from a reliable source at runtime, but that's no different from the current toBinary.

And the only thing that the class-based constructor does with the Class is reflect it to get a schema from it; it's (currently) only a shortcut to the schema-base constructor.

In conclusion, I agree with Oscar that the two methods are too similar/overlapping, and that it would be better to change the current toBinary (to your newest implementation).

Cheers

mariussoutier commented 9 years ago

It seems that Scalding relies on Cascading's Avro capabilities instead of using Bijection. In the binary representation, aren't the first n bytes reserved for the schema anyway, so you don't have to rely on a SCHEMA$ field? Or that was just a convention in Camus.

I agree that Bijection-Avro doesn't seem to be battle-tested, but works for the default (Java) path but can be brought down easily. I'm just not sure who should approve or improve the PR if nobody over there has a lot of experience with Avro.

In general I think Avro is a nice idea, and it has better features than Thrift or ProtoBuf, but I remember spending a lot of time just getting the schema to work in Kafka and Scalding. There were sooo many runtime errors, e.g. union {null, string} was not the same as union {string, null} and stuff like that. Maybe Pickling will become a better solution? (Haven't looked much into it yet.)

Being able to generate Scala case classes was an absolute requirement this time, so I'm happy to have found avro-hugger :) Just ping me when you want to start on Scala-Avro, I'd be happy to have something better than the current mess.

Cheers!