nathanmarz / cascalog

Data processing on Hadoop without the hassle.

Schema Integration! #232

Open sritchie opened 10 years ago

sritchie commented 10 years ago

I'd love to see someone take on integration of Prismatic's Schema library with Cascalog. The ability to write schemafied operations, and have the Cascalog compiler validate schemas before submitting jobs, could avoid the runtime errors that are one of the only downsides to Cascalog :)

maxrzepka commented 10 years ago

I love your idea. Can you develop it, or just share the use case you have in mind?

I thought using Thrift or any other serializer already helps to enforce the schema.

Is Schema an alternative, or can they play nicely together?

This seems to be in the same spirit as core.typed and Schema: "core.typed has accurate compile time checking, and Schema gives an expressive contracts interface for runtime checking." (from https://news.ycombinator.com/item?id=6339607)

Thanks,

sritchie commented 10 years ago

Yeah, for sure.

Thrift is nice for enforcing a schema when you write to disk, but that safety only kicks in once your job is running on the cluster. If you try to populate Thrift objects with items of the wrong type, you'll get runtime exceptions after job submission. This is painful, and a big waste of time.

If Cascalog query definitions could use schema to check the input and output types of predicates, then Schema's "runtime" guarantee would prevent badly typed jobs from being submitted. So the runtime here is really a second compile time. This would play really nicely with Thrift.
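To make the "second compile time" idea concrete, here is a rough sketch. It is written in Python rather than Clojure and uses neither the Cascalog nor the Schema API; `Op`, `check_query`, and `submit` are hypothetical names invented for illustration. It assumes each operation declares the types of the fields it consumes and emits, so a query can be checked by comparing declarations alone, before anything runs on a cluster.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Op:
    """Hypothetical pipeline step annotated with its input and output field types."""
    name: str
    in_types: Tuple[type, ...]
    out_types: Tuple[type, ...]
    fn: Callable


def _names(types: Sequence[type]) -> str:
    return ", ".join(t.__name__ for t in types)


def check_query(source_types: Sequence[type], ops: Sequence[Op]) -> List[str]:
    """Walk the declared schemas and collect mismatches.

    No operation is executed; only the annotations are compared.
    """
    errors = []
    current = tuple(source_types)
    for op in ops:
        if current != op.in_types:
            errors.append(
                f"{op.name}: expected ({_names(op.in_types)}), got ({_names(current)})"
            )
        # Continue from the declared output so later steps are still checked.
        current = op.out_types
    return errors


def submit(source_types: Sequence[type], ops: Sequence[Op]) -> str:
    """Refuse to 'submit' a badly typed query (stand-in for real job submission)."""
    errors = check_query(source_types, ops)
    if errors:
        raise TypeError("; ".join(errors))
    return "submitted"
```

With steps such as `Op("parse", (str,), (int,), int)` followed by `Op("shout", (str,), (str,), str.upper)`, `submit((str,), [...])` raises a `TypeError` naming `shout` locally, instead of failing after the job reaches the cluster. A real integration would presumably use Schema's own validators rather than plain type equality.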

maxrzepka commented 10 years ago

Thanks for the explanation. Sounds really exciting... I hope to find some time to investigate it.