snowplow-archive / schema-guru

JSONs -> JSON Schema
http://snowplowanalytics.com

Ability to merge json models into final model #139

Open a1nayak opened 8 years ago

a1nayak commented 8 years ago

Currently the CLI allows a merged model to be created from existing JSON files in a directory structure. For a use case where we want a final JSON model on a daily basis, based on all the JSON files received that day, the volume of files we would need to save in the directory structure to achieve the merged model is very large and isn't feasible. We get millions per day.

If there were an ability to merge models, then we could implement this use case as iterative model creation based on every message received - which becomes the final model at end of day. For example, the steps would be:

  1. For message 1, create model1
  2. For message 2, create model2
  3. Merge model1 + model2 as model1

Repeat the above for every new message that comes in - so at end of day, the last model created by repeating these steps is the final model (see the sketch below).
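To make that loop concrete, here is a minimal sketch of what it could look like, assuming a hypothetical `deriveSchema`/`mergeSchemas` pair - neither is currently exposed by Schema Guru, which is exactly what this issue is asking for:

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

object IterativeMerge {
  // Hypothetical helpers (not part of Schema Guru's public API today):
  // deriveSchema turns one JSON instance into a schema, mergeSchemas
  // combines two schemas into one that covers both sets of instances.
  def deriveSchema(instance: JValue): JValue = ???
  def mergeSchemas(left: JValue, right: JValue): JValue = ???

  /** Fold a day's stream of raw JSON messages into a single running schema,
    * keeping only one model in memory at any time. */
  def runningSchema(messages: Iterator[String]): Option[JValue] =
    messages
      .map(msg => deriveSchema(parse(msg)))
      .reduceOption(mergeSchemas)
}
```
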
alexanderdean commented 8 years ago

Hey @a1nayak - first off, if you are needing to derive schemas (I think you call these models) from millions of JSON instances, then you should check out the Spark-based runtime for Schema Guru rather than the CLI one.

On the wider point - allowing merge of models - this is an interesting idea. Is this because you want to evolve a fixed set of models on a daily basis, i.e. schema_d_tuesday should be the addition of schema_d_monday and all_instance_d_tuesday[]? If you could share your use case in a bit more detail, that would be helpful.

a1nayak commented 8 years ago

Thanks @alexanderdean for the response. We have a kafka-storm-druid pipeline for analytics purposes, in which we capture data from multiple sources. Sometimes when the JSON data is sent, some attributes are missing or some may be added.

Since we are already capturing all messages in the pipeline, we wanted to extend this to catch schema drift - the diff between the schema model for a particular source today vs the same source yesterday.

To capture schema drift, we would merge all the schema models generated for a source on a given day and save the result to the filesystem. We would do the same the next day, and then use jsondiff to highlight what got added or changed. This is a very important pointer for us, and probably for your team too, when large data sets are involved and there is a need to know how the drift is occurring.
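For illustration, a minimal sketch of the day-over-day comparison, assuming the merged daily models are JSON Schemas with a top-level `properties` object (a real diff tool such as jsondiff would give a much richer report):

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

object SchemaDrift {
  /** Top-level property names declared under a schema's "properties" key. */
  def propertyNames(schema: JValue): Set[String] =
    (schema \ "properties") match {
      case JObject(fields) => fields.map(_._1).toSet
      case _               => Set.empty
    }

  /** Properties that appeared or disappeared between two daily schemas. */
  def drift(yesterdaySchema: String, todaySchema: String): (Set[String], Set[String]) = {
    val before = propertyNames(parse(yesterdaySchema))
    val after  = propertyNames(parse(todaySchema))
    (after -- before, before -- after) // (added, removed)
  }
}
```
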

Merging schema models across all days is also achievable with the same principles. But our present use case was to capture the schema drift, and eventually the data drift, as it occurs on a per-day basis.

alexanderdean commented 8 years ago

Hi @a1nayak - thanks for sharing that detail, it's super-interesting.

We don't have a direct problem with schema drift at Snowplow because we use self-describing JSONs which have to pass validation against their schema to progress down the pipeline - there's no concept of schemas changing without the business making the change. However, I agree that your approach would be nice for dynamic third-party sources (like some of our webhooks) where we can't control the end schemas and vendors can change them unilaterally (despite our efforts to record them faithfully in Iglu Central).

For your specific use case - can't you just use Secor or similar to get all JSONs from Kafka to S3, then run Schema Guru Spark on each day's data? In other words, I think the existing Schema Guru functionality should suffice?

a1nayak commented 8 years ago

@alexanderdean That's exactly what I started doing - I created a file directory structure per hour per source and then ran Schema Guru on the base directory. But the volume is very high here, as we are doing large-scale processing, taking events from multiple sources (which we can't control). Each source generates close to 10 million files, which we wouldn't be able to keep in a filesystem to run at end of day.

So the alternative was to process them in memory - Schema Guru acts on each message, but the schema model generated is kept in memory as model1; the next message comes in and becomes model2, and then model1 gets replaced with model1 + model2.

The above describes the single-threaded case. But it also works for multi-threading, where multiple spouts receive messages and generate their own models, which are then streamed to a single bolt that merges all the models into one and repeats for the next set, as in the single-threaded case.
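For what it's worth, that final-bolt step is just a reduce over the partial models, using the same hypothetical `mergeSchemas` as in the earlier sketch - the merge being associative is what makes it safe to combine the workers' partial results in any order:

```scala
import org.json4s.JValue

object ParallelMerge {
  // Hypothetical merge, as in the earlier sketch; not in Schema Guru today.
  def mergeSchemas(left: JValue, right: JValue): JValue = ???

  /** The single downstream bolt combines each spout's partial schema
    * into the final daily model. */
  def finalSchema(partials: Seq[JValue]): Option[JValue] =
    partials.reduceOption(mergeSchemas)
}
```
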

Given the amount of data we run, we would need 10-20 TB of storage even for a single day.

alexanderdean commented 8 years ago

Hey @a1nayak - that makes sense. I don't see why we can't expose a method in Schema Guru to merge one schema with another instance (or another schema) to produce schema'. PR welcome!
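A rough sketch of how the exposed method could look (purely hypothetical names - the actual signature would be up to whoever picks up the PR):

```scala
import org.json4s.JValue

// Hypothetical shape of the proposed addition: merge an existing schema with
// either a new raw instance or another schema, producing the updated schema'.
trait SchemaMerge {
  def mergeWithInstance(schema: JValue, instance: JValue): JValue
  def mergeWithSchema(schema: JValue, other: JValue): JValue
}
```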