snowplow / iglu

Iglu is a machine-readable, open-source schema repository for JSON Schema from the team at Snowplow
http://www.snowplow.io
Apache License 2.0

Placeholder for Maven repo inside Iglu Server #88

Open alexanderdean opened 8 years ago

alexanderdean commented 8 years ago

To host POJOs, Scala case classes, Clojure Schemas auto-generated from JSONs, Thrifts, Avros etc

chuwy commented 8 years ago

Initial draft

Not exactly a Maven repo, but rather "Creating Scala classes from JSON Schemas"; still, it was referenced from the internal issue tracker.

A few projects to explore:

To me, SBT Datatype looks the most promising (especially taking its origin into account). I think I will explore what we can do with it. The other libraries are listed here just for some technical details.

Definition generation

SBT Datatype uses a format other than JSON Schema, but a fairly straightforward one. We can generate it the same way as we do with Redshift DDL. For example, given the following JSON Schema (for iglu:com.acme/example/jsonschema/1-0-0):

{
  "type": "object",
  "properties": {
    "firstLevel": {
      "type": "integer"
    },
    "nested": {
      "type": "object",
      "properties": {
        "nestedInt": {
          "type": "integer"
        }
      },
      "additionalProperties": false
    }
  },
  "additionalProperties": false
}

We can generate the following SBT Datatype definition:

{
  "types": [
    {
      "name": "Example$Nested",
      "type": "record",
      "target": "Scala",
      "fields": [
        {
          "name": "nestedInt",
          "type": "Int"
        }
      ]
    },
    {
      "name": "Example",
      "type": "record",
      "target": "Scala",
      "fields": [
        {
          "name": "firstLevel",
          "type": "Int"
        },
        {
          "name": "nested",
          "type": "Example$Nested"
        }
      ]
    }
  ]
}

Adding $ to the class name should help us avoid namespace collisions (nested can be defined on many objects). The definition above should in the end generate case class-like classes (with toString, hashCode, companion object etc., but without unapply):

final class Example$Nested(val nestedInt: Int) extends Serializable
final class Example(val firstLevel: Int, val nested: Example$Nested) extends Serializable

So we could write the following in a type-safe way:

example.nested.nestedInt 

We still need to make many decisions about the dynamic-JSON-to-static-Scala correspondence, but some simple cases should work.
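The flattening described above (nested objects become separate records whose names are prefixed with the parent name and `$`) can be sketched as follows. This is only an illustration under simplifying assumptions: `Schema`, `IntegerSchema`, `ObjectSchema`, `Record` and `Flatten` are hypothetical names, the model covers just integers and nested objects, and the output is a plain Scala value rather than the actual SBT Datatype JSON.

```scala
// Hypothetical stand-in for the small subset of JSON Schema used in the example
sealed trait Schema
case object IntegerSchema extends Schema
final case class ObjectSchema(properties: List[(String, Schema)]) extends Schema

// One record definition: a name plus (field name, field type) pairs
final case class Record(name: String, fields: List[(String, String)])

object Flatten {
  // Returns the records for `schema` and all nested objects, nested ones first
  def records(name: String, schema: ObjectSchema): List[Record] = {
    val (nested, fields) = schema.properties.map {
      case (field, IntegerSchema) =>
        (Nil, field -> "Int")
      case (field, obj: ObjectSchema) =>
        // Prefix with the parent name so "nested" on another object cannot collide
        val childName = name + "$" + field.capitalize
        (records(childName, obj), field -> childName)
    }.unzip
    nested.flatten :+ Record(name, fields)
  }
}

object FlattenDemo {
  // Mirrors iglu:com.acme/example/jsonschema/1-0-0 from the example
  val example: ObjectSchema = ObjectSchema(List(
    "firstLevel" -> IntegerSchema,
    "nested" -> ObjectSchema(List("nestedInt" -> IntegerSchema))
  ))

  val result: List[Record] = Flatten.records("Example", example)
}
```

Here `FlattenDemo.result` contains two records, `Example$Nested` followed by `Example`, in the same order as the definition above. Required versus optional properties, arrays, enums and so on are exactly the open decisions mentioned earlier and are deliberately left out.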

Iglu integration

So, assuming the above works, we need to:

  1. Include the registry in an sbt project
  2. Reference generated classes in code in some Iglu-compatible way (optional)
  3. Parse plain JSON into generated classes

I'm trying to design it assuming as few non-existent features in SBT Datatype as I can. So I'm going to mark everything we cannot do with it (we can fork or send a PR of course, but I'm not sure they're going to include anything Iglu-specific).

Including into project

Let's assume we want to create a SqlQueryEnrichmentConfig class from Iglu Central in Scala Common Enrich.

a. Enable the SBT Datatype plugin in SCE's plugins.sbt
b. Run igluctl against the Iglu Central JSON Schemas to generate SBT definitions in an sbt-datatype directory (along with schemas, ddl etc.) in the Iglu registry
c. Release the Iglu registry to a Maven repository
d. Include it as a dependency: "com.snowplow" %% "iglu-central" % "58", so it can be an embedded registry (assuming it is OK to publish projects containing only resources)
e. Set datatypeSource in generateDatatypes := file("resources/sbt-datatype"). Not sure if SBT Datatype can do it for third-party projects.

For now I'm really unsure only about the last one; everything else should work at this step.

Reference and parse JSONs in Iglu-compatible way

It is a bit trickier. It's definitely possible to create some macro flavor to access it using an Iglu URI as a string (but it looks like an overcomplicated way):

import com.snowplowanalytics.iglu.registry // macro for access to class by string and containing serializers for parse

val json: JValue = ???

// otherwise it can be shapeless-like problem, when type name twice as long as its value
type SqlQueryEnrichmentConfig = 
  registry
  .schema("iglu:com.snowplowanalytics.snowplow.enrichments/sql_query_enrichment_config/jsonschema/1-0-0")
  .OutType

// it won't compile if corresponding URI hasn't been found
val enrichment: Either[String, SqlQueryEnrichmentConfig] = 
  registry
  .schema("iglu:com.snowplowanalytics.snowplow.enrichments/sql_query_enrichment_config/jsonschema/1-0-0") // jsonschema? Or sbt-datatype?
  .parse(json)

/cc @alexanderdean

chuwy commented 8 years ago

But actually, the original idea, the other way round (Maven inside Iglu, not Iglu on Maven), has its own clear benefits.

alexanderdean commented 8 years ago

Very interesting approach @chuwy ! Looking forward to mulling it some more...

alexanderdean commented 8 years ago

Having thought about it some more: while the idea of making e.g. Iglu Central embeddable inside an app is interesting, one of the flaws is that it depends on a versioning scheme for a registry which doesn't really exist: "com.snowplow" %% "iglu-central" % "58". The R58 there has no semantic meaning - it's just an artifact of the fact that we are using git with formal "releases" to back Iglu Central. The versioning inside an Iglu registry is all at the schema level, so really a developer would want to pull in a dependency like:

"com.mandrill" %% "message_opened_1" % "0.0"

where this corresponds to com.mandrill/message_opened/jsonschema/1-0-0.

alexanderdean commented 8 years ago

This is the closest we have to this currently:

https://github.com/snowplow/snowplow/blob/master/2-collectors/scala-stream-collector/src/main/scala/com.snowplowanalytics.snowplow.collectors/scalastream/sinks/AbstractSink.scala#L34

chuwy commented 8 years ago

A releasable registry with explicitly defined milestones is more or less proven against the current patching approach.

We can use it without milestones only if we're going to abandon patching once Open Versioning is embraced. I cannot see whether Open Versioning can really help us with patching.

chuwy commented 8 years ago

To elaborate:

case class Example(foo: Integer)

After patching, it can easily become:

case class Example(foo: Option[Integer])

This is a huge problem for both releasable and unreleasable registries (as it is binary and source incompatible), but with an explicit milestone we can at least see what exactly the schema looked like at some milestone.
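To make the source incompatibility concrete, here is a small sketch. The `BeforePatch`, `AfterPatch` and `Consumer` objects are hypothetical and exist only so both versions of the class fit in one file; `Int` is used in place of `Integer` for brevity.

```scala
object BeforePatch {
  // Class generated from the schema as originally published
  final case class Example(foo: Int)
}

object AfterPatch {
  // Class generated from the same schema version after a patch made `foo` optional
  final case class Example(foo: Option[Int])
}

object Consumer {
  // Compiles against the original class
  def doubled(e: BeforePatch.Example): Int = e.foo * 2

  // The same body does not compile against the patched class:
  //   def doubled(e: AfterPatch.Example): Int = e.foo * 2   // Option[Int] has no `*`
  // so every use of `foo` has to be rewritten to handle the missing value
  def doubledPatched(e: AfterPatch.Example): Option[Int] = e.foo.map(_ * 2)
}
```

Code already compiled against the original class would also fail at link time, because the constructor and accessor signatures change, which is the binary side of the same problem.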

alexanderdean commented 8 years ago

The problem is that releasable registries are just a convention, not an intrinsic part of Iglu - a GitHub tag is not first class in any way in an Iglu registry. Even if it were, it's a very clunky level of indirection - "I want to reference Mandrill schema blah in my app, which GitHub tag do I need to cite to get that?"

Schema patching is easily handled like this:

"com.mandrill" %% "message_opened_1" % "0.0.4"

where this corresponds to com.mandrill/message_opened/jsonschema/1-0-0, 4th patch of the schema.
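The mapping from a schema path to a dependency can be sketched like this. The rule is inferred only from the two examples in this thread (the model goes into the artifact name, revision and addition form the version, and an optional patch number is appended), so treat it as an assumption; `Coordinates` and `IgluToMaven` are hypothetical names, and the schema format (jsonschema) is simply dropped here.

```scala
// Hypothetical holder for the three parts of an sbt dependency
final case class Coordinates(organization: String, name: String, version: String)

object IgluToMaven {
  // vendor/name/format/MODEL-REVISION-ADDITION
  private val SchemaPath = """([^/]+)/([^/]+)/([^/]+)/(\d+)-(\d+)-(\d+)""".r

  def coordinates(path: String, patch: Option[Int] = None): Either[String, Coordinates] =
    path match {
      case SchemaPath(vendor, name, _, model, revision, addition) =>
        val base = s"$revision.$addition"
        // Append the patch number only when one is known
        val version = patch.fold(base)(p => s"$base.$p")
        Right(Coordinates(vendor, s"${name}_$model", version))
      case _ =>
        Left(s"Not a valid schema path: $path")
    }
}
```

With this sketch, `IgluToMaven.coordinates("com.mandrill/message_opened/jsonschema/1-0-0")` gives the coordinates for `"com.mandrill" %% "message_opened_1" % "0.0"`, and passing `Some(4)` as the patch gives version `"0.0.4"`.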

chuwy commented 8 years ago

The patch approach looks good to me. Not that I really like the idea that a minor version can introduce source/binary incompatibilities, but for now it is probably the best we have.

chuwy commented 8 years ago

And where is the patch defined? If we're going to do it manually, we'll need some sort of release as well?

alexanderdean commented 8 years ago

It feels like we are going to have to make patches first class inside an Iglu registry - i.e. for a given schema you can see which patch release it is currently...

chuwy commented 8 years ago

Yep, it feels like that was going to happen anyway. These patches can sometimes be too important to just drop this information.