snowplow / iglu-scala-client

Scala client for Iglu schema registry
https://github.com/snowplow/iglu/wiki
Apache License 2.0
5 stars 13 forks source link

be able to resolve iglu schemas in validator #210

Open hamnis opened 2 years ago

hamnis commented 2 years ago

The underlying library support uri resolving for custom schemes. Attached is a possible patch that can enable this feature

circevalidator.patch

istreeter commented 1 year ago

This is a very interesting idea! If I understand it right, this means we can have schemas with properties like this:

"properties: {
  "xyz": { "$ref": "iglu:com.snowplowanalytics.iglu/anything-a/jsonschema/1-0-0" }
}

Until now, I have always discouraged people from using the $ref keyword in iglu schemas. And it was for three reasons:

  1. If the ref is a non-versioned url (e.g. http://example.com/path/to/schema) then it bypasses Iglu's strict versioning rules. It is a principle of Iglu that schemas are static once published, and that means we should not allow them to point to just any url.
  2. Snowplow's repo schema-ddl contains the code we use for converting from json schemas into warehouse table definitions. Currently that repo does not support the $ref keyword. It means if a schema uses $ref then for events in a Snowplow pipeline, the fields will not get loaded into the warehouse.
  3. When json-schema-validator looks up a $ref, it blocks the jvm thread. In your patch, this happens on the line where you call dispatcher.unsafeRunSync(...). In Snowplow apps, which are required to be high throughput, we have to be very careful about which threads get blocked and when.

Your patch addresses problem 1, and I think that is a huge improvement. It is a big step in the right direction.

But I think we need to properly address problems 2 and 3 before we can claim that Iglu fully supports this type of external lookup.


A couple of smaller implementation comments:

I don't think you need the ValidatorF trait. I think your new validator implementation could simply implement the old Validator trait.

If we add this feature, I would prefer to amend the old validators instead of adding a new validator. It would be much simpler if we can say that all iglu validators support external refs, instead of saying that some do and some don't depending on how you use it. This might need a breaking change to the code, so it would be a new major release of the library (version 3).

voropaevp commented 1 year ago

If iglu-client substitutes $ref for its contents when json body is returned, it would solve the point 2.

istreeter commented 1 year ago

@voropaevp that's a good idea. It would solve the point 3 too, because if we do it in the resolver then we have better control over how threads/fibers get used.

voropaevp commented 1 year ago

Implementation would need to be careful about resolution depth and possible loops.

hamnis commented 1 year ago

A way to fix the blocking issue is to change Sync[F].delay to Sync[F].blocking or Sync[F].interruptable if this is a cancelable operation.

stanch commented 1 year ago

Hi @hamnis, I like how this brings us closer to JSON Schema compatibility, although it would be interesting to understand the use cases behind this feature. That would help us think through if any changes are needed in our UI flows as well.

I assume you’d like to reuse the same “subschema” in a set of schemas? Would this be primarily for event schemas or for entity schemas? Could you give some examples? In particular, if the goal is to add the same “subschema” to multiple event schemas, it’s already possible just add that “subschema” as an entity/context to all such events. But I imagine there is something you prefer in the $ref approach?

hamnis commented 1 year ago

This would be reuse in other entity schemas.

A particular use case is that we have a concept called Plugs, each plug displays a Content item. There may be multiple plugs visible plugs on a page, which also May represent a Content item.

We would typically attach a set of plugs to an impression event.

For now we have embedded the Content definition within our Plug definition.

I can see other usecases, but this is what we currently have.

marcin-j commented 1 year ago

Hey!

Any updates on this?

Our use cases:

  1. Attach same set of reusable, required fields to events. I know we could use contexts, but we would really like to have those fields as non-optional and contexts are actually optional AFAIK.
  2. Be able to extend events with additional fields.