redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com
9.65k stars 589 forks source link

[CORE-6836] schema registry json internal references #23461

Closed andijcr closed 1 month ago

andijcr commented 1 month ago

Support for internal references for json schema compat checks

Implementation strategy:

prepoc phase:

unbundle schemas and save them in a map <$id value, <jsonpointer to schema, dialect>>

note: it’s possible to have nested bundled schemas, the code eagerly collects anything that looks like a bundled schema.

then for each schema (main or bundled), resolve all relative $ref to absolute, using the correct base uri.

this is done by manually traversing the tree in parse_json, looking for "$id" or "id", and "$ref".

The object is then modified in memory.

No error is thrown in this phase, as it is relevant only if a reference is accessed during the compat check phase

Compat check phase:

Hook in get_schema to iteratively resolve references and use the previous map to handle references into bundled schemas.

Error out if:

To safeguard against recursion, a counter in schema_context is decremented every time a ref is resolved, and this counter is shared with all children of is_superset, the chosen value is 20

get_object_or_empty make use of get_schema but it does not modify the counter for its siblings, need to think about this.

reference merging

$ref with siblings is transformed, during evaluation, to allOf schemas. This captures the semantics defined by the json schema standard and the implementation used by validators like jsoncons.

an example:

this schema

{ 
"$ref": "#/a_fragment",
"type": "number", 
"minimum": 10
}

will be converted to

{
  "allOf": [
    { //schema at #/a_fragment
    },
    { "type": "number", "minimum": 10 }
  ]
}

not implemented yet

Fixes https://redpandadata.atlassian.net/browse/CORE-6836

Backports Required

Release Notes

andijcr commented 1 month ago

force push: addressed comments, added test cases

vbotbuildovich commented 1 month ago

new failures in https://buildkite.com/redpanda/redpanda/builds/55458#01924419-c191-4958-bfe6-605084731cb1:

"rptest.tests.schema_registry_test.SchemaRegistryAutoAuthTest.test_normalize.dataset_type=JSON"
"rptest.tests.schema_registry_test.SchemaRegistryTest.test_normalize.dataset_type=JSON"

new failures in https://buildkite.com/redpanda/redpanda/builds/55458#0192441a-3d8e-4787-b67a-3d90e154d27c:

"rptest.tests.schema_registry_test.SchemaRegistryAutoAuthTest.test_normalize.dataset_type=JSON"
"rptest.tests.schema_registry_test.SchemaRegistryTest.test_normalize.dataset_type=JSON"
andijcr commented 1 month ago

force push: rebase on dev, switched from jsoncons::json to jsoncons::ojson to preserve insertion order of the json keys.

    the switch to jsoncons::json in parse_json means that the function always performed
    key sorting, making the "normalize" flag redundant

    jsoncons can work in insertion-order mode, to do so we switch to the
    type alias jsoncons::ojson.

    this is done to preserve the original order of the input
    see
    tests/rptest/tests/schema_registry_test.py::SchemaRegistryAutoAuthTest.test_normalize
    for an example where this can be observed externally from the schema
    registry API
andijcr commented 1 month ago

force push: removed brittle heuristic, fixed a warning for an error message, tried to appease clang-format

andijcr commented 1 month ago

https://github.com/redpanda-data/redpanda/pull/23461#issuecomment-2388065178 edit: issue was between keyboard and chair

vbotbuildovich commented 1 month ago

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55641#01924cdc-6216-4f22-bddd-a8d4fadad868

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55641#01924cf7-13c1-441c-85f1-9474004c556e

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55641#01924cf7-13bd-4fcc-b7ca-c41b5c03981e

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/55820#0192591e-d718-4be7-b33b-4e62a8ec48e3

andijcr commented 1 month ago

I have 2 commits here:

1. I didn't see tests for relative references so I added some

2. Recursion results in stack-overflow (this should be fixed)
  1. added them
  2. fix for this is to ensure that ref resolution is done only in is_superset. to do so i decoupled get_object_or_empty from get_schema. get_object_or_empty does not need the context because it's used for keyword like "properties" that can't be a reference (added a test for this), or for objects that gets passed directly to is_superset, so there is no need to resolve references