Implement a better abstraction layer for efficiently handling annotations

jviotti commented 2 months ago

In JSON Schema, annotations are JSON values associated to a given instance location (a JSON Pointer) and evaluation path (a JSON Pointer). A container to hold annotations, in theory could be something like this:

Map<Pair<Pointer, Pointer>, Set<JSON>> annotations;

Note that there can be multiple JSON values associated with a given instance location and evaluation path, but duplicates are not taken into account.

Write Operations

Insertion: You need to be able to insert a JSON document into a given instance location and evaluation path. The challenge here is to avoid copying the given pointers and the JSON value.

insert(const Pointer &instance_location, const Pointer &evaluate_path, const JSON &value) -> void;

Removal: You need to be able to remove all annotations produced at a given evaluation path or its children, independently of its instance location. For example, if you want to remove annotations at evaluate path /properties/not, you also remove any annotation produced at evaluate path /properties/not/foo and /properties/not/foo/bar/baz.

remove(const Pointer &evaluate_path) -> void;

Read Operations

Adjacent containment: You need to be able to check if a given set of keywords (OR) emitted the given annotation value for a given instance location, sibling to the given evaluation path.

defines_adjacent(const Pointer &instance_location, const Pointer &evaluate_path, const Set<String> &keywords, const JSON &value) -> bool;

For example, you might be looking for the annotation value "foo" at instance location /myObject, from keywords bar or baz, sibling to evaluation path /one/two/three. That means the following combinations are valid:

/myObject / /one/two/bar / "foo"
/myObject / /one/two/baz / "foo"

General containment: You need to be able to check if a given set of keywords (OR) emitted the given annotation value for a given instance location, sibling to the given evaluation path or on any of its children

defines(const Pointer &instance_location, const Pointer &evaluate_path, const Set<String> &keywords, const JSON &value) -> bool;

For example, you might be looking for the annotation value "foo" at instance location /myObject, from keywords bar or baz, sibling to evaluation path /one/two/three. That means all the following combinations are valid:

/myObject / /one/two/bar / "foo"
/myObject / /one/two/baz / "foo"
/myObject / /one/two/whatever/a/b/c/bar / "foo"

Real Example

Consider the following schema. If the instance is an object, if it defines the property foo, the latter must be a string. If it defines any other property that is not foo, that property must be an integer.

{
  "properties": {
    "foo": { "type": "string" }
  },
  "additionalProperties": {
    "type": "integer"
  }
}

Now consider the following instance:

{ "foo": "bar", "baz": 1 }

Relying on annotations, the evaluation goes as follows:

First, evaluate properties. The foo property is defined and it validates against the given subschema
Emit the annotation value "" (instance location, the root of the document) / "/properties" (evaluation path) / "foo" (value)
Then, evaluate additionalProperties
This keyword will check any property in the instance that is not defined as an annotation at "" (instance location) / "/properties" OR "" (instance location) / "/patternProperties"
For the ones that don't match any, validate

jviotti commented 2 months ago

@michaelmior I just added a real example based on properties and additionalProperties, but yes:

Instance Location: The Pointer to the part of the instance being evaluated
Evaluation Path: The Pointer to the schema that shows the "call stack" of keywords evaluated

On evaluation path, you might start evaluating /properties, which then takes you to /properties/foo/type, which then takes you to /properties/bar/type, etc. It is where you are in the evaluation process with regards to the schema.

jviotti commented 2 months ago

In the current evaluator implementation, we directly keep the instance location and evaluation paths as pointers that we push to and pop from when executing every individual instruction:

https://github.com/sourcemeta/jsontoolkit/blob/ddc6917ba4af30d18bf64aadb6c25eb6bca32190/src/jsonschema/compile_evaluate.cc#L251-L252

michaelmior commented 2 months ago

Perfect! That clarifies things. Is it true that the actual value of annotations never need to be returned in the current implementation?

jviotti commented 2 months ago

Correct. You never need to "retrieve" the annotation values, just check containment given the adjacent and non-adjacent algorithms. That said, we do pass references to the value at the time of emission to the evaluation callback to the user for them to hook on them if needed (for whatever they are doing or for reporting reasons). But internally, the evaluator doesn't need them.

tony-go commented 2 months ago

hey @jviotti :)

I gave a first look at this document.

Here are my questions mark:

I struggle to understand why do we have two pointers here. Not why we pass two pointers, because I guess that it is due consumer constraints. But rather, why should we store them for annotations. The way that I understand it is that evaluate_path is just an extension of instance_location. Then I wonder if we could not have a path splited as tree. Or maybe do you prefer to use a Map because its way more easier to implement?
Regarding the defines_adjacent and defines APIs. When you said "from keywords bar or baz": Is that mean that const Set<String> &keywords contains keyword that have a "OR" relationship? What if we want a "AND" relationship?

jviotti commented 2 months ago

@tony-go

I struggle to understand why do we have two pointers here. Not why we pass two pointers, because I guess that it is due consumer constraints. But rather, why should we store them for annotations. The way that I understand it is that evaluate_path is just an extension of instance_location. Then I wonder if we could not have a path splited as tree. Or maybe do you prefer to use a Map because its way more easier to implement?

They are not related to each other. Each represent an orthogonal path that might change on its own. For example, consider this schema and instance based on the one I showed above:

// Schema
{
  "properties": {
    "foo": { "type": "string" }
  },
  "additionalProperties": {
    "type": "integer"
  }
}

// Instance
{ "foo": "bar", "one": 1, "two": 2 }

The evaluation with regards to those pointers goes like this :

Evaluation path "" (the root of the schema) to instance location "" (the root of the instance)
Evaluation path "/properties" to instance location "", as we evaluate properties against the top level object
Evaluation path "/properties/foo/type" to instance location "/foo", as we evaluate the foo subschema against foo
Evaluation path "/additionalProperties" to instance location "/one", the first additional property
Evaluation path "/additionalProperties" to instance location "/bar", the second additional property

As you see, each moves pretty much independently and you cannot consider one as an extension of the other.

I suggested a map above as a theoretical representation, but if we can do a tree, that sounds good too. Whatever satisfies those requirements.

Regarding the defines_adjacent and defines APIs. When you said "from keywords bar or baz": Is that mean that const Set &keywords contains keyword that have a "OR" relationship? What if we want a "AND" relationship?

It is an OR. I'll clarify!

jviotti commented 2 months ago

@tony-go @michaelmior One thing that could help with allocations is the observation that for the evaluation path, its JSON Pointer tokens almost always correspond to JSON Schema keywords (i.e. /propreties/foo/items/type). In that way, the tokens are not completely arbitrary and repeat quite often. For example, most schemas use properties and type.

One interesting thing we could do is have a variant of the sourcemeta::jsontoolkit::Pointer class (or extend that one?) so that you can instantiate it based on a "pool" of potential property tokens, so that instead of allocating the strings every time, we reference them from the pool?

i.e. imagine you can construct the pointer with a second argument that is the pool or something like that

michaelmior commented 2 months ago

so that you can instantiate it based on a "pool" of potential property tokens

I think that makes sense. I suspect that you'll have a lot of repetition even among tokens that aren't keywords, so the pool approach makes sense to me even without pre-populating the pool.

tony-go commented 2 months ago

@jviotti

As you see, each moves pretty much independently and you cannot consider one as an extension of the other.

Yeah, way more clear now! Also I wonder if we could not rename them to:

shcema_location / instance_location
schema_pointer / instance_pointer

I think it would be less confusing if it was like that. WDYT?

It is an OR. I'll clarify!

❤️

jviotti commented 2 months ago

I think it would be less confusing if it was like that. WDYT?

We shouldn't be inventing our own names. instance_location and evaluate_path are the right terms defined by the JSON Schema specification itself.

jviotti commented 2 months ago

@michaelmior

Perfect! That clarifies things. Is it true that the actual value of annotations never need to be returned in the current implementation?

Actually, I'm realising there is one case where we do need this: prefixItems emits an index of the last array position covered by it, which items needs to get. So I guess you do need to get all of the annotations produced for a certain location after all

jviotti commented 4 weeks ago

Closing this, given the work we are doing to make most of the compiler annotation agnostic

sourcemeta / jsontoolkit