Feature for defining data sources/relationships

json-schema-org / json-schema-vocabularies

Experimental vocabularies under consideration for standardization


Feature for defining data sources/relationships #26

Open awwright opened 6 years ago

awwright commented 6 years ago

This issue proposes a set of keywords that would let schemas reference external assertions about the permitted range of instances/values.

It is a different way of solving many of the same problems that $data and similar keywords hope to address, but in a more flexible fashion.

The basis of the proposal is a keyword that lets authors define the relationship of the value to other data:

{
   type: "object",
   title: "Purchase Order",
   properties: {
      "order_id": {
         type: "number"
      },
      "customer_uid": {
         type: "integer",
         valueRange: "http://example.com/type/user/uid"
      }
   }
}

This specifies that the "customer_uid" property must be an integer, and must be within some externally defined range of legal values (identified by a URI).

How this is implemented would vary between validators. Validators might allow coders to implement custom handlers, and do something along the lines of:

validator.register("http://example.com/type/user/uid", function(uid){
    return db.select("SELECT * FROM user WHERE uid=@0", uid).count() > 0;
});
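To make the handler idea concrete, here is a minimal sketch of how a validator might dispatch valueRange to a registered handler. The RangeRegistry class and its validate method are hypothetical names invented for illustration, not an existing validator API, and an in-memory Set stands in for the database query.

```javascript
// Hypothetical registry: maps a range URI to a membership predicate.
class RangeRegistry {
  constructor() {
    this.handlers = new Map();
  }

  register(uri, predicate) {
    this.handlers.set(uri, predicate);
  }

  // Checks only `type: "integer"` and the proposed `valueRange` keyword.
  validate(schema, value) {
    if (schema.type === "integer" && !Number.isInteger(value)) {
      return { valid: false, error: "expected an integer" };
    }
    if (schema.valueRange !== undefined) {
      const handler = this.handlers.get(schema.valueRange);
      if (!handler) {
        return { valid: false, error: `no handler for ${schema.valueRange}` };
      }
      if (!handler(value)) {
        return { valid: false, error: `value not in range ${schema.valueRange}` };
      }
    }
    return { valid: true };
  }
}

// Assumption: an in-memory set of known uids replaces the database lookup.
const knownUids = new Set([1, 2, 3]);
const registry = new RangeRegistry();
registry.register("http://example.com/type/user/uid", (uid) => knownUids.has(uid));

const customerUidSchema = {
  type: "integer",
  valueRange: "http://example.com/type/user/uid",
};
console.log(registry.validate(customerUidSchema, 3).valid);  // true
console.log(registry.validate(customerUidSchema, 42).valid); // false
```

In this sketch a missing handler is treated as a failure; an implementation could just as reasonably report the keyword as unchecked.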

Other keywords would automatically keep track of which values become legal to use:

{
   type: "object",
   title: "Purchase Order",
   valueDefine: "http://example.com/type/user",
   valueUniqueKeys: "http://example.com/type/user/uid",
   properties: {
      "uid": {
         type: "number",
         valueDefine: "http://example.com/type/user/uid"
      },
      "name": {
         type: "string"
      }
   }
}

Here, the valueDefine keyword registers the instance value as legal, when other instances assert a value must be within the given range.

Similarly, the valueUniqueKeys keyword specifies that any values defined under the range name can be used to uniquely identify the whole, larger object they're found within. With these two keywords, I can ask the validator "Find me the document where uid = 3" and it returns the entire instance, with the properties {uid: 3, name: "Alice"}.
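One possible reading of these two keywords, sketched under assumptions: DefinedValues, ingest, inRange, and find are hypothetical names, and the sketch only looks at the direct properties of one object (it ignores the object-level valueDefine and does not enforce uniqueness).

```javascript
// Hypothetical store for the proposed `valueDefine` / `valueUniqueKeys` keywords.
class DefinedValues {
  constructor() {
    this.ranges = new Map(); // range URI -> Set of values defined so far
    this.index = new Map();  // key range URI -> Map(value -> whole instance)
  }

  // Records every property value whose subschema carries `valueDefine`.
  ingest(schema, instance) {
    for (const [name, sub] of Object.entries(schema.properties || {})) {
      if (sub.valueDefine === undefined || !(name in instance)) continue;
      const value = instance[name];
      if (!this.ranges.has(sub.valueDefine)) {
        this.ranges.set(sub.valueDefine, new Set());
      }
      this.ranges.get(sub.valueDefine).add(value);
      // If this range is the object's unique key, index the whole instance by it.
      if (schema.valueUniqueKeys === sub.valueDefine) {
        if (!this.index.has(sub.valueDefine)) {
          this.index.set(sub.valueDefine, new Map());
        }
        this.index.get(sub.valueDefine).set(value, instance);
      }
    }
  }

  // What a `valueRange` assertion elsewhere would consult.
  inRange(uri, value) {
    return this.ranges.has(uri) && this.ranges.get(uri).has(value);
  }

  // "Find me the document where uid = 3".
  find(uri, value) {
    const byValue = this.index.get(uri);
    return byValue ? byValue.get(value) : undefined;
  }
}

const uidRange = "http://example.com/type/user/uid";
const userSchema = {
  type: "object",
  valueDefine: "http://example.com/type/user",
  valueUniqueKeys: uidRange,
  properties: {
    uid: { type: "number", valueDefine: uidRange },
    name: { type: "string" },
  },
};

const store = new DefinedValues();
store.ingest(userSchema, { uid: 3, name: "Alice" });
console.log(store.inRange(uidRange, 3)); // true
console.log(store.find(uidRange, 3));
```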

I'm going to continue refining this, and maybe take some inspiration from JSON-LD; however I think this is in a good state to ask for feedback.

Related issues: json-schema-org/json-schema-spec#340

handrews commented 6 years ago

Both this and $data require adding a step to fetch something external to the schema at runtime before assertions are evaluated, so to me that is the key issue that we need to decide on. Where $data restricts that notion of "external" to the instance (which is already involved in processing), this proposal extends the concept even further to allow any external data source.

To me this feels like an even larger change than $data. Is that the intention or am I missing something here? It would take us from schema evaluation as a function of (schema, instance) and make it a function of (schema, instance, [arbitrary data sources])

How does the principle of least power apply to these use cases?

awwright commented 6 years ago

Good points.

This does allow any data source to be referenced, and I think that flexibility is desirable; I don't want to encourage people to shoehorn more JSON data into their document than would otherwise be wise, just so they can use this feature.

I don't think it's too much more or less powerful than $data except for a few points:

This creates new keywords that encourage authors to be more transparent about what they intend to do. Because the statements are just declaring things, like "index this value", applications can do useful things with those statements, like offer to search the indexed values.
This doesn't add behavior to existing keywords.
- Aside: For $data, it would probably be preferable to suffix keywords with something that indicates the property value isn't literal and needs additional parsing, like { "minimum*": "0/field" }
$data is somewhat opaque about what the author intends. If you write a statement like (as above) { properties: { "top": { "minimum*": {"$data": "0/bottom" } } } }, how would the validator know to say "Expected top to be greater than bottom"? The $data syntax is too powerful for a validator to discern this without implementing some sort of pattern-matching AST.

gregsdennis commented 6 years ago

It should be noted that separating data from logic (even if that logic is data, e.g. a schema document) is a common practice. This strategy would allow the data to be updated without having to update the schema and risk changing the logic.

Most notably in the .Net world is Jon Skeet's NodaTime library. About a year ago, Jon moved to publishing NodaTime in multiple Nuget packages, one for logic and one for calendar data. The one for logic would be updated only for bug fixes and such, while the calendar data one would be updated for data accuracy.

handrews commented 6 years ago

@awwright Thanks for the responses, they've been very helpful and thought-provoking, and I'm really glad that you're tackling this.

@gregsdennis that's a good point about data vs logic separation, also very helpful for setting a wider context. Sometimes I forget to look past JSON Schema and particular aspects of hypermedia. Your comment made me re-think some things. I'm generally sympathetic to arguments based in widespread best practices.

I think there are two orthogonal concerns here:

What data sources are allowed?
- $data allowed just the instance
- This proposal allows literally anything
What information should be conveyed by keywords?
- $data conveys no information, it is simply a data pointer, most analogous to $ref
- This proposal requires keywords to convey usage (as most keywords do)

For the first point, I'm still concerned over allowing arbitrary external data sources. Currently, processing a schema and an instance is a function of:

The schema document, and any referred schema documents
The instance document
The schema's base URI (for $ref resolution)
Possibly the instance's base URI (for base and href resolution in Hyper-Schema)

An implementation can be supplied the schema(s) and instance pre-parsed into the data model, so technically an implementation need not handle any sort of parsing.

I don't think it makes sense to require validators to handle arbitrary connections that return arbitrary output. However, I think we can look to $ref for a possible solution. An implementation can automatically dereference $ref (whether over a network or by examining its local schemas), and can expect an application/schema+json document as a result.

I could support saying that keywords can rely on URI-identified external data sources as long as those data sources supply data in the JSON Schema data model. This could either be in the form of an application/json or application/*+json document, or pre-parsed data. Again, this is similar to $ref, for which most implementations cache the parsed schema.

This puts the burden of translation onto the data source, and not on the JSON Schema implementation. As with $ref, the URI is an identifier but not necessarily a locator, so implementations will need to provide some extensibility around resolving such URIs. I think that is fine, as many implementations offer various options and extensibility points for resolving $ref.
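A rough sketch of that kind of extensibility point, assuming a hypothetical DataSourceResolver class: each loader maps a URI to already-parsed JSON data (or undefined if it does not recognize the URI), and results are cached the way many implementations cache parsed schemas for $ref.

```javascript
// Hypothetical resolver: URI -> data already in the JSON data model.
class DataSourceResolver {
  constructor(loaders) {
    this.loaders = loaders; // functions (uri) => data | undefined
    this.cache = new Map();
  }

  resolve(uri) {
    if (this.cache.has(uri)) return this.cache.get(uri);
    for (const load of this.loaders) {
      const data = load(uri);
      if (data !== undefined) {
        this.cache.set(uri, data);
        return data;
      }
    }
    // The URI is an identifier, not necessarily a locator, so nothing may match.
    throw new Error(`cannot resolve data source ${uri}`);
  }
}

// Assumption: a local map stands in for whatever the application supplies.
let loadCount = 0;
const localData = new Map([
  ["http://example.com/type/user/uid", [1, 2, 3]],
]);
const resolver = new DataSourceResolver([
  (uri) => {
    loadCount += 1;
    return localData.get(uri);
  },
]);

const first = resolver.resolve("http://example.com/type/user/uid");
const second = resolver.resolve("http://example.com/type/user/uid");
console.log(first === second, loadCount); // true 1
```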

I'll need to think a bit more on the second point, about purposeful vs generic keywords. In particular, I don't follow your valueDefine example. What does it mean that it "registers the instance value as legal", and why would I do that? valueRange and valueUniqueKeys seem clear enough to me.

handrews commented 6 years ago

@awwright any thoughts on this? Or on how this might address json-schema-org/json-schema-vocabularies#20 or json-schema-org/json-schema-spec#541?

In particular, how would these keywords use instance data, or is the intention specifically that they cannot do so? In which case this proposal actually would not have any overlap with $data.

My first impulse would be to say that the URI of the instance is the base URI for any URI-reference values (similar to how many hyper-schema keywords work).

handrews commented 6 years ago

Bringing over some commentary from json-schema-org/json-schema-spec#541:

I'm proposing that all of the possible $data-tagged features (including but not limited to loading instance data into the schema, loading external data into the schema, and asserting relationships among instance locations) be worked on as a new vocabulary. All of the proposals in this area add substantial complexity, and also do not need to change the existing core and validation specification concepts. With vocabulary support being added in draft-08, having one or more vocabularies for this area would allow it to develop independently of core and validation, which we hope are approaching a final draft. I expect this area would be pretty active with new ideas and feedback, which would delay finalization significantly if added to core or validation (and it's unrelated to hyper-schema, which has its own mechanisms for working with instance data in URI Templates).

@awwright I'm assuming from your thumbs-up on that comment in json-schema-org/json-schema-spec#541 that you're OK with this approach. I'm working on the vocabulary support PRs now, so we can make sure that the vocabulary concepts will support doing this (I can't see any problems with that right now).

pdl commented 6 years ago

It seems to me that there is a general use case of 'As a schema author, I want to make an assertion that this data conforms to something I can refer to but do not necessarily want to write out in full in the JSON Schema I share with everyone', for instance because one or more of the following apply:

the set of allowed values is a large list that is impractical to serve with the schema (but could be retrieved using e.g. a separate HTTP GET request).
the set of allowed values is a list that is not static.
the set of allowed values is a very large list that is impractical to retrieve in full, but can be queried.
the set of allowed values is a list which only some validators have access to.
the set of allowed values is infinite (e.g. 'is a prime number') but membership in the set can be determined by applying some algorithm which is more complex than JSON Schema can express.
the set of allowed values relies on comparing multiple fields in the value (e.g. for this to be valid, this.width must be greater than or equal to this.height / 2).
querying the set of allowed values is expensive or has side-effects.

You can cut this use case various ways - where does the data come from? does it require computation? But I am not sure the distinction is actually helpful, as at most it allows you to squeeze some of the use cases into some HTTP + JSON Schema or something, whether or not those are the right implementation solutions.

I reckon getting data is a red herring. Most of the time what you actually want to do is apply a piece of application code to the value being validated (and maybe some other arguments), and get a boolean value as @awwright demonstrated:

validator.register("http://example.com/type/user/uid", function(uid){
    return db.select("SELECT * FROM user WHERE uid=@0", uid).count() > 0;
});

I suspect that furthermore, there is a need to partially validate schemas, wherein one validator can make qualified affirmations of validity where other validators can confirm total validity, e.g. 'This document is valid for the things I know how to test' vs 'This document is valid and I have tested everything in the schema'. Here I am thinking of a client/server use case. The client only troubles itself to check that the customer_uid is an integer; however, the server also checks that the customer_uid is that of a valid customer (and maybe that the order belongs to the customer as well).
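The client/server split could be sketched as follows. This is only an illustration under assumptions: validateWithCoverage and the complete/unchecked result fields are hypothetical, and valueRange is the keyword proposed earlier in this thread.

```javascript
// Returns a qualified result: `complete` is false when a keyword could not be tested.
function validateWithCoverage(schema, value, handlers) {
  const unchecked = [];
  if (schema.type === "integer" && !Number.isInteger(value)) {
    return { valid: false, complete: true, unchecked };
  }
  if (schema.valueRange !== undefined) {
    const handler = handlers.get(schema.valueRange);
    if (!handler) {
      unchecked.push("valueRange"); // known keyword, but no way to test it here
    } else if (!handler(value)) {
      return { valid: false, complete: true, unchecked };
    }
  }
  return { valid: true, complete: unchecked.length === 0, unchecked };
}

const uidSchema = {
  type: "integer",
  valueRange: "http://example.com/type/user/uid",
};

// The client registers nothing; the server knows the set of valid customers.
const validCustomers = new Set([1, 2, 3]);
const clientHandlers = new Map();
const serverHandlers = new Map([
  ["http://example.com/type/user/uid", (uid) => validCustomers.has(uid)],
]);

console.log(validateWithCoverage(uidSchema, 42, clientHandlers));
// valid for the things the client knows how to test, but not complete
console.log(validateWithCoverage(uidSchema, 42, serverHandlers));
// invalid: the server tested everything
```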

It strikes me that the first step should be to create a syntax which will allow schema authors to refer to 'custom' assertions as extensions in a way which will allow vocabulary and conventions to develop outside the spec, with a view to formalising groups of related vocabularies into standards which validators could then support natively and advertise support. Incidentally, I would like to see the uri in this case to indicate not so much the extension as implemented (which might be language specific), but the feature spec, e.g. a test suite which validates that extension.

This means that new features can be developed, tested, and demonstrated outside of the core. Users can provide their own implementations without having to get their code into the validation engine for their language, which can be kept minimal. If the features are niche, those who want them get to keep them. If they are useful to lots of people, the tests and implementations can be shared as part of an ecosystem. Really useful features might eventually make it into the core.

Is this the problem space which vocabularies would solve @handrews?

awwright commented 5 years ago

I was working on an application and wondering if there are a few standard ways of looking up values that we could describe, and then allow implementations to offer optimizations/alternative versions.

For example, derive a URL from a URI Template like http://example.com/user{?uid}, and an implementation could use a 2xx or 404 response as an indication that the provided value for uid is either valid or invalid (respectively).

Then, implementations could allow optimizations to the HTTP request process. So now, our hypothetical API might go like:

validator.register("http://example.com/user{?uid}", function(uid){
    return db.select("SELECT * FROM user WHERE uid=@0", uid).count() > 0;
});
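The default (unoptimized) path might look roughly like this sketch. Assumptions: expandQueryTemplate handles only the form-style query operator {?name} from RFC 6570 and approximates its percent-encoding with encodeURIComponent; statusToValidity and checkValue are hypothetical names; and the lookup is a synchronous stub where a real one would issue an asynchronous HTTP request.

```javascript
// Expands `{?name}` expressions only; an undefined variable expands to nothing.
function expandQueryTemplate(template, vars) {
  return template.replace(/\{\?([A-Za-z0-9_]+)\}/g, (match, name) =>
    vars[name] === undefined
      ? ""
      : `?${name}=${encodeURIComponent(String(vars[name]))}`
  );
}

// 2xx means the value is valid, 404 means invalid, anything else is inconclusive.
function statusToValidity(status) {
  if (status >= 200 && status < 300) return true;
  if (status === 404) return false;
  return undefined;
}

// `lookup` is injected so an implementation can swap HTTP for a local shortcut.
function checkValue(template, vars, lookup) {
  return statusToValidity(lookup(expandQueryTemplate(template, vars)));
}

// Stub standing in for an HTTP request: only uid=3 exists.
const fakeLookup = (url) => (url === "http://example.com/user?uid=3" ? 200 : 404);

console.log(expandQueryTemplate("http://example.com/user{?uid}", { uid: 3 }));
// http://example.com/user?uid=3
console.log(checkValue("http://example.com/user{?uid}", { uid: 3 }, fakeLookup)); // true
console.log(checkValue("http://example.com/user{?uid}", { uid: 9 }, fakeLookup)); // false
```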

kenisteward commented 5 years ago

@awwright I was literally just thinking that for an implementation I'm looking for here!! https://github.com/formly-js/ngx-formly/issues/1056

In this library, you can generate UI forms using a jsonschema as of the newest beta. I would like to also add to that being able to generate validations in a standard way. Right now, I'd have to rely on an extra "validation.json" with my predefined rules to generate that and also share that rule set and parser with the server.

In my case, I have a server with arbitrary validations that we want to apply across fields or across internal systems. For that, I want the call to simply return 2XX for a valid value and 400 for an invalid one. On the client, I want to be able to define things like lastName requires firstname. The dependencies keyword is very good for "directly requiring a value exists", but as you know it is not good for requiring that a property exists with a particular value.

Is there something I can do here to help with this? Do we just need ideas at this point? I am very interested in this feature set and can most likely convince my team to allow me to offer some regular sprint time towards helping.

gregsdennis commented 5 years ago

... generate validations in a standard way.

@kenisteward you may be interested in json-schema-org/json-schema-spec#643 as well.

kenisteward commented 5 years ago

@gregsdennis Thanks for the insight! json-schema-org/json-schema-spec#643 is not actually in the standard yet, right? Will it be soon? This will be very helpful for error reporting on our APIs. I wasn't able to find the corresponding PR, which I was hoping to use to add this as an extension to a validator for our usage.

gregsdennis commented 5 years ago

It's planned for draft-08, which is slated for this month/year/🤞. There's not a PR on it yet. I typed up the issue based on conversations in another issue, but I'm not the spec author type.

handrews commented 4 years ago

@awwright Since this is proposing a set of new keywords, I'm going to move it to the vocabularies repo. If you think that is in error, please feel free to move it back. From reading back through this, I think that json-schema-org/json-schema-spec#855 covers the general underlying issues, such as whether/how a keyword can access the instance or external data sources.

There's a lot of great discussion here and I don't want to close it. But I think continuing the effort in the vocabulary repo is the right move. Again, please move back if you disagree.

awwright commented 4 years ago

That makes sense. We should get into the habit of developing proposals like this as vocabularies.