json-schema-org / json-schema-vocabularies

Experimental vocabularies under consideration for standardization

Proposal: `equation` keyword #56

Open ExeVirus opened 2 days ago

ExeVirus commented 2 days ago

This is an initial proposal, and will be updated in the future to conform to contribution guidelines for Active Vocabularies

This proposal is for the addition of an "equation" keyword to the JSON Schema specification. This keyword allows expressing complex validation logic using a human-readable, regex-inspired syntax called Mathex. This simplifies validation rules, especially those involving multiple properties, and reduces the reliance on complex, language-specific code.

Motivation:

Current JSON Schema validation primarily focuses on individual properties (type, range, format, etc.). Expressing complex relationships between properties often requires external validation code or custom keywords. This can lead to:

Relying on external code for validation logic is exactly what a schema is meant to resolve.

Furthermore, there have been many open issues related to numeric validation beyond the very limited type, min, max, etc. keywords. These almost always end up being some sort of equation or dependent logic.

Proposed Solution:

The equation keyword addresses many of these challenges by introducing Mathex, a concise and portable validation language, directly within the schema.

1. equation Keyword for Single Properties:

For individual properties, the "equation" keyword takes a string representing a Mathex equation. The value of the property being validated is substituted for the variable 'A' in the equation.

{
  "type": "object",
  "properties": {
    "age": {
      "type": "integer",
      "equation": "A > 0 && A % 2 == 0" // Validates that age is greater than 0 and even
    }
  }
}

2. equation Keyword for Objects (using an array):

For validating relationships between multiple properties within an object, the "equation" keyword is an array. The first element is a string representing the Mathex equation. Subsequent elements are strings containing relative JSON Pointers corresponding to the variables used in the equation, in order of appearance.

{
  "type": "object",
  "properties": {
    "width": { "type": "number" },
    "height": { "type": "number" },
    "area": { "type": "number" }
  },
  "equation": [
    "A == B * C", // The equation
    "/area",      // First variable  (/area)
    "/width",     // Second variable (/width)
    "/height"     // Third variable  (/height)
  ]
}
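
As a rough illustration of the binding step, here is a minimal Python sketch. It assumes the pointers are resolved against the object being validated and that letters are assigned in the order the pointers are listed. `resolve_pointer` and `bind_variables` are hypothetical helper names, and the final comparison is written in plain Python as a stand-in for a real Mathex interpreter.

```python
from string import ascii_uppercase


def resolve_pointer(instance, pointer):
    """Resolve a '/a/b' style pointer against nested dicts and lists (hypothetical helper)."""
    current = instance
    for token in pointer.split("/")[1:]:
        # Unescape as in RFC 6901: ~1 -> '/', then ~0 -> '~'
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]
        else:
            current = current[token]
    return current


def bind_variables(instance, equation_keyword):
    """Map A, B, C, ... to the values named by the pointers, in listed order."""
    expression, *pointers = equation_keyword
    values = {
        ascii_uppercase[i]: resolve_pointer(instance, pointer)
        for i, pointer in enumerate(pointers)
    }
    return expression, values


keyword = ["A == B * C", "/area", "/width", "/height"]
instance = {"width": 4, "height": 5, "area": 20}

expression, variables = bind_variables(instance, keyword)

# Stand-in for a Mathex evaluator: check this one equation by hand.
is_valid = variables["A"] == variables["B"] * variables["C"]
```

Here `variables` ends up as `{"A": 20, "B": 4, "C": 5}`, so the hand-written check passes.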

3. Mathex Syntax and Semantics:

Mathex follows C++ operator precedence with the addition of '^' for exponentiation. It supports:

For a more detailed understanding of what Mathex proposes beyond what plain old high school algebra provides, see a draft Lua implementation here

4. Implementation and Validation:

JSON Schema validators would need to incorporate a Mathex interpreter. The validator would:

5. Error Reporting:

Clear error messages should indicate the failed equation and involved properties, associating JSON Pointers with their corresponding variables.
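
One possible shape for such a message, sketched in Python. The function name `format_equation_error` and the message layout are assumptions for illustration only; nothing here is prescribed by the proposal.

```python
from string import ascii_uppercase


def format_equation_error(expression, pointers, values):
    """Build a message that pairs each variable letter with its pointer and value."""
    bindings = ", ".join(
        f"{ascii_uppercase[i]} = {pointer} ({value!r})"
        for i, (pointer, value) in enumerate(zip(pointers, values))
    )
    return f"equation failed: {expression!r} with {bindings}"


message = format_equation_error(
    "A == B * C",
    ["/area", "/width", "/height"],
    [21, 4, 5],
)
# message == "equation failed: 'A == B * C' with A = /area (21), B = /width (4), C = /height (5)"
```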

Example: Complex Validation:

{
  "type": "object",
  "properties": {
    "score": { "type": "integer" },
    "bonus": { "type": "integer" },
    "level": { "type": "integer" }
  },
  "equation": [
    "(A + B) > 1000 && C >= 5", // Equation
    "/score",                   // A = /score
    "/bonus",                   // B = /bonus
    "/level"                    // C = /level
  ]
}

6. Aircraft Altitude Validation Example:

This example demonstrates validating an aircraft's altitude within a specified range using the equation keyword and ECEF coordinates. We'll assume a simplified Earth model where the Earth's radius is a constant, and altitude is simply the distance from the Earth's center minus the Earth's radius, with a minimum elevation of -500 feet and max of 500,000 feet.

{
  "type": "object",
  "properties": {
    "x": { "type": "number", "description": "ECEF X coordinate (meters)" },
    "y": { "type": "number", "description": "ECEF Y coordinate (meters)" },
    "z": { "type": "number", "description": "ECEF Z coordinate (meters)" }
  },
  "equation": [
    "((A^2 + B^2 + C^2)^(0.5) - 6371000) > -152.4 && ((A^2 + B^2 + C^2)^(0.5) - 6371000) < 152400",
    "/x",
    "/y",
    "/z"
  ]
}
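
To sanity-check the numbers, the same test can be written out in Python. This is only a worked example of the arithmetic under the simplified spherical-Earth assumption stated above; the function names and the sample coordinates are made up for illustration.

```python
import math

EARTH_RADIUS_M = 6371000   # constant radius used in the equation
FEET_TO_M = 0.3048
MIN_ALT_M = -152.4         # -500 ft
MAX_ALT_M = 152400         # 500,000 ft


def altitude_m(x, y, z):
    """Distance from the Earth's center minus the constant radius, in meters."""
    return math.sqrt(x**2 + y**2 + z**2) - EARTH_RADIUS_M


def altitude_in_range(x, y, z):
    """Same check as the equation: lower bound < altitude < upper bound."""
    return MIN_ALT_M < altitude_m(x, y, z) < MAX_ALT_M


cruising = (6381000.0, 0.0, 0.0)   # 10 km above the model surface
too_high = (7000000.0, 0.0, 0.0)   # 629 km above the model surface

cruising_ok = altitude_in_range(*cruising)
too_high_ok = altitude_in_range(*too_high)
```

The first point passes and the second is rejected, which matches the bounds in the equation.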

Explanation:

Summary

gregsdennis commented 1 day ago

It's an interesting proposal. I have a single immediate comment, but I'll read through in more detail later.

I have an allergic reaction to index-significant arrays. While more verbose, I think an object would serve better here.

{
  "equation": "(A + B) > 1000 && C >= 5", // Equation
  "A": "/score",                          // A = /score
  "B": "/bonus",                          // B = /bonus
  "C": "/level"                           // C = /level
}

This would also allow users to use the variables that make sense to them instead of a pre-defined A, B, C, ... variable set.
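
A minimal Python sketch of name-based binding for this layout. It assumes every key other than "equation" is a variable name, and that each pointer names a top-level property; `bind_named` is a hypothetical helper, and the final check is plain Python rather than an expression evaluator.

```python
def bind_named(instance, keyword):
    """Return the expression and a name -> value mapping for the object form."""
    expression = keyword["equation"]
    variables = {
        name: instance[pointer.lstrip("/")]   # assumption: top-level properties only
        for name, pointer in keyword.items()
        if name != "equation"
    }
    return expression, variables


keyword = {
    "equation": "(A + B) > 1000 && C >= 5",
    "A": "/score",
    "B": "/bonus",
    "C": "/level",
}
instance = {"score": 900, "bonus": 200, "level": 5}

expression, variables = bind_named(instance, keyword)

# Hand-written equivalent of the expression, standing in for an evaluator.
passes = (variables["A"] + variables["B"]) > 1000 and variables["C"] >= 5
```

Because the lookup is by name, reordering the keys does not change which value each variable receives.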


Quite separately, have you looked at JSON-e? It can actually do quite a lot of what you're proposing.

jviotti commented 1 day ago

:+1: On JSON-e. I think we should avoid defining stringified complex computational syntax.

mwadams commented 1 day ago

@jviotti And yet we support Regex 😉

mwadams commented 1 day ago

I am all for custom vocabularies that support this kind of "business logic" validation.

I think that it is worth (re-)stating my view of the core vocabulary which is that it gives you the tools to ensure that the JSON is structurally valid, and that a consumer can reason about the platform-primitive data types and structures it can use to then process that JSON. Business rule validation is one such process.

However a custom vocabulary can embody any business rules and access to internal/external state, and then be applied either concurrently or separately.

On this particular proposal - I agree with @gregsdennis's suggestion that it becomes an object. While I do think tuples have their place, I think the ability to name the parameters to the equation is really important.

One concern I have is about numerical types.

This is OK for lowest-common-denominator floating point math, but how does it deal with floating point equality, integers, overflows, and interactions with other constraints like format?

This kind of thing is very "implementation dependent" but we are elevating it up to "vocabulary" level - and that needs strict definitions for how all the operators work.
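
To make the concern concrete, here is a small Python illustration using IEEE 754 doubles. The values are arbitrary and the tolerance is an assumption; the point is only that an equation like "A == B * C" needs a defined comparison rule.

```python
import math

width, height, area = 0.1, 3, 0.3

# Exact comparison fails: 0.1 * 3 evaluates to 0.30000000000000004 in doubles.
exact_match = area == width * height

# A tolerance-based comparison accepts the same data (rel_tol is an arbitrary choice).
tolerant_match = math.isclose(area, width * height, rel_tol=1e-9)

# Large integers stop being exact once they pass through a double.
big = 2**53 + 1
survives_float_round_trip = int(float(big)) == big
```

Two implementations that differ only in how they compare or store numbers would disagree on whether this instance is valid.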

ExeVirus commented 1 day ago

Great feedback,

Being transparent: I'm proposing the addition both to help json-schema and to introduce the concept of a standardized mathex to augment regex in many (all?) languages.

So both your points introduce a complexity I'll have to think through. For tokenization of the string expression, ALL CAPS makes life a lot easier, and I agree there's a lot of value in named variables.

At initial thought, I see no reason it cannot be:

"AREA == LENGTH * WIDTH"

Some languages do not support nicely named variables like json inherently does, which is why I went with A, B, C originally. For example, in most programming languages, to use mathex you'd be doing a call like:

If(mathex("AREA == LENGTH * WIDTH", area, length, width))

But order of appearance is acceptable for those languages.

In the case of json-schema I see no reason the equation cannot be an object:

"equation": {
  "expression": "AREA == LENGTH * WIDTH",
  "variables": {
    "AREA": "/area",
    "LENGTH": "/length",
    "WIDTH": 55
  }
}

Now, that said, there is an issue I foresee related to meta-schema validation. Specifically, I cannot specify a validator for the equation expression, so a syntax mistake would only become apparent at runtime or in the implementation that reads the schema:

E.g. "AREA =+= LENGTH * WIDTH" < error
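
For illustration, an implementation could at least reject this kind of mistake when it loads the schema. The Python sketch below is a lexical check only, with an assumed token set (numbers, ALL CAPS names, and a guessed operator list); it is not the Mathex grammar and it would not catch errors such as two operators in a row.

```python
import re

# Assumed token set: numbers, ALL CAPS names, then operators (two-character ones first).
TOKEN = re.compile(
    r"\s*(?:(\d+\.?\d*)|([A-Z_]+)|(==|!=|>=|<=|&&|\|\||[-+*/%^<>()!&|]))"
)


def tokenize(expression):
    """Split an expression into tokens, raising ValueError on anything unrecognized."""
    expression = expression.rstrip()
    tokens = []
    pos = 0
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if match is None:
            raise ValueError(
                f"unrecognized input at offset {pos}: {expression[pos:]!r}"
            )
        tokens.append(match.group(match.lastindex))
        pos = match.end()
    return tokens


good = tokenize("AREA == LENGTH * WIDTH")

try:
    tokenize("AREA =+= LENGTH * WIDTH")
    bad_rejected = False
except ValueError:
    bad_rejected = True   # a lone '=' is not in the assumed operator set
```

This runs inside the implementation rather than the meta-schema, so the limitation described above still applies.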

As for the final point about the complexity of dealing with integers, floats, etc. I have been working through those myself, but it has not been fully finished, as mathex itself is still in a draft state. In this sense it's not regex.

The thinking is: floating point is the default, with 'floor()' available for explicit coercion to integers, and with an implicit floor() applied by integer operators like %, |, &, etc. The integer being coerced to is an int64 in all cases. I'm not yet sure whether I need to support a uint64() or an int64() function for explicit coercion beyond floor; this is still subject to change.

As an example, this expression is true in mathex:

"15.16 % 3.2 == 0"

Which is equivalent to:

"floor(15.16) % floor(3.2) == 0"

Because the % operator forces integer coercion.
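
In Python terms, the draft rule would look roughly like this. `mathex_mod` is a hypothetical name for the coercing operator, and Python's own float remainder is shown only for contrast.

```python
import math


def mathex_mod(a, b):
    """Draft Mathex '%': floor both operands to integers, then take the remainder."""
    return math.floor(a) % math.floor(b)


coerced = mathex_mod(15.16, 3.2)     # floor(15.16) % floor(3.2) -> 15 % 3
expression_is_true = coerced == 0

python_float_mod = 15.16 % 3.2       # Python keeps the floats: roughly 2.36
```

Note that Python's `%` takes the sign of the divisor while C's takes the sign of the dividend, so negative operands would also need an explicit rule.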

If, at this time, there is continued interest in this concept, I can put more work into mathex itself to define these different mechanisms, but I'm unlikely to complete detailed tests in a short time period. I.e. this might be a 1-2 year away proposal. But I figured I'd start with the end in mind and get feedback early before adding further rigor to the concept.

ExeVirus commented 1 day ago

And to address json-e, i.e. "eval in a tree-like structure", it's absolutely a more enlightened approach to the problem of expression construction.

In the case of something like json-schema, it may make more sense than a plaintext, flat file representation.

Both will have extremely similar end states and uses, with json-e winning out with readability and writability in an editor, while losing in portability and ease of pickup.

So if that's a tradeoff that you're willing to take, then go down the route of json-e and feel free to recommend closing this proposal (or at the very least waiting until mathex is either more rigorous and standardized or dead).

mwadams commented 1 day ago

For what it is worth, I really like the idea of a well-defined, rigorous MathEx.

gregsdennis commented 1 day ago

Some languages do not support nicely named variables like json inherently does

  1. JSON doesn't support variables; it's not a programming language. It supports property names, but those names can be literally any string.
  2. You're building the syntax, so you get to make the rules. You don't have to consider what other programming languages use except maybe to consider what others may be used to. DO NOT expect these expressions to be directly interpreted by the implementation in the underlying language. Original JSON Path did that, and it resulted in paths that were highly incompatible between implementations. For RFC 9535, we did away with that and defined our own (albeit less expressive) expression syntax. JSON-e does the same.

If(mathex("AREA == LENGTH * WIDTH", area, length, width))

Because the % operator forces integer coercion.

These two tell me that you're working with some bias to a specific language, though I can't say what it might be.

The "mathex" function/expression isn't a standard feature across languages (or even common in the ones I've seen). JavaScript has an eval() which might be similar, and I expect other interpreted languages have some equivalent, but compiled languages usually don't.

Also, in many languages (e.g. C#), the % operator doesn't coerce its operands to integers.

this might be a 1-2 year away proposal.

You should have a look through the other proposals. A lot of them have been around for quite a bit longer than that. Most of them just kind of stagnated. It'll be up to you to continue developing and promoting this. When we see more interest, the next step would be to pull it into the new feature proposal process.

Of course, there's also nothing stopping you from implementing support for it in your favorite implementation, if it supports custom keywords.

... json-e winning out with readability and writability in an editor, while losing in portability and ease of pickup.

Having implemented them both (json-everything), I can say that they do serve quite different purposes. There are definitely things you can do with both, but that overlap is pretty small, I think. While you can drive a screw with a pair of pliers, you should probably just use a screwdriver.

For your case, where you want to evaluate expressions, I think JSON-e is the right tool.

But as I said before, feel free to continue developing and promoting this feature. I'll leave the issue open.

ExeVirus commented 1 day ago

You don't have to consider what other programming languages use except maybe to consider what others may be used to. DO NOT expect these expressions to be directly interpreted by the implementation in the underlying language.

The only point of such a limited plaintext expression syntax is exactly this goal (being interpreted by the implementation). Regex is a language that is useful mostly because of its ubiquity across languages, and it is always interpreted directly by the underlying language. The supported regex features do differ across them, but the core syntax is very similar or identical. That is the goal of MathEx: provide a common, portable language for expressions, purely for matching/validation, like regex.

Thanks for the feedback - as for the "specific language" yes I am aware languages treat operators differently. The project will require an effort in selecting lowest common denominators for most languages and implementations in more restricted ones will likely have to add functions to support the common set.

So while C# supports % with floats, MathEx may not end up specifying it that way if most languages support only the integer version of %. Most of my language experience is C/JavaScript/Lua/Python/Java/Visual Basic/Zig, and I'm aware those biases are showing at this early stage haha.

jviotti commented 1 day ago

@jviotti And yet we support Regex 😉

@mwadams True, but then regexes are somewhat standardised. My point was to indeed take something like JSON-e instead of defining our own custom thing. There are surely many already specified transformation languages out there to adopt vs invent.

mwadams commented 1 day ago

@jviotti And yet we support Regex 😉

@mwadams True, but then regexes are somewhat standardised. My point was to indeed take something like JSON-e instead of defining our own custom thing. There are surely many already specified transformation languages out there to adopt vs invent.

Absolutely: regex is "somewhat" standard and yet it still causes us no end of trouble.

Independent of this vocab proposal, I do like the idea of a well-standardized and widely adopted math expression syntax designed for interoperability.