`Boolean` and `Value` (and their optional variants) should be distinct

ietf-wg-jsonpath / draft-ietf-jsonpath-base

Development of a JSONPath internet draft

https://ietf-wg-jsonpath.github.io/draft-ietf-jsonpath-base/

`Boolean` and `Value` (and their optional variants) should be distinct #387

Closed gregsdennis closed 1 year ago

gregsdennis commented 1 year ago

I think this relates to #366.

An expression boolean (e.g. the result of @.a==2) is not a JSON value. This is something that we decided long ago. However Table 13 describes it as "Value(true) or Value(false)". This is wrong.

If we allow OptionalBoolean to be a subtype of OptionalValue, then paths like $[?length(1)] and $[?match(@.a,@.b)=="string"] will be valid and produce no parsing errors. In #360 (see also #365), we decided that this wouldn't be the case. match() only operates as a boolean, and length() only operates as a value.

OptionalBoolean and OptionalValue MUST be distinct in order for this to work.

Edit

This issue is about the role a function plays within an expression. That role is determined by its return value.

cabo commented 1 year ago

Since we don't allow comparison as a function argument, I don't think 1==2 has a type.

gregsdennis commented 1 year ago

I don't see how that's relevant. I'm not talking about argument types. I'm talking about return types and the role of functions within expressions.

match() returns an expression boolean (not a JSON boolean), which is valid where booleans are valid in expressions, i.e. not in comparisons.

length() returns a JSON-like value, which is valid where values are valid in expressions, i.e. only in comparisons.
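The two roles described above can be sketched as a static check. This is only an illustration of the proposal in this comment, not anything the draft defines: the names `Kind`, `Context`, `RETURN_KIND`, `ALLOWED`, and `check` are hypothetical.

```python
from enum import Enum


class Kind(Enum):
    VALUE = "value"      # JSON-like value, e.g. the result of length()
    BOOLEAN = "boolean"  # expression boolean, e.g. the result of match()


class Context(Enum):
    TEST = "test-expr"
    COMPARABLE = "comparable"


# Hypothetical registry of function return kinds.
RETURN_KIND = {"length": Kind.VALUE, "match": Kind.BOOLEAN, "search": Kind.BOOLEAN}

# Under the proposal, each return kind is valid in exactly one context.
ALLOWED = {Kind.VALUE: Context.COMPARABLE, Kind.BOOLEAN: Context.TEST}


def check(function_name: str, context: Context) -> bool:
    """Return True if the named function may appear in the given context."""
    return ALLOWED[RETURN_KIND[function_name]] is context
```

With this table, `match()` passes the check only as a test-expr and `length()` only as a comparable, so both `$[?length(1)]` and `$[?match(@.a,@.b)=="string"]` would be rejected before evaluation.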

gregsdennis commented 1 year ago

This distinction is implied because of the decisions that we've already made:

in the issue and PR linked above
that we aren't going to support type coercion
that a JSON-true is distinct from expression-true

gregsdennis commented 1 year ago

From Section 2.6.2:

If it occurs as a filter-path in a test expression, the function is defined to have result type OptionalNodes or one of its subtypes, or to have result type OptionalBoolean or one of its subtypes.

If it occurs as a comparable in a comparison, the function is defined to have result type OptionalNodeOrValue or one of its subtypes.

If we allow OptionalBoolean to be a subtype of OptionalNodeOrValue, then the above decisions are contradicted.

gregsdennis commented 1 year ago

In the interim we discussed that a function can return one of three kinds of data:

a value (e.g. length())
a boolean (e.g. match())
a nodelist (e.g. a hypothetical distinct() function)

Value

Value functions can only be used in a comparison (i.e. as a comparable). This wasn't disputed.

Boolean

I argue that boolean functions can only be used as operands in a logical expression. Effectively they are test-expr. The value that they return cannot be translated/lowered/casted/converted/coerced into the JSON literals true and false and so they are not comparable. (This discussion prompted #389, as linked above.)

Nodelist

We don't have any functions that return a nodelist currently. The hypothetical distinct() function (or something like it) that takes a nodelist and returns that nodelist with duplicates removed was proposed during the interim. A use case could be for finding elements with more than x distinct values, like $[?count(distinct(@))>1].

The key here, though, is that this would only ever be used as:

an argument for another function (as above), which isn't part of this discussion
a test-expr, which is the same as the boolean return scenario (because nodelists aren't comparable).

The role of this return type is the same as a boolean return type.

What I want to highlight from this discussion is that a boolean isn't a value, so Table 13 is wrong when it lists

Boolean as a subtype for Value
OptionalBoolean as a subtype for OptionalValue

There was an interesting scenario that was mentioned in the interim: $[?match(@.a, 'a.*')==@.b]. (This may need to be moved to another issue, but it came up during this discussion, so it's here. For now, I just want to explore the use case.)

This finds all elements where @.b contains the correct result of whether @.a matches 'a.*', for example in this data:

[
  {"a": "abc", "b": true },  // is returned
  {"a": "bcd", "b": false },  // is returned
  {"a": "abc", "b": false },  // is not returned
]

(The inverse $[?match(@.a, 'a.*')!=@.b] is arguably more useful, but that aside.)

Importantly, this is only possible if the result of match() can be translated/lowered/casted/converted/coerced into the JSON literals true and false to be considered as comparable. (Again, see #389.)
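A minimal Python sketch of that scenario, assuming the result of match() is coerced to a JSON boolean. The `match` helper here is a stand-in built on Python's `re.fullmatch`, not an I-Regexp implementation, and the filter is written out by hand rather than parsed from the path.

```python
import re


def match(value, pattern):
    """Stand-in for match(): full-string regex match, returned as a Python bool."""
    return isinstance(value, str) and re.fullmatch(pattern, value) is not None


data = [
    {"a": "abc", "b": True},
    {"a": "bcd", "b": False},
    {"a": "abc", "b": False},
]

# Hand-written equivalent of $[?match(@.a, 'a.*')==@.b],
# assuming match() yields something comparable to JSON true/false.
selected = [item for item in data if match(item["a"], "a.*") == item["b"]]
```

Here `selected` holds the first two objects, the ones whose `b` correctly records whether `a` matches.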

I can see a use case for something similar in the test suite actually.

Let's assume we have an eval() function that takes a path and some data and returns the nodelist result of evaluating the path against the data. (We'll disregard any security or other practical issues of doing that for the purpose of this example.)

The test suite is represented as an object with a single key tests that contains an array of scenarios. Each scenario looks something like this:

{
  "name": "filter, existence, present with null",
  "selector" : "$[?@.a]",
  "document" : [{"a": null, "d": "e"}, {"b":"c", "d": "f"}],
  "result": [
    {"a": null, "d": "e"}
  ]
}

Now suppose we wanted to query this document for all of the test cases which are incorrect. We could use this path:

$.tests[?eval(@.selector, @.document)!=@.result]

Ideally, this would return an empty nodelist for a compliant implementation.

Something like this would be a great CI sanity check for the test suite.

(This exact path would imply that nodelists are comparable, which they're not, but we could work around that by introducing a nodelistsEqual() function or something. This problem doesn't apply to the $[?match(@.a, 'a.*')==@.b] example since that's attempting to compare things that are more "boolean"-y.)
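Outside of JSONPath, the same sanity check could be sketched in a few lines of Python. Everything named here is hypothetical: `failing_tests` plays the role of the filter, and `toy_evaluate` is a placeholder that understands only the one selector from the scenario above, standing in for a real implementation.

```python
def failing_tests(suite, evaluate):
    """Return the scenarios whose evaluated result differs from the recorded one."""
    return [
        test
        for test in suite["tests"]
        if evaluate(test["selector"], test["document"]) != test["result"]
    ]


def toy_evaluate(selector, document):
    """Placeholder evaluator: supports only the existence filter '$[?@.a]'."""
    if selector != "$[?@.a]":
        raise NotImplementedError(selector)
    return [item for item in document if isinstance(item, dict) and "a" in item]


suite = {
    "tests": [
        {
            "name": "filter, existence, present with null",
            "selector": "$[?@.a]",
            "document": [{"a": None, "d": "e"}, {"b": "c", "d": "f"}],
            "result": [{"a": None, "d": "e"}],
        }
    ]
}

failing = failing_tests(suite, toy_evaluate)
```

For this single scenario `failing` is empty, which is the outcome the path above is meant to confirm.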

glyn commented 1 year ago

An expression boolean (e.g. the result of @.a==2) is not a JSON value. This is something that we decided long ago. However Table 13 describes it as "Value(true) or Value(false)". This is wrong.

I'm not sure what "it" is referring to, but Table 13 says that BooleanType has abstract instances Value(true) and Value(false). Table 13 doesn't mention expression booleans.

gregsdennis commented 1 year ago

but Table 13 says...

You can't use Table 13 in your argument. The basis of my claim is that the table is wrong.

I'm not sure what "it" is referring to

"It" in my comment is the "expression boolean," or what we started calling TestBoolean.

glyn commented 1 year ago

I am trying to understand the second paragraph in the description of this issue. After substituting "expression boolean" for "it", the paragraph reads:

An expression boolean (e.g. the result of @.a==2) is not a JSON value. This is something that we decided long ago. However Table 13 describes [expression boolean] as "Value(true) or Value(false)". This is wrong.

I don't understand how Table 13 says anything at all about expression booleans. Please can you explain?

gregsdennis commented 1 year ago

... pedantry... okay.

It should be

However Table 13 describes Boolean as "Value(true) or Value(false)". This is wrong.

I'm saying Boolean is what we're now calling TestBoolean, which is not a JSON value.

glyn commented 1 year ago

I disagree that BooleanType represents a "test boolean". The spec makes this clear:

BooleanType is an abstraction of a primitive value that is either true or false.

The tricky thing, which is what I suspect we should be focussing on, is in the following:

A test expression either tests the existence of a node designated by an embedded query (see Section "Existence Tests") or tests the result of a function expression (see Section 2.6). In the latter case, if the function expression is of type OptionalBooleanType or one of its subtypes, it tests whether the result is true; [...]

Note that the special case in this paragraph only applies where the return type of the function expression is defined to be OptionalBooleanType (or one of its subtypes). The special case is not triggered by a return value being a boolean value (such as Value(true)). In other words, it can be determined before the query is executed.

cabo commented 1 year ago

I think this is an important point -- while JSON values are dynamically typed (there aren't really JSON types anyway), the OptionalBooleanType is statically confined to have three members: Nothing, Value(false), Value(true). So it is compatible with an OptionalValueType recipient, and with being used in a comparable. It is also compatible with being used in a test-expr.

cabo commented 1 year ago

(And the point is that the function expression type system is statically typed.)

gregsdennis commented 1 year ago

The spec makes this clear: - @glyn

You keep using the spec as evidence. I'm arguing that the spec is wrong. You can't use it as evidence.

I'm saying that the spec shouldn't define BooleanType this way.

It is also compatible with being used in a test-expr. - @cabo

How is OptionalBooleanType compatible with test-expr if it represents a JSON value? This is the argument in #389.

You're saying that a value of OptionalBooleanType (Nothing/true/false) can be used in a test-expr. This means that false can be used in a test-expr because false is OptionalBooleanType, which implies $[?false] is a valid expression.

We've already decided that this is an invalid expression, as linked above, so we have a contradiction.

The only way to resolve this contradiction is to make BooleanType (and by extension OptionalBooleanType) akin to #389's TestBoolean, not the JSON values true and false.

Taking this further, BooleanType cannot be a subtype of ValueType and by extension OptionalBooleanType cannot be a subtype of OptionalValueType. These are disjoint and operate under different roles in expressions.

Additionally, if BooleanType truly does represent JSON true and false, why make a type for it and not make NumberType and StringType? This definition is an incomplete type system.

Further, why make the type system stop at functions? We've defined these types for functions, but then they don't extend into the expressions in which they exist, when it's clearly beneficial to do so (even if it requires more work).

cabo commented 1 year ago

Hi Greg,

we now know that you want TestBooleans to be distinct from JSON value Booleans. The only argument that you present is that you want them to be different. This is fine; intuition often is useful in designing things. However, other people have the inverse intuition, so we won't get around discussing this with more technical arguments.

The purpose of the type system we have been designing is not to replace or augment the JSON type system (which doesn't exist, I'd argue, but that is a different discussion). The purpose is to be able to statically check how functions fit together, and how arguments for functions can be derived from JSONPath filter expressions and how returns from functions can feed back into JSONPath filter expressions.

It turns out we have three places where function expressions fit: In a test-expr, in a filter-path, and in a comparable. We wouldn't need to list filter-path here; for some reason we have made this one of the two things that can go into a function argument, along with comparable. This anomaly in the grammar can probably be fixed.

The current definition of the function expression type system is clouded by the fact that there is both a subtype relationship and a compatibility relationship (more specifically, a type can be used in place of another type, which involves coercion of the value).

We care about:

nodelists, which can also stand in for a test (by checking whether the nodelist is empty)
single nodes (which almost always need to be optional), which can stand in for a value
values, which can also be used as comparable; these are often, but not always Optional.

Functions like match or search return a special kind of optional value, which can be used in test expressions, and as a comparable, and as an argument for a function that expects just such an instance and/or a JSONValue. You don't want them to be used in comparables or as function arguments where a JSONValue can be used.

Why?

(Note that you can't argue from the grammar, as that doesn't express the function argument/return value type system. Of course, you can ask why literal true and literal false cannot be used in a logical expression, and I'd say that this is a little wart, but it also isn't particularly useful.)

gregsdennis commented 1 year ago

The only argument that you present is that you want them to be different.

No, I've presented quite a lot of logical evidence that shows they MUST be different. What we have now is rife with contradiction.

The purpose is to be able to statically check how functions fit together, and how arguments for functions can be derived from JSONPath filter expressions and how returns from functions can feed back into JSONPath filter expressions.

I agree with this stated purpose. We have failed to fulfil this purpose.

It turns out we have three places where function expressions fit: In a test-expr, in a filter-path, and in a comparable.

What we have failed to do is identify when it's appropriate for a function to appear in each of these places. An individual function needs to be identified as valid as either test-expr or (XOR if you prefer) comparable, but it can't be valid as both.

My evidence is match(), which (as currently defined) is capable of returning JSON true or false. These are comparables. Everyone is saying that match() can be used as a test-expr, but JSON true and false cannot be used as test-expr. (That's the contradiction.) Because match() has the potential to return these JSON values, the function MUST be restricted to locations where those values are valid, namely comparable.

This would mean that $[?match(@.timezone, 'Europe/.*')] MUST be invalid. If match() returns false here, it reduces to $[?false], which we have declared invalid in #180 in order to resolve an ambiguity.

How are you not seeing this very obvious contradiction?

gregsdennis commented 1 year ago

This would mean that $[?match(@.timezone, 'Europe/.*')] MUST be invalid. If match() returns false here, it reduces to $[?false], which we have declared invalid in #180 in order to resolve an ambiguity.

The solution to making $[?match(@.timezone, 'Europe/.*')] valid is to make it return what #389 calls a "TestBoolean," optional or not. But that means it's no longer valid as a comparable and $[?match(@.timezone, 'Europe/.*')==true] or even $[?match(@.timezone, 'Europe/.*')==@.a] become invalid.

cabo commented 1 year ago

Everyone is saying that match() can be used as a test-expr, but JSON true and false cannot be used as test-expr. (That's the contradiction.)

Note that the first half of this sentence is about the type system, and the second appears to be about the grammar (which just doesn't allow literals as test-expr). I tried to clean up the grammar (without changing it) over in #394 so that may become more obvious.

gregsdennis commented 1 year ago

Note that the first half of this sentence is about the type system, and the second appears to be about the grammar (which just doesn't allow literals as test-expr)

The grammar and the type system need to align.

cabo commented 1 year ago

Note that the first half of this sentence is about the type system, and the second appears to be about the grammar (which just doesn't allow literals as test-expr)

The grammar and the type system need to align.

That is a desirable, but there are some other desirables in conflict with that. The grammar can very well disallow confusing expressions that do not violate the type system.

gregsdennis commented 1 year ago

there are some other desirables in conflict with that.

Like what? What desirable can be so important that it overrides a logical contradiction?

gregsdennis commented 1 year ago

The grammar can very well disallow confusing expressions that do not violate the type system.

I'm saying the type system is allowing something that the grammar disallows.

glyn commented 1 year ago

You keep using the spec as evidence. I'm arguing that the spec is wrong. You can't use it as evidence.

What is obvious to you is far from obvious to me. My starting point is that the aspect of the spec being scrutinised by this issue is actually consistent and maybe just needs tweaking or explaining better. I've tried to point out what I believe to be the root cause of the confusion. (I am now essentially AFK for a while, so apologies for any delayed responses.)

cabo commented 1 year ago

The underlying requirement is that we want to convert Nodelists into Booleans when used in a test-expr (true if non-empty). We want to convert OptionalNodes into OptionalValues (by looking up the JSON value) when used in a comparable. We somehow need to map this into the function-expr type system as well. Since a function-argument says whether it is a nodelist or a value, this transfers one to one. Smearing up the type system by equating subtyping with coercion does not help, though. I would prefer to have the conversion from a Nodelist to an OptionalNode be explicit (possibly supported by a function that does just that). Then we don't have an unclear situation when a Nodelist is offered as an argument to a function that declares this as a Boolean.

gregsdennis commented 1 year ago

@glyn / @cabo, I think I have a good explanation of the problem.

A couple definitions and conventions

In this post, I will be using

angle brackets <> to denote a nodelist to distinguish it from a JSON array
true and false (code formatting) to indicate JSON literal values
true and false (italics) to indicate the operands of a logical operation

I would also like to use a couple definitions for expressions:

A comparison context is any that involves a comparison operator, like == and <.
A logical context is any that either can be used as the final result of an expression or involves a logical operator, !, &&, or ||.

These are the only two contexts that exist in our expressions.

Further, it's important to recognize that the result of a comparison is used in a logical context.

`@.a` as an existence test implies a distinction

We decided that @.a in a logical context was to be interpreted as an existence test. This means that if an item had an a property, regardless of its value, the result of @.a would be a logical true to select both items in

[
  { "a": true },
  { "a": false }
]

To illustrate why, consider an implementation that interprets the value at @.a as the result of the expression. It would consider the JSON true as logical true and select the first item, but it would consider JSON false as logical false and not select the second. With this behavior, there would be no way to select the node with false, so we decided this was to always be interpreted as an existence test.

This showed that JSON values, even true and false, only have meaning in a comparison context.

Thus we made a distinction between

the JSON literals true and false and
the logical values true and false.

We then codified this decision by making JSON literals (true, false, null) and other raw JSON values (strings, numbers) only valid in a comparison context using the ABNF.

The problem with `match()` and `search()`

These functions are defined to return OptionalBoolean, which is defined by Table 13 to be the JSON literals true and false and the value Nothing. Because they return JSON values, and following from the decisions above, they can only be used in a comparison context. That is, they must be compared to another JSON value using a comparison operator to yield a logical result; they cannot be used directly in a logical context.

But we also want these functions to be valid in both a comparison context and a logical context. This is the contradiction. The way they're defined, they cannot be used in both contexts.

My original suggestion was to have the functions return a logical true or false instead of the JSON values. However, this only reverses the contradiction rather than solving it.

This weekend, I thought about how to define them so that they work in both contexts. I started by analyzing a construct that already does: a path.

An analysis of `@.a`

@.a can exist in both a comparison context and a logical context, and it always returns a nodelist. How that nodelist is interpreted in each context determines the different behaviors it exhibits.

To illustrate, let's take a look at what happens for @.a in a comparison context with $[?@.a==false] (1) and in a logical context $[?@.a] (2) when evaluating

[
  { "a": false },
  { "b": "foo" }
]

Most importantly, @.a always returns <false> and <>, respectively, for these items, regardless of the context.

In a comparison context,
- the single value is extracted from <false> and ==-compared with false, and the node is selected.
- <> is converted to Nothing and ==-compared with false, and the node is not selected.
In a logical context,
- <false> contains a node, so it evaluates to logical true, and the node is selected.
- <> contains no nodes, so it evaluates to logical false, and the node is not selected.

@.a returns the same nodelist for both contexts because it's unaware of that context. It's just a path evaluating data.

While @.a always returns the same nodelist, how that nodelist is interpreted changes with context. Specifically,

in a comparison context, <false> is interpreted as JSON false.
in a logical context, <false> is interpreted as logical true.
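The two interpretations can be sketched in Python, modelling a nodelist as a list of values. This is a simplified illustration: `NOTHING`, `query_a`, `as_comparable`, `as_test`, and `json_equal` are hypothetical names, and the equality helper only handles what this example needs.

```python
NOTHING = object()  # sentinel standing in for Nothing


def query_a(item):
    """@.a: a nodelist with the value at key 'a', or an empty nodelist."""
    return [item["a"]] if isinstance(item, dict) and "a" in item else []


def as_comparable(nodelist):
    """Comparison context: extract the single value, or Nothing if empty."""
    return nodelist[0] if nodelist else NOTHING


def as_test(nodelist):
    """Logical context: existence test, true if the nodelist has any node."""
    return len(nodelist) > 0


def json_equal(left, right):
    """Simplified ==: Nothing equals only Nothing; the type check keeps
    Python's False == 0 from counting as equal."""
    if left is NOTHING or right is NOTHING:
        return left is right
    return type(left) is type(right) and left == right


data = [{"a": False}, {"b": "foo"}]

comparison = [i for i in data if json_equal(as_comparable(query_a(i)), False)]  # $[?@.a==false]
logical = [i for i in data if as_test(query_a(i))]                              # $[?@.a]
```

Both lists contain only the first item, but for different reasons: in `comparison` because the extracted value is JSON false, in `logical` because the nodelist is non-empty.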

Applying the analysis to `match()` and `search()`

In order to make these functions valid in both contexts, they need to return something that is valid in both contexts, and the only thing we have that works in both contexts is a nodelist, as demonstrated by @.a.

Let's perform our analysis again for the comparison context with $[?match(@.timezone, 'Europe/.*')==false] (1) and the logical context with $[?!match(@.timezone, 'Europe/.*')] (2). Theoretically, these should select the same nodes, namely nodes that DO NOT match.

Starting with the affirmative case where the function identifies a match, let's say that match() returns <true>.

In a comparison context, the single value is extracted from <true> and ==-compared with false, and the node is not selected.
In a logical context, <true> has nodes so is evaluated to logical true and negated by !, and the node is not selected.

Good so far. Both paths return the same thing for cases where a match is found.

Now, let's look at the negative case where the function does not identify a match. For this, let's say that the function returns <false>.

In a comparison context, the single value is extracted from <false> and ==-compared with false, and the node is selected.
In a logical context, <false> has nodes so is evaluated to logical true and negated by !, and the node is NOT selected.

Uh, oh. We have differing results. This doesn't work.

Let's try returning <> for the negative case instead.

In a comparison context, <> is converted to Nothing and ==-compared with false, and the node is NOT selected.
In a logical context, <> has no nodes so is evaluated to logical false and negated by !, and the node is selected.

Hm... that doesn't work either.

In fact, there is NO SINGLE VALUE that match() can return to make these paths select the same nodes for the non-matching case.

The only recourse, then, is to restrict match() to only one of:

comparison context: $[?match(@.timezone, 'Europe/.*')==false]
logical context: $[?!match(@.timezone, 'Europe/.*')]

match() and search() can't support both.

Doing this would mean that when match() appears outside of its decided context, it will not be considered well-formed, requiring a parsing error (or compilation error, if you prefer).

Note that this doesn't preclude some other function from returning a nodelist in a manner that is consistent between contexts.
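The case analysis above can be checked mechanically. The sketch below assumes the same interpretation rules as the analysis (value extraction in a comparison context, existence test in a logical context) and only considers the three candidate nodelists discussed; the helper names are hypothetical.

```python
NOTHING = object()  # sentinel standing in for Nothing


def comparison_selects(nodelist):
    """$[?match(...)==false]: extract the single value (or Nothing), compare to JSON false."""
    value = nodelist[0] if nodelist else NOTHING
    return value is False


def logical_selects(nodelist):
    """$[?!match(...)]: existence test on the nodelist, then negate."""
    return not (len(nodelist) > 0)


# Candidate return values for the non-matching case.
candidates = {"<true>": [True], "<false>": [False], "<>": []}

outcomes = {
    name: (comparison_selects(nodelist), logical_selects(nodelist))
    for name, nodelist in candidates.items()
}

# A candidate works only if both paths select the non-matching node.
works = [name for name, (cmp_sel, log_sel) in outcomes.items() if cmp_sel and log_sel]
```

`works` comes out empty: `<false>` satisfies only the comparison path, `<>` only the logical path, and `<true>` satisfies neither.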

glyn commented 1 year ago

@gregsdennis thanks for the detailed analysis. The current spec defines a special case which produces different behaviour. Section 2.5.5 says (emphasis added):

A test expression either tests the existence of a node designated by an embedded query (see Section "Existence Tests") or tests the result of a function expression (see Section 2.6). In the latter case, if the function expression is of type ST(OptionalBooleanType) (see Section 2.6.1), it tests whether the result is true; if the function expression is of type ST(OptionalNodesType), it tests whether the result is different from Nothing.

So, when a function returning an OptionalBoolean (or one of its subtypes) is used in a test expression, this is not treated as an existence test. Instead, the value returned by the function is compared against true. So, for the three possible return values, Value(true) yields true whereas Value(false) and Nothing yield false.

On the other hand, when a function returning an OptionalBoolean (or one of its subtypes) is used in a comparison, the value returned by the function is compared to the other side of the comparison.

Thus $[?match(@.timezone, 'Europe/.*')==false] and $[?!match(@.timezone, 'Europe/.*')] produce identical nodelists.

The crucial thing to note, and which avoids the above contradiction, is that a function returning an OptionalBoolean (or one of its subtypes) is treated differently depending on where it is used. This difference is determined at parse/compile time. During execution, the return value of the function is treated accordingly.
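A small sketch of that rule, tabulating how each of the three possible return values is handled in the two positions. The helper names are hypothetical, and the treatment of Nothing in a comparison (it does not compare equal to false) is an assumption taken from the behaviour described earlier in this thread.

```python
NOTHING = object()  # sentinel standing in for Nothing


def truth_test(result):
    """Test expression on an OptionalBoolean result: true only for Value(true)."""
    return result is True


def equals_false(result):
    """Comparison with JSON false; Nothing is assumed not to equal false."""
    return result is False


returns = {"Value(true)": True, "Value(false)": False, "Nothing": NOTHING}

# For each return value: (selected by $[?!match(...)], selected by $[?match(...)==false])
table = {
    name: (not truth_test(value), equals_false(value))
    for name, value in returns.items()
}
```

For Value(true) and Value(false) the two forms agree. Under the assumption above, they would differ only if the function returned Nothing.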

gregsdennis commented 1 year ago

I appreciate that the spec says this explicitly.

However, I maintain that this special case needs to be removed.

a function returning an OptionalBoolean... is treated differently depending on where it is used.

This contextual difference in behavior represents an inconsistency in the overall grammar of expressions. Why are functions treated specially? The underlying mechanics should operate the same, and they don't.

The spec very explicitly says that a JSON true/false is not valid in a logical context... unless it comes from a function. Why?! Why special-case functions?

A function (or any expression component) should only be valid where its return type is valid, that is, where a value of that return type is valid. I should be able to replace the function with a value of its return type and still have a valid expression. But if I do that for $[?!match('ab', 'a.*')], I get $[?true] which is explicitly invalid. Therefore we have an inconsistent grammar. That statement from the spec you quoted is merely a bandage trying to cover up a more serious problem. In order to have a consistent grammar, this sort of value substitution MUST work.

Finally, as the only person who has even attempted to implement this, I find it confusing and difficult, which should be more than enough reason to change it.

cabo commented 1 year ago

On 14 Feb 2023, at 20:53, Greg Dennis wrote:

I appreciate that the spec says this explicitly. However, I maintain that this special case needs to be removed.

We don’t agree.

a function returning an OptionalBoolean... is treated differently depending on where it is used. This contextual difference in behavior represents an inconsistency in the overall grammar of expressions. Why are functions treated specially? The underlying mechanics should operate the same, and they don't.

They are not treated specially. It is just a case that only applies to functions.

The spec very explicitly says that a JSON true/false is not valid in a logical context…

No. It doesn’t provide a way to put JSON literals there, so the issue only occurs for function returns.

unless it comes from a function. Why?! Why special-case functions?

(Because it is the only case.)

A function (or any expression component) should only be valid where its return type is valid,

(If talking about the type system:) Yes. Maybe we should say “well-typed” instead of “valid”...

that is, where a value of that return type is valid. I should be able to replace the function with a value of its return type and still have a valid expression.

Where you can’t notate that value, this is moot.

But if I do that for $[?!match('ab', 'a.*')], I get $[?true] which is explicitly invalid.

It is not invalid, it is not well-formed.

Therefore we have an inconsistent grammar.

I don’t follow.

That statement from the spec you quoted is merely a bandage trying to cover up a more serious problem.

You could say that. The original problem is that @.a is not saying whether it talks about the node(s) or about the value(s) there. We have made sure this is well-defined in all cases: in a test-expr, it means (the existence) of nodes, and in a comparison-expr (where we restrict the paths to singular ones) it means the value (or Nothing if no node). Function expressions straddle this boundary, so something needed to be done. Hence the type system, which is not about values.

In order to have a consistent grammar, this sort of value substitution MUST work.

I don’t follow.

Finally, as the only person who has even attempted to implement this, I find it confusing and difficult, which should be more than enough reason to change it.

Maybe we can learn from this exercise that a mental model that conflates values and types doesn’t work too well here.

Grüße, Carsten

gregsdennis commented 1 year ago

They are not treated specially. It is just a case that only applies to functions.

THAT'S WHAT A SPECIAL CASE IS! A special case is precisely a case that only applies to one thing.

It doesn’t provide a way to put JSON literals there

Precisely! JSON literals can't go there because the values are invalid there.

(Because it is the only case.)

That's not a reason. You're saying functions are special-cased because they're special cases. They shouldn't be. They should be treated like everything else in the expression. Special casing yields an inconsistent grammar.

Therefore we have an inconsistent grammar.

I don't follow

It's inconsistent because the same grammar can behave differently. $[?@.a] and $[?func(@)] have the same grammar: @.a and func(@) both act in a logical capacity (i.e. test-expr). They should behave the same: return a nodelist which is analyzed for contents. But they don't; they are inconsistent in their behavior.

There's no reason for them to behave differently except that the spec says that they do.

When evaluating an expression such as a && b==c, I must first analyze the components, a, b, and c. From those, I get values which can be operated on by && and ==. Those values MUST be of the correct type for those operators. The operators, though, have no knowledge of what produced the values.

But you're saying that && needs to know if a is a function or a path, and depending on which, it changes its behavior. That's inconsistent behavior. && should do one thing, and it should do that thing without context.

that is, where a value of that return type is valid. I should be able to replace the function with a value of its return type and still have a valid expression.

Where you can’t notate that value, this is moot.

But if I do that for $[?!match('ab', 'a.*')], I get $[?true] which is explicitly invalid.

It is not invalid, it is not well-formed.

For the purpose of this discussion, "valid" means that it doesn't produce an error and can be evaluated. I'm not making a distinction between being ABNF-valid and well-formed.

I don't understand what you mean by "notating" the value. The example I gave shows that what would be considered a valid expression ($[?!match('ab', 'a.*')]) is actually invalid when you consider the return type $[?true].

The original problem is that @.a is not saying whether it talks about the node(s) or about the value(s) there. We have made sure this is well-defined in all cases: in a test-expr, it means (the existence) of nodes, and in a comparison-expr (where we restrict the paths to singular ones) it means the value (or Nothing if no node).

The problem is that, within a single context, you're defining one behavior for a path and another for a function. Functions should behave the same as paths in the same contexts.

Hence the type system, which is not about values.

The spec disagrees with you: "A type is a set of instances." The type system is very much about values.

(Note that the spec doesn't define "instance." This section is the only place the word is used.)

In order to have a consistent grammar, this sort of value substitution MUST work.

I don’t follow.

This is the example I gave.

match('ab', 'a.*') always finds a match. According to the spec, it should return JSON true. Thus the expectation is that $[?!match('ab', 'a.*')] will return a nodelist containing all of the nodes of the subject data.

Because the return value is always the same, I should be able to substitute the return value into the expression and yield another valid (and well-formed) expression that yields the same result. For example, in a && b == b, I can substitute a logical true (not JSON true) for b==b and get a && _true_.

Performing that substitution on match('ab', 'a.*') yields $[?true] (that's the JSON literal true). This is an invalid expression. It produces an error.

Because $[?!match('ab', 'a.*')] and $[?true] do not produce the same outcome, we have an inconsistency.

It follows that if we are to have a consistent grammar and $[?true] is invalid (explicitly via the ABNF), $[?!match('ab', 'a.*')] must also be invalid.

This kind of substitution exists in every programming language and every mathematical convention I've ever seen. It is the very definition of logical consistency. It is foolish of us to define JSON Path expressions in a way that operates in a different manner.

Maybe we can learn from this exercise that a mental model that conflates values and types doesn’t work too well here.

As mentioned, I'm not conflating values and types. The spec clearly defines a type as a set of instances/values.

I understand the difference. To accuse me of misunderstanding this is rude, especially knowing that I come from .NET and C#, which are very strictly typed.

You're very clearly holding onto keeping this for some reason despite the many logical arguments I've made against it. I don't understand why. Is it pride? Is it unwillingness to change? Is it laziness?

For the benefit of the specification and its wide adoption, I urge you to actually consider my arguments and remove this inconsistency from the spec.

gregsdennis commented 1 year ago

☝️ these are the two options I present. I don't see a way to have a consistent typing system and allow match() and search() in both logical and comparative contexts.

The other option I can see is just doing away with the typing system altogether and just checking syntax (and maybe argument count). Then functions can return whatever they like and the expression evaluation system will just run. If it encounters something invalid, then it doesn't select that node, just like it would if @.a returned a string in $[?@.a==42].
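To make that second option concrete, here is a minimal Python sketch of evaluation without a static type check. It is not taken from the draft: `NOTHING`, `compare_eq`, `select`, and the sample data are hypothetical names, and the rule that two absent values compare equal is an assumption about how the draft treats empty results. A comparison between mismatched kinds simply yields false, so the node is not selected and no error is raised.

```python
# Hypothetical sketch: no static typing; a failed comparison just deselects the node.
NOTHING = object()  # stand-in for "no value at this path" (assumption)


def compare_eq(left, right):
    """Equality without coercion: values of different JSON kinds are never equal."""
    if left is NOTHING or right is NOTHING:
        # Assumed rule: only "absent == absent" is true.
        return left is right
    # bool is a subclass of int in Python, so keep JSON true/false apart from numbers.
    if isinstance(left, bool) != isinstance(right, bool):
        return False
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return left == right
    if type(left) is not type(right):
        return False
    return left == right


def select(nodes, key, expected):
    """Emulates $[?@.<key>==<expected>] over a list of JSON objects."""
    return [n for n in nodes if compare_eq(n.get(key, NOTHING), expected)]


data = [{"a": 42}, {"a": "42"}, {"a": True}, {"b": 1}]
selected = select(data, "a", 42)
print(selected)
```

Only the object whose `a` is the number 42 survives; the string, the JSON true, and the missing member are all silently skipped rather than reported.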

gregsdennis commented 1 year ago

@cabo, you have refused to acknowledge that there is anything wrong with the current solution of special-casing functions in test expressions, and yet you have also failed to explain why such a special case is acceptable. The only argument you have presented is to quote the spec and say, "The document provides a solution for this." As I'm arguing that the document (and therefore the solution) is wrong, the document cannot be used as evidence.

There is a logical and syntactical inconsistency in the solution that the document presents ("in the document"). I have repeatedly offered multiple analyses, explanations, logical arguments, and examples detailing why the current solution is inconsistent and contradictory. Yet it seems you're not reading them. You have not in any way refuted my objections with why my analyses are incorrect or how the current solution is consistent and non-contradictory (I re-read the issue to be sure), yet you continue to oppose me.

I ask you to provide such reasoning or stand down and accept one of my proposals.

cabo commented 1 year ago

I ask you to provide such reasoning

Yes. Of course.

I had problems responding to your critique because it doesn't seem to me to be aligned with the document. So the logical next step is to improve the document editorially, and then see whether your critique still applies.

Unfortunately, I don't have much time this week, as I'm part of a small group that teaches an intensive course. I can't really ask the others to do my work while I tend to this. So I ask for a little more patience.

timbray commented 1 year ago

Co-chair hat on.

Clearly we are having a communication problem here. Greg is making two claims:

1. The language/meaning of the spec contains unacceptable special-casing and/or inconsistency.
2. The current version of the spec is unreasonably hard to implement.

On (1.) he has failed to convince the editors; Carsten has suggested that there is room for editorial improvements to clear the air.

My notes:

I'm inclined to give the editors a chance to see if they can reduce the discomfort and find us a path forward.
I confess that after repeated readings of the spec and the discussion, I'm finding it hard to figure out what is what. In particular I find the language describing the type system to be really opaque.
I see no particular reason why the input type repertoire and output type repertoire for functions need to be identical. See my other issue about the return values for match and search. In particular I suspect that strictly limiting the output type repertoire might simplify the lives of spec readers and implementors too. Options such as limiting return types to JSON primitive types (true, false, number, string, null) or alternately a strictly-constrained set of node-lists may be worth considering?
I'm probably more concerned about Greg's concern (2.), that the current spec is hard to implement. I have painful personal experience with beautifully crafted specs that are unreasonably hard to code up. Greg, perhaps a little more on the difficulty would be useful? It does seem to me that any implementation is going to have some special-casing around the functions, since they are a different kind of beast than the rest of the spec.
If we are unable to come to rough consensus on how to specify the function-extension framework, one plausible path forward is simply to remove it. I think that would be well within the scope of our charter.

gregsdennis commented 1 year ago

@timbray thank you for taking the time to review this issue.

I'm inclined to give the editors a chance to see if they can reduce the discomfort and find us a path forward.

I've been asking for this and haven't received anything except "the spec says you're wrong." (See below)

I see no particular reason why the input type repertoire and output type repertoire for functions need to be identical.

I agree with your statement, but aligning function input with function output is not what this discussion is about.

This is about how functions should only be valid where values of their return types would be valid. E.g. A function which returns a JSON true or false should only be valid where a JSON true or false literal would be valid. This makes for a consistent type system. For the current document, this is not the case.

Greg, perhaps a little more on the difficulty would be useful?

I've tried to explain it above by describing how typed systems should work, but maybe I can rewrite my arguments from an implementation point of view. I'll work on this and post back.

If we are unable to come to rough consensus on how to specify the function-extension framework, one plausible path forward is simply to remove it.

I might be happy removing the type system (I still need to explore that), but I think we need to keep functions. And that means that we still need to address what happens when a function appears in a place it's not (shouldn't be) expected, like a function that returns a JSON true or false appearing as an operand for && when JSON true or false themselves cannot appear there.

As mentioned in your #400, we need functions to fill in the gaps for "traditional" syntaxes that don't fit within the spec's syntax. You're right, though: removing functions would make this issue moot, but we'd also lose some of the functionality that users are used to when working with JSON Path.

To address the general response that I've received, I need to remove JSON Path from the argument and make the dispute absurd.

Suppose we have the image below, and we want to write a document that describes how to reproduce this image.

There are two scenarios I want to explore:

The document says, "The sky is red." I see this and open an issue saying that the specification is wrong, that it should say that the sky is blue. In this scenario, I am correct in saying the spec is wrong.
The document says, "The sky is blue," and I open an issue saying that the specification is wrong and suggest that it should say that the sky is red. In this scenario, I am incorrect.

In both cases, I lay out an argument.

And in both cases, the response should be, "Let's check the source image," not "No, the spec says that the sky is [color]. It's fine the way it is." We check the source image, and the issue is resolved.

My point is that if I'm saying the spec is wrong, the spec can't be used as evidence to say that the spec is right. The evidence has to come from an external source.

cabo commented 1 year ago

I've been asking for this and haven't received anything except "the spec says you're wrong." (See below)

The editors did not have much time in the last week, so that's why you didn't get a detailed explanation. Responding to each of these reminders doesn't increase the speed at which I can work. I'll submit a PR with an improved description of function expressions that should make the whole thing moot. But I can't do this in zero time. I do have empathy for your impatience.

This is about how functions should only be valid where values of their return types would be valid. E.g. A function which returns a JSON true or false should only be valid where a JSON true or false literal would be valid.

We can choose to do so, we can choose not to do so. The only reason you have given so far is that you want that to be the case. I believe that allowing JSON literals in the test-expr grammar would be very confusing. But we could. It just isn't needed.

If we are unable to come to rough consensus on how to specify the function-extension framework, one plausible path forward is simply to remove it.

I might be happy removing the type system (I still need to explore that), but I think we need to keep functions.

It is indeed hard to introduce a useful function extension without a type system. The requirement on the type system I'm trying to fulfill is that it meshes with the properties of the non-extended parts of JSONPath, in particular that we can examine well-formed expressions whether they also are well-typed, and that this can be done independently of the actual JSON data that will be fed to the expression. This works best with a type system that is entirely static.

And that means that we still need to address what happens when a function appears in a place it's not (shouldn't be) expected,

Yes.

like a function that returns a JSON true or false appearing as an operand for && when JSON true or false themselves cannot appear there.

No. Again, that equivalence is a preference from you that I don't share. I'm not discussing your blue sky, because we do agree that you do have that preference. We just don't agree about that preference.

If you have a little time left, can you please check #399? I don't think Glyn is getting to this.

gregsdennis commented 1 year ago

The editors did not have much time in the last week, so that's why you didn't get a detailed explanation.

Understood. I await your detailed comments.

Responding to each of these reminders doesn't increase the speed at which I can work.

There was no such reminder until recently.

The only reason you have given so far is that you want that to be the case.

No, I have given several very detailed logically reasoned explanations backing up my case.

I believe that allowing JSON literals in the test-expr grammar would be very confusing.

Yes, it would. I'm not advocating for this.

I'm advocating that because "allowing JSON literals in the test-expr grammar would be very confusing," we also shouldn't allow functions that return JSON literals in test-expr grammar.

The requirement on the type system I'm trying to fulfill is that it meshes with the properties of the non-extended parts of JSONPath, in particular that we can examine well-formed expressions whether they also are well-typed, and that this can be done independently of the actual JSON data that will be fed to the expression. This works best with a type system that is entirely static.

I 100% agree with this.

Again, that equivalence is a preference from you that I don't share.

This is not a preference. It is an aspect of typed systems in general. Functions may only appear where their return values may appear.

glyn commented 1 year ago

I'm AFK until Tuesday, but I wonder if Greg's concerns would be addressed if it was possible to support referential transparency where a function could be replaced, per call (tricky to implement in general), with (a concrete representation of) its return value without changing the result of the query.

cabo commented 1 year ago

I'm AFK until Tuesday, but I wonder if Greg's concerns would be addressed if it was possible to support referential transparency where a function could be replaced, per call (tricky to implement in general), with (a concrete representation of) its return value without changing the result of the query.

(1) We would need literals for the whole type system. We can't notate Nothing or other node lists at the moment. (2) We would need to tag the literal with the function return type that is intended.

glyn commented 1 year ago

Not sure we would need all those literals. Suppose there was a node whose value was the same as the return value of a function. One referential transparency rule would be that replacing the function call with the path results in the same query result.

But I take your point about possibly needing to tag values with types. This makes me wonder how we position the type system as a natural extension of the syntactic rules for non-function expressions.

gregsdennis commented 1 year ago

I agree that we don't need representation of all values in the syntax. For example, JSON arrays and objects are values that are not allowed in the syntax.

Here's how I see the type landscape. We have three types: values, booleans, and nodelists.

Values

Values include any and all JSON values: arrays, objects, strings, numbers, and the three literals true, false, and null. We don't need any more precise typing. These can all be considered "values."

Also included as a value is Nothing, which we designated to represent the absence of a value, e.g. the value of bar in the object { "foo": 42 }.

Strings, numbers, and the three JSON literals can be expressed in the syntax. JSON objects, JSON arrays, and Nothing have no representation in the syntax.

Booleans

Booleans are the result of any operation, e.g. ==, <, or &&. Expressions in their entirety must also result in a boolean. There are two states (I'll use "state" to not confuse it with "value") for a boolean: true and false.

The relation operators, == and < et al., take two values (from above) as arguments and return a boolean.

The logical operators, &&, ||, and !, take two (one for !) booleans as arguments and return a boolean.

There is no relationship between the boolean states true and false and the JSON literals true and false.

There is no representation for booleans in the syntax.

Nodelists

Nodelists result from path evaluation.

There is no representation for nodelists in the syntax.

Conversion to a Value

A nodelist is convertible to a Value if it has at most a single node.

If a nodelist contains a node, the result is the value of that node.
If a nodelist contains no nodes, the resulting value is Nothing.

To enable syntax checking for paths in expressions, we have defined "Singular Path" as an identifiable syntax to ensure that a path can return at most one node and thus may be converted to a Value.

Conversion to a Boolean

A nodelist is always convertible to a Boolean. The resulting state is whether it contains nodes:

true if it contains nodes
false otherwise

Setting up the type system this way covers the entire expression syntax.

If we use V for a value, B for a boolean, and SNL and MNL for a single-node nodelist and a multiple-node nodelist respectively, the following are all valid expressions.

V == V
B && B
B || V != V
B
!B
!B && V < V
SNL == V
MNL && B
SNL && B
MNL
SNL

The following are not valid expressions.

V
B && V
B == V
!V
MNL == V

Note that values and booleans each have their place in the syntax, and one cannot be substituted for the other because there is no mapping between them.

This is all well-formed and consistent.
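The valid and invalid lists above can be checked mechanically. The following Python sketch is only an illustration of the rules as I've stated them, not anything from the document: `check`, `VALUE_LIKE`, and `LOGICAL_LIKE` are made-up names, the input is assumed to be well-formed, and parentheses are not handled.

```python
import re

# Kind symbols and operators; "!=" is listed before "!" so it is matched first.
TOKEN = re.compile(r"SNL|MNL|V|B|&&|\|\||==|!=|<=|>=|<|>|!")
COMPARISONS = {"==", "!=", "<=", ">=", "<", ">"}
VALUE_LIKE = {"V", "SNL"}            # usable as a comparison operand
LOGICAL_LIKE = {"B", "SNL", "MNL"}   # usable as a test or logical operand


def check(text):
    """Return True if the kind-level expression is well-typed as a whole filter."""
    tokens = TOKEN.findall(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_basic():
        # Either "!" applied to a logical operand, a comparison, or a bare test.
        if peek() == "!":
            take()
            return take() in LOGICAL_LIKE
        left = take()
        if peek() in COMPARISONS:
            take()
            right = take()
            return left in VALUE_LIKE and right in VALUE_LIKE
        return left in LOGICAL_LIKE

    def parse_and():
        ok = parse_basic()
        while peek() == "&&":
            take()
            ok = parse_basic() and ok  # always parse the operand, then combine
        return ok

    def parse_or():
        ok = parse_and()
        while peek() == "||":
            take()
            ok = parse_and() and ok
        return ok

    result = parse_or()
    return result and pos == len(tokens)


VALID = ["V == V", "B && B", "B || V != V", "B", "!B", "!B && V < V",
         "SNL == V", "MNL && B", "SNL && B", "MNL", "SNL"]
INVALID = ["V", "B && V", "B == V", "!V", "MNL == V"]

for expr in VALID + INVALID:
    print(f"{expr:14} {'valid' if check(expr) else 'invalid'}")
```

The two sets `VALUE_LIKE` and `LOGICAL_LIKE` carry the whole argument: V and B never share a position, and only nodelists can sit on both sides of the divide.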

Now, when we add functions, we need to add them in such a way that they fit into this system. The way to ensure that is to type the function based on its return. This way, we know where in the expression the function is valid.

If a function returns a value, it may only be used where a value may be used, i.e. relations.
If a function returns a boolean, it may only be used where a boolean may be used, i.e. logical operations and entire expressions.
If a function returns a nodelist, it may only be used where a nodelist may be used, i.e. anywhere.

While the first two are fairly straightforward, a function returning a nodelist presents an inconsistency: a function returning a nodelist can appear in an expression as a value only if it contains a single node, but we have no syntactic way to ensure that the nodelist it returns has at most a single node.

To address this inconsistency, we need to update our conversion rules for nodelists. Instead of:

A nodelist is convertible to a Value if it has at most a single node.

If a nodelist contains a node, the result is the value of that node.

If a nodelist contains no nodes, the resulting value is Nothing.

We now use:

A nodelist is always convertible to a Value.

If a nodelist contains a single node, the result is the value of that node.

If a nodelist contains no nodes or multiple nodes, the resulting value is Nothing.

We also keep the singular path requirement in the syntax because identifying paths that are guaranteed to result in nodelists of length zero or one is easy.

This also means that MNL == V is now valid, but because of the singular path requirement it can only occur if the MNL is returned by a function.

Finally, because of the conversion rules from nodelist to value and from nodelist to boolean, the same function return may behave differently (and perhaps unexpectedly) when a nodelist function appears as value vs as a boolean.

Function return	Converted Value	Converted Boolean
empty nodelist	`Nothing`	false
single-node nodelist	the node's value	true
multiple-node nodelist	`Nothing`	true

I believe this is okay but I advise that we explicitly call it out as a note, possibly with an example as well.
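As a possible example for such a note, the two conversions can be sketched in a few lines of Python. This is an illustration of the rules proposed in this comment only; `Nothing`, `to_value`, and `to_boolean` are hypothetical names, and a nodelist is modelled as a plain list of node values.

```python
class _Nothing:
    """Marker for the absence of a value (hypothetical stand-in for Nothing)."""

    def __repr__(self):
        return "Nothing"


Nothing = _Nothing()


def to_value(nodelist):
    """Proposed rule: exactly one node gives its value; zero or many give Nothing."""
    if len(nodelist) == 1:
        return nodelist[0]
    return Nothing


def to_boolean(nodelist):
    """Existence test: true if the nodelist contains any node."""
    return len(nodelist) > 0


for returned in ([], [7], [7, 8]):
    print(returned, "->", to_value(returned), "/", to_boolean(returned))
```

The last row is the case worth calling out: a multiple-node result is Nothing when used as a value but true when used as a test.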

gregsdennis commented 1 year ago

(Note that this ☝️ removes all of the "optional" types as well.)

gregsdennis commented 1 year ago

@timbray here's what I came up with regarding my experience trying to implement the current document.

My implementation

In implementing expressions, I developed a typed binary tree structure with three kinds of nodes:

Value nodes - These represent values and can only be used in comparisons.
Logic nodes - These represent logical outcomes and come in two flavors:
- Comparison nodes - These take two values and an operator.
- Logical operation nodes - These take one or two logical outcomes and an operator.
Nodelist nodes - These represent nodelists and (prior to functions) are only returned by paths.

(Note, this isn't actually my code, but it's similar enough to what I have.)

class ExpressionNode {}

class ValueNode : ExpressionNode
{
    public JsonValue? Evaluate(...) {...}
}

class LogicNode : ExpressionNode
{
    public bool Evaluate(...) {...}
}

class NodelistNode : ExpressionNode
{
    public NodeList GetValue() {...}
}

class PathNode : NodelistNode
{
    public JsonPath Path;
}

class ComparisonNode : LogicNode
{
    public ValueNode Left;
    public ValueNode Right;
    public ComparisonOperator Operator;
}

class BinaryLogicNode : LogicNode
{
    public LogicNode Left;
    public LogicNode Right;
    public BinaryLogicalOperator Operator;
}

class UnaryLogicNode : LogicNode
{
    public LogicNode Operand;
    public UnaryLogicalOperator Operator;
}

Functions need to fit into this structure, too, so we create function types for each type of node.

class ValueFunctionNode : ValueNode
{
    public ExpressionNode[] Arguments;
}

class LogicFunctionNode : LogicNode
{
    public ExpressionNode[] Arguments;
}

class NodeListFunctionNode : NodelistNode
{
    public ExpressionNode[] Arguments;
}

With this, I can now represent expressions like @.a || !foo(b) && c==bar(d).

- ||          // BinaryLogicNode
  - @.a       //   PathNode
  - &&        //   BinaryLogicNode
    - !       //     UnaryLogicNode
      - foo   //       LogicFunctionNode
        - b   //         ValueNode
    - ==      //     ComparisonNode
      - c     //       ValueNode
      - bar   //       ValueFunctionNode
        - d   //         ValueNode

As you can see, everything is typed nicely. Everything gets what it expects. Logic nodes get other logic nodes, and comparisons get values.

Now, according to the current document, match() and search() should return a boolean:

2.6.5. match Function Extension

Arguments:

OptionalNodeOrValue (string)

Value (string conforming to [I-D.draft-ietf-jsonpath-iregexp])

Result:

OptionalBoolean (true, false, or Nothing)

which means that with my model, they can't appear in a comparison. But the document shows examples where it's being used both in a test expression and in a comparison:

Query	Comment
`$[?match(@.timezone, 'Europe/.*')]`	Valid typing
`$[?match(@.timezone, 'Europe/.*') == true]`	Valid typing

Something is off. I considered that my model was wrong for a while, but I've made many similar expression-parsing apps previously (it has been a special project of mine since high school), and I've never run into a case where a value makes sense where logic should be. So I started to explore the type system more deeply. The many comments above have resulted from this exploration.

Initial investigation

This is certainly the most loosely typed system I've ever parsed, but it's still typed, so it should behave accordingly.

In JSON Path, we say that "42"==42 is valid and returns false because the string "42" and the number 42 are both typed as "values" and those values are not equal. (In C#, they're not even comparable because they're different types, and compilation will fail.) However, (1==1)==true (which is perfectly legal in C#) is not valid for JSON Path because (disregarding the ABNF; I'm just discussing the theory here) the logical produced by (1==1) is not a "value" and can't be used with the == operator. We encoded this into the ABNF because of this reasoning.

So this is the first problem (and what this issue was created for): if a Boolean is to be used as a "logical", it cannot be a subtype of Value. They cannot be relatable at all because we decided previously that JSON true and false literals were merely constant values, not "logicals."

This also applies to functions. If a function is defined to return a Boolean, then it can only appear in an expression where a "logical" would appear. Conversely, if a function is defined to return a Value, then it can only appear in an expression where a "value" would appear. This means $[?match(@.timezone, 'Europe/.*') == true] must be invalid.

The same logic applies, though in reverse, if you take Boolean to mean "JSON true or false," and $[?match(@.timezone, 'Europe/.*')] is the invalid one in this case.

So we can't have both.
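The same point in code form, as a rough Python sketch. None of this comes from the document: `Kind`, the two registries, and the helper functions are hypothetical, and nodelist-returning functions are assumed to be usable in either position. Whichever way match() is typed, one of the two example queries has to be rejected.

```python
from enum import Enum


class Kind(Enum):
    VALUE = "value"        # JSON values, including JSON true/false
    LOGICAL = "logical"    # expression booleans
    NODELIST = "nodelist"


# Two hypothetical readings of what match() returns.
MATCH_IS_LOGICAL = {"match": Kind.LOGICAL, "length": Kind.VALUE}
MATCH_IS_JSON_BOOL = {"match": Kind.VALUE, "length": Kind.VALUE}


def allowed_in_test(name, registry):
    """May the call stand alone as a test expression, e.g. $[?match(...)]?"""
    return registry[name] in (Kind.LOGICAL, Kind.NODELIST)


def allowed_in_comparison(name, registry):
    """May the call be a comparison operand, e.g. $[?match(...) == true]?"""
    return registry[name] in (Kind.VALUE, Kind.NODELIST)


for label, registry in (("logical", MATCH_IS_LOGICAL), ("JSON bool", MATCH_IS_JSON_BOOL)):
    print(label,
          "| test:", allowed_in_test("match", registry),
          "| comparison:", allowed_in_comparison("match", registry))
```

Under either registry exactly one of the two positions is allowed for match(), which is the "can't have both" conclusion.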

Another reason we can't have both

Traditionally, a "function" takes a number of arguments and deterministically and in a context-free manner produces the same result.

The document, however, says this about a function that appears in a test expression:

if the function expression is of type OptionalBoolean or one of its subtypes, it tests whether the result is true

and also this:

Boolean is an abstraction of a primitive value that is either true or false.

This means that the function changes behavior depending on the context, meaning it's not really a "function."

If you want to claim that there is an implicit conversion from the true or false that the function returns to the "logical" required by the existence test, then it follows that the literals themselves should also receive that implicit conversion, meaning $[?true] should be valid, and we already decided that it's not for the reasons I mentioned before.

Summary

It always comes down to an intentional separation between "logicals" and JSON true/false, and the type system as it exists tries to blur that line.

If we want to re-hash that discussion, I'm sure we'll come to the same conclusion that they are not the same, nor can they be converted between. To do so would introduce ambiguities and remove functionality. (This was the reasoning behind the initial decision.)
