ietf-wg-jsonpath / draft-ietf-jsonpath-base

Development of a JSONPath internet draft
https://ietf-wg-jsonpath.github.io/draft-ietf-jsonpath-base/

Behavior for semantically invalid expressions #212

Closed gregsdennis closed 2 years ago

gregsdennis commented 2 years ago

What is the expected behavior for expressions which are not semantically valid?

For example, < is only defined for numbers. So what if someone does

$[?(@.foo < 'bar')]

Is this a parse error? Does it just evaluate to false for all potential @.foo values?

This kind of thing will be important to define if/when we decide to support more complex expressions (e.g. with mathematic operators) or inline JSON values such as objects or arrays.

gregsdennis commented 2 years ago

Well, I chose Erlang because their total ordering is the result of a deliberate design process.

And typed comparisons in C# are the result of a deliberate design process. What's your point? You're invoking a selection bias.

Our decision to not have type casting (i.e. false =/=> 0) demonstrates that types are important to us. That line of thinking necessitates that any comparison between different types must either error or return false because such a comparison is meaningless. We're not erroring, so we must return false.

glyn commented 2 years ago

any comparison between different types must either error or return false because such a comparison is meaningless. We're not erroring, so we must return false.

Are we now agreed then?

gregsdennis commented 2 years ago

If by in agreement you mean that !(@<true) must also return false.

glyn commented 2 years ago

If by in agreement you mean that !(@<true) must also return false.

According to the current draft, @ >= true returns false and so @ < true returns true and thus !(@ < true) returns false.

gregsdennis commented 2 years ago

Okay... You're twisting my examples. I want any expression that contains comparisons between types to return an empty nodelist.

cabo commented 2 years ago

I want all items that are not back-ordered or backordered for less than 10 days.

[?(!@.backordered || @.backordered < 10)]

danielaparker commented 2 years ago

@timbray I'd like to go with the simplest possible thing

The simplest possible thing would be to do what the vast majority of JSONPath implementations do when a comparison happens that is regarded as not supported, which is to abort evaluation and report a diagnostic. That has the additional advantage of being compatible with the charter "capturing the common semantics of existing implementations", if that still matters. It is also consistent with XPATH 3.1, see (https://www.w3.org/TR/xpath-31/#id-general-comparisons) and (https://www.w3.org/TR/xpath-31/#id-handling-dynamic).

The next simplest thing would be to abort evaluation and return an empty list (a few existing JSONPath implementations do that.)

timbray commented 2 years ago

Speaking with my co-chair hat on, I'd like to draw attention to the following text from section 3.1 of the current draft: "The well-formedness and the validity of JSONPath queries are independent of the JSON value the query is applied to; no further errors can be raised during application of the query to a value."

I think this has for a long time represented the consensus of the WG. If someone wants to approach the problems raised in this issue by proposing an exception mechanism, that would require a specific proposal covering the details and how best to specify it. Absent such a proposal existing and getting consensus, I think approaches that include a run-time error/exception mechanism are out of bounds.

timbray commented 2 years ago

The next simplest thing would be to abort evaluation and return an empty list (a few existing JSONPath implementations do that.)

Co-chair hat off: I could live with that. Among other things, it's easy to describe. It's not my favorite approach but it's sensible.

cabo commented 2 years ago

On 2022-07-20, at 22:15, Tim Bray @.***> wrote:

Speaking with my co-chair hat on, I'd like to draw attention to the following text from section 3.1 of the current draft: "The well-formedness and the validity of JSONPath queries are independent of the JSON value the query is applied to; no further errors can be raised during application of the query to a value."

I think this has for a long time represented the consensus of the WG. If someone wants to approach the problems raised in this issue by proposing an exception mechanism, that would require a specific proposal covering the details and how best to specify it. Absent such a proposal existing and getting consensus, I think approaches that include a run-time error/exception mechanism are out of bounds.

Please don’t commingle exceptions with erroring out. Erroring out because of data fed to the expression is not consistent with the above invariant. Exceptions may be processed within the query (e.g., in the classical catch/throw form), and need not violate that invariant. “NaB” is an attempt to add to the data types in such a way that an exception can be handed up the evaluation tree as a return value.

Exceptions tend to make the outcome of the query dependent on the sequence in which parts of the query expression are processed, so they may be violating other invariants (which are implicit in some of our minds).

Grüße, Carsten

timbray commented 2 years ago

I've been thinking about why I keep liking having type-mismatch comparisons be just false, and have made some progress.

I think that $.foo<3 is a compact way of saying IF ($.foo is a number) && (its value is less than 3). So if it's not a number, this is unsurprisingly false. If you believe this then it makes perfect sense that if $.foo is "xyz" then !$.foo<3 is true but $.foo>=3 is false.

glyn commented 2 years ago

@timbray If $.foo is "xyz", then the spec's statement:

comparisons using one of the operators <= or >= yield true if and only if the comparison is between numeric values which satisfy the comparison.

implies that $.foo>=3 is false. Then the spec's statement:

any comparison of two values using one of the operators !=, >, < is defined as the negation of the comparison of the same values using the operator ==, <=, >=, respectively.

implies that $.foo<3, the negation of $.foo>=3, is true.

Essentially, < and > give non-intuitive results for non-numeric comparisons. This is surprising and far from ideal.

However, because comparisons always produce a boolean value, they can be hedged around with other predicates to get the desired result, e.g. we can replace $.foo<3 with $.foo<=3 && $.foo!=3 which is equivalent to $.foo<3 for numeric $.foo and is false for non-numeric $.foo.
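To make the derivation above concrete, here is a minimal Python sketch of the draft rules as quoted in this comment: `<=` and `>=` are true only for numbers that satisfy the comparison, and `<` and `>` are defined as the negations of `>=` and `<=`. The function names and the definition of `eq` (equality only between primitives of the same JSON type) are assumptions made for illustration, not text from the draft.

```python
def is_number(v):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(v, (int, float)) and not isinstance(v, bool)

def eq(a, b):
    # Assumption: == holds only for equal primitives of the same JSON type.
    if isinstance(a, (list, dict)) or isinstance(b, (list, dict)):
        return False
    if is_number(a) and is_number(b):
        return a == b
    return type(a) is type(b) and a == b

def le(a, b):
    # Quoted rule: true iff both are numbers and satisfy the comparison.
    return is_number(a) and is_number(b) and a <= b

def ge(a, b):
    return is_number(a) and is_number(b) and a >= b

def lt(a, b):
    # Quoted rule: < is the negation of >=.
    return not ge(a, b)

def gt(a, b):
    # Quoted rule: > is the negation of <=.
    return not le(a, b)

def hedged_lt(a, b):
    # The workaround from the comment: a <= b && a != b.
    return le(a, b) and not eq(a, b)
```

With these definitions, `lt("xyz", 3)` is true while `hedged_lt("xyz", 3)` is false, and `hedged_lt` agrees with `lt` whenever both operands are numbers.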

glyn commented 2 years ago

The approach of forcing the filter expression to return an empty nodelist whenever it contains an "undesirable" comparison means that:

Let's take an example. Suppose we want to pick out all objects in an array such that the object has a key expiry which is either at least 9 or equal to true. For example, in the JSON:

[ {"expiry": 1}, {"expiry": 9}, {"expiry": true}]

With the current spec, the filter [?@.expiry >=9 || @.expiry == true] has the desired effect and returns the nodelist:

[{"expiry": 9}, {"expiry": true}]

The alternative doesn't allow that kind of hedging around. A solution with the alternative approach is to use a list [?@.expiry >=9, ?@.expiry == true].
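The filter in this example can be sketched in Python as follows. This is only an illustration of the current-draft behaviour described above; `select_expiry` and the helper definitions are hypothetical names, and `eq` assumes that values of different JSON types are never equal.

```python
def is_number(v):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(v, (int, float)) and not isinstance(v, bool)

def ge(a, b):
    # True only when both operands are numbers and a >= b.
    return is_number(a) and is_number(b) and a >= b

def eq(a, b):
    # Assumption: values of different JSON types are never equal.
    if is_number(a) and is_number(b):
        return a == b
    return type(a) is type(b) and a == b

def select_expiry(items):
    # Models [?@.expiry >= 9 || @.expiry == true] over an array of objects.
    return [
        item for item in items
        if "expiry" in item
        and (ge(item["expiry"], 9) or eq(item["expiry"], True))
    ]

data = [{"expiry": 1}, {"expiry": 9}, {"expiry": True}]
result = select_expiry(data)
```

Because the mismatched comparison `true >= 9` simply yields false, the second disjunct still gets a chance to select `{"expiry": true}`.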

glyn commented 2 years ago

Next, let's explore the ordering issue with the alternative approach.

What should the behaviour of [?1 == 1 || @ < $.t] be (where $.t results in a nodelist with a single node with value true)? (Apologies this is for the third time of asking.)

With the approach of forcing the filter expression to return an empty nodelist whenever it contains an "undesirable" comparison, there would appear to be three options:

  1. Prescribe an order of evaluation such as "left to right" and allow short-circuits. Result: all nodes are selected.
  2. Do not prescribe an order of evaluation and allow short-circuits. Result: non-determinism - all or no nodes are selected, depending on the implementation. (That's bad for testing and interop.)
  3. Disallow short-circuits (in which case the order of evaluation doesn't matter). Result: no nodes are selected, but the implementation cannot take advantage of the short-circuit as an optimisation.

I think option 1 is preferable and probably the most intuitive option. I wonder if there are any other issues (apart from being rather prescriptive) with that option? The boolean algebra laws would only apply in general when "undesirable" comparisons are not present.
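The three options can be compared with a small Python sketch. It models an "undesirable" comparison as a hypothetical `Undesirable` exception that deselects the current node, which is only one possible reading of the alternative approach; all names here are assumptions for illustration.

```python
class Undesirable(Exception):
    """Hypothetical signal for a comparison between mismatched types."""

def is_number(v):
    return isinstance(v, (int, float)) and not isinstance(v, bool)

def lt(a, b):
    if not (is_number(a) and is_number(b)):
        raise Undesirable()
    return a < b

def or_left_to_right(left, right):
    # Option 1: fixed left-to-right order with short-circuit.
    return left() or right()

def or_right_to_left(left, right):
    # Option 2: another order an implementation might pick.
    return right() or left()

def or_no_short_circuit(left, right):
    # Option 3: both operands are always evaluated.
    l = left()
    r = right()
    return l or r

def selects(or_impl, node, t):
    # Models [?1 == 1 || @ < $.t] for one node; operands are passed lazily.
    try:
        return or_impl(lambda: 1 == 1, lambda: lt(node, t))
    except Undesirable:
        return False
```

Under this model, `selects(or_left_to_right, 5, True)` selects the node, while the other two strategies do not, which is the non-determinism described under option 2.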

timbray commented 2 years ago

@timbray If $.foo is "xyz", then the spec's statement:

comparisons using one of the operators <= or >= yield true if and only if the comparison is between numeric values which satisfy the comparison.

I suggest changing this to "…using one of the operators <, >, <=, and >=".

Then…

implies that $.foo>=3 is false. Then the spec's statement:

any comparison of two values using one of the operators !=, >, < is defined as the negation of the comparison of the same values using the operator ==, <=, >=, respectively.

I suggest removing this statement because we've defined < and >, then there's some work to tidy up the definition of == and !=. And in every case, any comparison with a type mismatch is always false.

cabo commented 2 years ago

And in every case, any comparison with a type mismatch is always false.

Actually, != with a type mismatch is always true.

I was hoping to extend the definition by negation to < and >, but that seems to create cognitive dissonances.

timbray commented 2 years ago

What should the behaviour of [?1 == 1 || @ < $.t] be (where $.t results in a nodelist with a single node with value true)? (Apologies this is for the third time of asking.)

If you say that type mismatch always yields false, there's no problem, this reduces to true || false, which is to say true. You don't need to say anything about order of evaluation. I think if the spec needs to specify order of evaluation, that is a very severe code smell. I'd find that very hard to accept.

timbray commented 2 years ago

And in every case, any comparison with a type mismatch is always false.

Actually, != with a type mismatch is always true.

No, the rule I'm proposing is simpler: Any comparison with a type mismatch is always false. Which means != is not specified as "the opposite of =="; it's specified as meaning "are values of the same non-structured type, and the values are not equal". I can't tell whether $.foo and true are equal or not if $.foo is not a boolean. They are not comparable.

cabo commented 2 years ago

So $.a != null is always false. Not intuitive to me.

timbray commented 2 years ago

Hmm, having written that, I might have been going overboard. It would be perfectly OK to define != as "true if the operands are of different types, or of the same type but not equal" and that would make perfect sense.

cabo commented 2 years ago

We also need to design in an escape clause for structured values, which so far we don't compare.

timbray commented 2 years ago

OK, I think I have probably said enough on this, but I have been doing an unsatisfactory job communicating what seems to me like a reasonably straightforward proposal. Once the spec stabilizes a bit I'd be happy to do a PR. The key language, under "Comparisons" in 3.4.8, would be:

== True if both operands are of the same type, that type is not a structured type, and the values are equal; otherwise false.

!= True if either operand is of a structured type, or if the operands are of different types, or if they are of the same type but the values are unequal; otherwise false.

> True if both operands are numbers and the first operand is strictly greater than the second; otherwise false.

< True if both operands are numbers and the first operand is strictly less than the second; otherwise false.

>= True if both operands are numbers and the first operand is greater than or equal to the second; otherwise false.

<= True if both operands are numbers and the first operand is less than or equal to the second; otherwise false.
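The proposed wording translates fairly directly into code. The following Python sketch is one possible reading of it; the function names and the mapping from Python types to JSON types are assumptions for illustration.

```python
def json_type(v):
    # Map a Python value to a JSON type name (bool is checked before number).
    if v is None:
        return "null"
    if isinstance(v, bool):
        return "boolean"
    if isinstance(v, (int, float)):
        return "number"
    if isinstance(v, str):
        return "string"
    if isinstance(v, list):
        return "array"
    if isinstance(v, dict):
        return "object"
    raise TypeError("not a JSON value")

def is_structured(v):
    return json_type(v) in ("array", "object")

def both_numbers(a, b):
    return json_type(a) == "number" and json_type(b) == "number"

def eq(a, b):
    return json_type(a) == json_type(b) and not is_structured(a) and a == b

def ne(a, b):
    return (is_structured(a) or is_structured(b)
            or json_type(a) != json_type(b) or a != b)

def gt(a, b):
    return both_numbers(a, b) and a > b

def lt(a, b):
    return both_numbers(a, b) and a < b

def ge(a, b):
    return both_numbers(a, b) and a >= b

def le(a, b):
    return both_numbers(a, b) and a <= b
```

For a string operand such as `"xyz"`, both `lt("xyz", 3)` and `ge("xyz", 3)` are false, so `!$.foo<3` is true while `$.foo>=3` is false, as described earlier in the thread.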

glyn commented 2 years ago

That approach, like the current draft, preserves the laws of boolean algebra and therefore the order of evaluation doesn't matter. It gets rid of some nasty surprises (such as true < true) which are present in the current draft.

It seems that != is the negation of ==, in which case it's probably simpler to define it as such.

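However, the approach is not completely free of surprises because it breaks:

the converse relationship between < and >= that a < b if and only if not a >= b
the converse relationship between > and <= that a > b if and only if not a <= b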

as well as:

Thus <= and >= would no longer be partial orders and < and > would no longer be strict total orders when these four operators are considered as binary relations over the whole set of values.

We could rationalise this by thinking of these four operators as orderings of numbers with non-numbers added as "unordered" extensions.

timbray commented 2 years ago

It seems that != is the negation of ==, in which case it's probably simpler to define it as such.

Eh, I like writing each one out in clear English rather than having them depend on each other. But, editorial choice.

However, the approach is not completely free of surprises because it breaks:

the converse relationship between < and >= that a < b if and only if not a >= b
the converse relationship between > and <= that a > b if and only if not a <= b

It's still true if a & b are both numbers. If you accept that "<" means "A is a number AND B is a number AND A<B" then I think all the logical relationships work out properly.

Anyhow, it would be helpful at this point if someone could offer the WG a summary of what the coherently-proposed options are that are available to us to finish off this issue? I will if nobody else leaps in, but I'm not neutral on the issue.

glyn commented 2 years ago

However, the approach is not completely free of surprises because it breaks:

the converse relationship between < and >= that a < b if and only if not a >= b
the converse relationship between > and <= that a > b if and only if not a <= b

It's still true if a & b are both numbers.

Agreed.

If you accept that "<" means "A is a number AND B is a number AND A<B" then I think all the logical relationships work out properly.

Only if A and B are numbers! For example, let a=b=true, then a<b is false but not (a>=b) is true. Thus a < b if and only if not a >= b is false in this case.
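The counterexample can be checked mechanically. The sketch below assumes the "numbers only" definitions of `<` and `>=` and lists every pair from a small sample for which `a < b` differs from `not (a >= b)`; the sample values and function names are illustrative assumptions.

```python
def is_number(v):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(v, (int, float)) and not isinstance(v, bool)

def lt(a, b):
    return is_number(a) and is_number(b) and a < b

def ge(a, b):
    return is_number(a) and is_number(b) and a >= b

def converse_violations(values):
    # Pairs where "a < b iff not (a >= b)" fails.
    return [(a, b) for a in values for b in values
            if lt(a, b) != (not ge(a, b))]

violations = converse_violations([2, 3, True, "x"])
```

Every violation involves at least one non-number, which matches the point made by both sides: the law holds on numbers and fails elsewhere.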

glyn commented 2 years ago

Anyhow, it would be helpful at this point if someone could offer the WG a summary of what the coherently-proposed options are that are available to us to finish off this issue? I will if nobody else leaps in, but I'm not neutral on the issue.

Let me try:

  1. The current draft as of 2022-07-22.
  2. The option of making <, >, <=, and >= false when not comparing two numbers.
  3. The NaB proposal.
  4. The NaB proposal, but with left-to-right evaluation and short-circuiting.
  5. The option of forcing a total order among all values like Erlang does.

(I think options 2-5 also need the current spec wording about paths which result in an empty nodelist, about which I think there is consensus.)

timbray commented 2 years ago

  1. The option of requiring an empty node list if any type-mismatch comparisons occur anywhere in the expression. [right?]

glyn commented 2 years ago

  1. The option of requiring an empty node list if any type-mismatch comparisons occur anywhere in the expression. [right?]

That's certainly a coherent option, so yes. But please note I don't think anyone actually proposed this, although I may have misunderstood @gregsdennis. I thought his idea was to skip the current item if there was a type-mismatch (this has the same external semantics as the NaB proposal). (If there was a static type mismatch, then clearly the result would be an empty node list.)

glyn commented 2 years ago

On reflection, I think my preferred option is number 2: making <, >, <=, and >= false when not comparing two numbers:

cabo commented 2 years ago

Option 2 is not my preferred option, but I sure could live with that.

timbray commented 2 years ago

Option 2 is not my preferred option, but I sure could live with that.

What is?

cabo commented 2 years ago

I probably would have gone towards full symmetry of < and >=. I'm also not sure the simple approach we now have is optimal in the face of future extensions (catch/throw is more like that). But we also need to make sure we align with existing implementations, and keep some simplicity.

danielaparker commented 2 years ago

@glyn:

  1. The current draft as of 2022-07-22.
  2. The option of making <, >, <=, and >= false when not comparing two numbers.
  3. The NaB proposal.
  4. The NaB proposal, but with left-to-right evaluation and short-circuiting.
  5. The option of forcing a total order among all values like Erlang does.

(I think options 2-5 also need the current spec wording about paths which result in an empty nodelist, about which I think there is consensus.)

It is noted though that (1) - (4) make the specification broadly incompatible with all JSONPath implementations that are represented in the JSONPath Comparisons. For example, most JSONPath implementations support string comparisons with <, >, <=, and >=, including all of the Javascript ones, and Java Jayway, see Filter expression with greater than string. Proposals (1)-(4) would produce results that users of these implementations might find surprising.

In addition, a few of the JSONPath implementations support array comparisons with ==, and !=, including perhaps the most widely used implementation of all, Java Jayway, see Filter expression with equals array or equals true, Filter expression with equals array or equals true and Filter expression with equals array or equals true. Users coming from a Jayway background might find these proposed rules surprising, and, in fact, a regression.

Users of other implementations that don't currently support a comparison might expect a diagnostic message, as that is how the vast majority of current implementations behave, rather than a result list of questionable usefulness. It is noted that a few implementations return an empty result list if encountering a comparison deemed invalid. But none exhibit behaviour as proposed in (1) - (4).

It seems to this commentator that the credible alternatives, assuming some connection with prior work is still considered desirable, are

Regarding the last alternative, I'm not familiar with what Erlang does, but there is considerable prior work on defining complete orderings over JSON values, it's a common requirement that JSON values have to be comparable. There's a degree of arbitrariness involved, when values of different types are to be compared, but similar conventions have been adopted.

EDIT: Jayway supports array comparisons with ==, and !=, but not with <, >, <=, and >= as I originally said.

cabo commented 2 years ago

It is noted though that (1) - (4) make the specification broadly incompatible with all JSONPath implementations that are represented in the JSONPath Comparisons. For example, most JSONPath implementations support string comparisons with <, >, <=, and >=, including all of the Javascript ones, and Java Jayway, see Filter expression with greater than string. Proposals (1)-(4) would produce results that users of these implementations might find surprising.

Yep. Simple solution: Provide for string comparisons. This is complicated if one accepts a lot of diverse user requirements, but maybe we are just going for compatibility with the installed base.

Define a complete ordering over JSON values

This is becoming ugly for maps, but even that can be done.

cabo commented 2 years ago

For an existence proof

https://www.erlang.org/docs/19/reference_manual/expressions.html#id81100

Note that map and tuple ordering in Erlang are optimized for speed, not for minimal surprise. We would probably want to go for per-element/per-ordered-entry comparison.

timbray commented 2 years ago

I agree that strings should be comparable. Does anyone have a good counter-argument?

As previously noted, up till now the WG has not favored the abort/diagnostics approach, but is free to change its mind (I don't have any opinions on this).

If the WG wants to define orderings on all the JSON types, that doesn't seem insane. But neither does stopping at numbers and strings. Hmm, all the JSON types? What about true/false/null?

cabo commented 2 years ago

I agree that strings should be comparable. Does anyone have a good counter-argument?

The counter argument was that this is hard to do well, so it didn't fit into an MVP. But it is easy to sort Unicode Scalar Values.

As previously noted, up till now the WG has not favored the abort/diagnostics approach, but is free to change its mind (I don't have any opinions on this).

I continue to prefer avoiding that -- this is simply part of the transition to an industrial strength spec.

If the WG wants to define orderings on all the JSON types, that doesn't seem insane. But neither does stopping at numbers and strings. Hmm, all the JSON types? What about true/false/null?

Yes! (See Erlang link above.)

danielaparker commented 2 years ago

If the WG wants to define orderings on all the JSON types, that doesn't seem insane. But neither does stopping at numbers and strings. Hmm, all the JSON types? What about true/false/null?

It's arbitrary, but the usual approach is to associate ordinal numbers with each type (null,true,false,string,number,array,object) and compare the numbers when the types are different. I originally took that approach from nlohmann json.

Not everyone agrees with that, for example, this Python JSON Comparison Package considers it an error if types differ, see (https://github.com/rugleb/JsonCompare/blob/master/jsoncomparison/compare.py).

cabo commented 2 years ago

Here's my order:

null < false < true < number < string < array < map

Numbers are compared as per their I-JSON interpretation.

Strings and arrays are ordered lexicographically (in the computer science, not the lexicographical sense -- we're probably not going for https://www.unicode.org/reports/tr10/tr10-45.html).

Ordering between two maps works by ordering the keys within each map first, and then ordering them like arrays made up of the map entries in the form [key, value, key, value, ...].
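As a rough illustration of how compact this ordering is, here is a Python sketch that turns a JSON value into a sort key following the rules above. The name `sort_key` and the use of Python's built-in tuple, list and string comparison are assumptions of this sketch; numbers are compared as Python numbers, which only approximates the I-JSON interpretation.

```python
def sort_key(v):
    # Type rank: null < false < true < number < string < array < map.
    if v is None:
        return (0,)
    if v is False:
        return (1,)
    if v is True:
        return (2,)
    if isinstance(v, (int, float)):
        return (3, v)
    if isinstance(v, str):
        # Python compares str by code point, i.e. lexicographically.
        return (4, v)
    if isinstance(v, list):
        # Arrays compare element by element; a proper prefix sorts first.
        return (5, [sort_key(x) for x in v])
    if isinstance(v, dict):
        # Maps: sort keys, then compare like [key, value, key, value, ...].
        flat = []
        for k in sorted(v):
            flat.append(sort_key(k))
            flat.append(sort_key(v[k]))
        return (6, flat)
    raise TypeError("not a JSON value")

def compare(a, b):
    # Three-way comparison built on the sort key: -1, 0 or 1.
    ka, kb = sort_key(a), sort_key(b)
    return (ka > kb) - (ka < kb)

mixed = [{"a": 1}, "b", [1, 2], 3, True, None, False, "a", [1]]
ordered = sorted(mixed, key=sort_key)
```

Values of different types are decided by the rank alone, so the second tuple element is only ever compared between values of the same type.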

danielaparker commented 2 years ago

@cabo

Here's my order:

null < false < true < number < string < array < map

Arbitrary, but as good as any

Numbers are compared as per their I-JSON interpretation.

Strings and arrays are ordered lexicographically (in the computer science, not the lexicographical sense -- we're probably not going for https://www.unicode.org/reports/tr10/tr10-45.html).

Ordering between two maps works by ordering the keys within each map first, and then ordering them like arrays made up of the map entries in the form [key, value, key, value, ...].

Looks okay. I would have expressed it as comparing two sequences of name-value pairs (ordered by name)

[(key1, value1), (key2, value2), ...]

but I think that's equivalent.

glyn commented 2 years ago

I could live with total ordering of all JSON values, particularly because we then get the full set of ordering laws, which allow safe refactoring of expressions:

My main reservation with this approach is the size/complexity of the spec relative to the value to users, but let's see what it would take to nail down the edge cases.

I'd be grateful if someone could define the kind of lexicographic ordering of UNICODE strings which is being suggested. (Sorry if it's obvious to everyone else. I'm fuzzy about how encodings and escapes would be dealt with. And what about strings outside I-JSON?)

What is the ordering for numbers outside I-JSON?

What is the ordering for objects outside I-JSON (specifically those with duplicate keys)?

cabo commented 2 years ago

I'd be grateful if someone could define the kind of lexicographic ordering of UNICODE strings which is being suggested. (Sorry if it's obvious to everyone else. I'm fuzzy about how encodings and escapes would be dealt with. And what about strings outside I-JSON?)

A text string is an array of Unicode Scalar Values. Sort like arrays.

What is the ordering for numbers outside I-JSON?

Mathematical ordering always works. Just wanted to relax this to I-JSON.

What is the ordering for objects outside I-JSON (specifically those with duplicate keys)?

Well, again, implementations should not have to handle this. But you'll notice that if someone wants to make this extension, you just need to add "order map entries by key, then by value" to the algorithm I gave.

timbray commented 2 years ago

A text string is an array of Unicode Scalar Values. Sort like arrays.

There are lots of ways to get lost in details *cough* code points *cough* normal forms, but it’s very helpful to know that if you sort UTF-8 streams treating them as unsigned byte lists, you are also sorting by Unicode Scalar Value, which is the right thing to do.

timbray commented 2 years ago

@glyn:

I could live with total ordering of all JSON values … My main reservation with this approach is the size/complexity of the spec relative to the value to users...

Yes. It's not just a “reservation”, it’s a huge issue. And looking at the thread, it seems clear to me that imposing such an ordering involves decisions that seem whimsical and arbitrary, based on philosophy or maybe theology. Which is to say open to endless debate.

At this point I am in favor of ordering strings but will not argue if it seems others would rather not. I am becoming pretty strongly against an attempt to specify universal JSON cross-type value ordering or really any attempt to compare structured type values. I suggest we leave that for a future WG. I imagine >*, ==*, and friends that implement a universal-type-comparability framework, specified by people who aren't us.

cabo commented 2 years ago

On 25. Jul 2022, at 07:36, Tim Bray @.***> wrote:

A text string is an array of Unicode Scalar Values. Sort like arrays.

There are lots of ways to get lost in details *cough* code points *cough* normal forms

(Which we don’t do)

but it’s very helpful to know that if you sort UTF-8 streams treating them as unsigned byte lists, you are also sorting by Unicode Scalar Value, which is the right thing to do.

Yes. But don’t say that too loud, or you will get the attention of people who think that JSON strings should instead be sorted by UTF-16 code units. (Yes, the non-IETF “JSON Canonicalization Scheme”, RFC 8785, does this. You can’t make this stuff up.)

I think it is easy to get consensus on ordering by Unicode Scalar Values, and the fact that UTF-8 was designed to be order-preserving then is a nice topping for that obvious decision.
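The difference between the orderings mentioned above shows up as soon as one string contains a character outside the Basic Multilingual Plane. The following Python sketch compares three sort keys for two single-character strings; the specific characters are chosen only for illustration.

```python
bmp = "\uff5e"         # U+FF5E, inside the Basic Multilingual Plane
astral = "\U0001f600"  # U+1F600, encoded as a surrogate pair in UTF-16

def scalar_key(s):
    # Order by Unicode scalar value.
    return [ord(c) for c in s]

def utf8_key(s):
    # Order by UTF-8 bytes, compared as unsigned byte lists.
    return s.encode("utf-8")

def utf16_key(s):
    # Big-endian bytes compare the same way as 16-bit code units.
    return s.encode("utf-16-be")

by_scalar = sorted([astral, bmp], key=scalar_key)
by_utf8 = sorted([astral, bmp], key=utf8_key)
by_utf16 = sorted([astral, bmp], key=utf16_key)
```

Here `by_scalar` and `by_utf8` agree, while `by_utf16` puts the astral character first because its leading surrogate (0xD83D) is numerically smaller than 0xFF5E.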

Grüße, Carsten

cabo commented 2 years ago

On 25. Jul 2022, at 07:48, Tim Bray @.***> wrote:

@glyn:

I could live with total ordering of all JSON values … My main reservation with this approach is the size/complexity of the spec relative to the value to users...

Yes. It's not just a “reservation”, it’s a huge issue.

I just wrote up the entire spec for that in an issue comment. The “size” or “complexity” is not an issue.

And looking at the thread, it seems clear to me that imposing such an ordering involves decisions that seem whimsical and arbitrary, based on philosophy or maybe theology. Which is to say open to endless debate.

That depends mostly on the ability of the WG to actually decide. I think you are underselling us here.

At this point I am in favor of ordering strings but will not argue if it seems others would rather not. I am becoming pretty strongly against an attempt to specify universal JSON cross-type value ordering or really any attempt to compare structured type values. I suggest we leave that for a future WG. I imagine >*, ==*, and friends that implement a universal-type-comparability framework, specified by people who aren't us.

You are overstating this — this is not universal, just JSON. And, I’m sorry, but this stuff is my pay grade. This has been done before (Erlang example — and see how short that spec is!), and the JSON generic data model fortunately is even simpler. Re structured values: JSON has only two kinds, and arrays are a no-brainer. Maps are a bit more work, but non-surprising ways to order maps can be specified in a sentence or two.

I think we have seen it demonstrated that not specifying this leads to pain.

Grüße, Carsten

glyn commented 2 years ago

If we can compare structured values using >= and <=, then it seems churlish not to compare them using == (since, in a total ordering, a >= b && b >= a implies a == b).

I like that, but since we've held off comparing structured values for equality for so long, perhaps someone will remember what the difficulty was, in which case wouldn't the same difficulty also apply to >= and <=?

cabo commented 2 years ago

I like that, but since we've held off comparing structured values for equality for so long, perhaps someone will remember what the difficulty was, in which case wouldn't the same difficulty also apply to >= and <=?

We just didn't want to do the work, thinking that an MVP could be done without that.

After agonizing for a while on the effects of a partial ordering, it is now clear that going for a total ordering is the path of least resistance.

glyn commented 2 years ago

Ok. Let's see how a total ordering plays out and whether we can gain consensus for that.

I wonder if we should fold paths returning an empty nodelist into the total ordering. If $.foo returns an empty node list, the current draft says, for all a other than a path resulting in an empty node list:

These, of course, break the total ordering laws.

Perhaps instead we should extend the proposed total order to:

path-returning-an-empty-nodelist < null < false < true < number < string < array < map

cabo commented 2 years ago

path-returning-an-empty-nodelist < null < false < true < number < string < array < map

Or, to make the phrasing a bit less unwieldy,

absent < null < false < true < number < string < array < map
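One way to picture the extended order is to give the empty nodelist its own sentinel that ranks below every JSON value. The Python sketch below only covers the cross-type ranks; `ABSENT`, `type_rank` and `lookup` are hypothetical names introduced for illustration.

```python
ABSENT = object()  # stands for a path that yields an empty nodelist

def type_rank(v):
    # absent < null < false < true < number < string < array < map
    if v is ABSENT:
        return 0
    if v is None:
        return 1
    if v is False:
        return 2
    if v is True:
        return 3
    if isinstance(v, (int, float)):
        return 4
    if isinstance(v, str):
        return 5
    if isinstance(v, list):
        return 6
    if isinstance(v, dict):
        return 7
    raise TypeError("not a JSON value")

def lookup(obj, key):
    # A missing member is ABSENT, which is distinct from an explicit null.
    return obj[key] if key in obj else ABSENT

ranks = [type_rank(v) for v in (ABSENT, None, False, True, 0, "", [], {})]
```

The sentinel keeps a missing member distinct from a member whose value is null, so `lookup({}, "a")` ranks strictly below `lookup({"a": None}, "a")`.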