descendant selector non-determinism

glyn commented 2 years ago

Currently, the descendant selector spec allows some non-determinism which may be unexpected.

The spec currently says that in the resultant nodelist from the descendant selector:

nodes occur before their children, and
nodes of an array occur in array order.

This sometimes allows multiple orderings when the result includes children of a node, but not the node itself.

For example, with the document:

[0, [1]]

the selector $..[0] could, according to the spec, produce the nodelist:

[0, 1]

or the nodelist:

[1, 0]

This is surprising given that there are no objects in the document, since object ordering is the usual source of non-determinism.

We could change the spec to constrain the order of traversal rather than the order of the resultant nodelist, but this seems to be getting into implementation detail.

Is the current approach acceptable?

graza commented 2 years ago

Hi Glyn,

I treated the descendant-selector and the following object member name, index-selector, or wildcard as two separate steps. In order for the next step to be able to return values as per the examples, the first node that needs to be presented by the descendant-selector to the next step is the entirety of the current node (in case of $.. the whole document).

nodes occur before their children,

Given what I've written above, in the example, the whole document [0,[1]] is always the first thing presented to the [0] index selector.

nodes of an array occur in array order.

Given this, when the descendant-selector goes into the array, it will present the nodes 0 and [1] to the [0] index selector. (Note that node 1 is a child of node [1], so 1 comes after.)

Since the [0] selector presumably only applies to arrays, it only applies to nodes [0,[1] and [1], and the "nodes before children" gives an explicit order, I don't see how the result could be [1,0].

Have I missed something? Is it a matter of updating the semantics to say the entirety of the current node is looked at first?

(Side note - the spec is not as clear on the element-index only applying to arrays. It's implied, but not stated as clearly as it is for quoted-member-names and objects.)

Kind Regards, Graham

glyn commented 2 years ago

Have I missed something? Is it a matter of updating the semantics to say the entirety of the current node is looked at first?

I think something like that is probably desirable, but without unduly constraining implementations.

(Side note - the spec is not as clear on the element-index only applying to arrays. It's implied, but not stated as clearly as it is for quoted-member-names and objects.)

Thanks. I'll take a look at that with an editorial PR.

glyn commented 2 years ago

I think rewording the second condition ("nodes of an array occur in array order") achieves the desired ordering:

nodes of a given array, and descendants of nodes of the given array, occur in array order.

gregsdennis commented 2 years ago

I don't see how the result could be [1,0].

I'm still here. I'm not sure that anything needs to be updated. I think the current text (node then descendants) is sufficient to guarantee the correct ordering.

In this case,

[0, [1]] is processed, producing 0
then [1] is processed, producing 1

It must be this way because of array ordering.

.. gives the (ordered) node list [ [0,[1]], 0, [1], 1 ]. Of those, we take the first elements of arrays, which is 0 from [0,[1]] and 1 from [1]. I'm really struggling to see how the current wording can yield anything else.

glyn commented 2 years ago

The behaviour of four existing implementations is interesting. With the document:

[["b"], ["c"]]

and the selector $..*, Nebhale produces a result obeying the reworded conditions (but starts the nodelist with $):

[
   [
      [
         "b"
      ],
      [
         "c"
      ]
   ],
   [
      "b"
   ],
   "b",
   [
      "c"
   ],
   "c"
]

Jayway and Goessner place "b" after ["c"], contrary to the reworded conditions:

[
   [
      "b"
   ],
   [
      "c"
   ],
   "b",
   "c"
]

and Gatling doesn't even get close to the spec:

[
   [
      [
         "b"
      ],
      [
         "c"
      ]
   ]
]

glyn commented 2 years ago

I don't see how the result could be [1,0].

I'm still here. I'm not sure that anything needs to be updated. I think the current text (node then descendants) is sufficient to guarantee the correct ordering.

In this case,
* `[0, [1]]` is processed, producing `0`

* then `[1]` is processed, producing `1`
It must be this way because of array ordering.

.. gives the (ordered) node list [ [0,[1]], 0, [1], 1 ]. Of those, we take the first elements of arrays, which is 0 from [0,[1]] and 1 from [1]. I'm really struggling to see how the current wording can yield anything else.

The problem with the current wording is that it allows too much freedom in the resultant nodelist. The result [1, 0] is allowed because:

nodes occur before their children (since 0 isn't a child of 1 or vice versa), and
nodes of an array occur in array order (since 0 and 1 are not nodes of the same array).

I think you are interpreting ..* differently from the spec. You seem to be regarding it as a combination of .. and * (all?), but there is no .. selector as such in our spec (the summary table mentions .. and perhaps this was misleading you - it needs changing). (Also, you include the input node itself in the output of .. whereas the spec allows only strict descendants.)

cabo commented 2 years ago

The spec currently says that in the resultant nodelist from the descendant selector:

nodes occur before their children, and

nodes of an array occur in array order.

It probably should say that this is the node order of the node list that describes the entire input JSON value.

A node list is a subset of that novelist, preserving order.

glyn commented 2 years ago

The spec currently says that in the resultant nodelist from the descendant selector:

nodes occur before their children, and

nodes of an array occur in array order.

It probably should say that this is the node order of the node list that describes the entire input JSON value.

Hmmm. That's a bit tricky because there isn't necessarily a unique node order (because objects are not ordered). So, at best, it describes a collection of ordered nodelists and then the resultant nodelist is a subset of any one of those nodelists, preserving order.

A node list is a subset of that novelist, preserving order.

I think the reworded condition above captures the constraint more simply.

gregsdennis commented 2 years ago

nodes occur before their children, and

nodes of an array occur in array order.

These rules necessitate that 0 come before 1 in the above case.

Let's look at the [["b"], ["c"]] case because it's more interesting given the array as the first item.

Evaluating the array in sequence and nodes before children has two possible processes:

Evaluate each item, evaluating its children before moving to the next item.
- Order of processing is [["a"],["b"]], ["a"], "a", ["b"], "b".
Queue each item, queuing its children as the item is processed, eventually running through the queue.
- Order of processing is [["a"],["b"]], ["a"], ["b"], "a", "b".

In both cases, the resulting node list is [ ["a"], "a", "b" ]. You never get "a" before "b", which means that you will always get [ 0, 1 ] in the first example.

The only question is whether .. does 1 or 2 from above.

glyn commented 2 years ago

I would agree, if the spec required evaluating the array in sequence and nodes before children, but it doesn't. It only talks about the order of the resultant nodelist. What text in the spec makes you think it is talking about the order of evaluation, as such?

Of course, the order of the resultant nodelist implies some things about the order of evaluation, but it doesn't say anything about the ordering of nodes relative to the descendants of their siblings.

gregsdennis commented 2 years ago

if the spec required evaluating the array in sequence and nodes before children,

I thought we already defined this. I'll need to consult the spec again.

glyn commented 2 years ago

It's surprising the comparison project has the following concensus for $..* applied to [[0], [1]]:

[
  [
    0
  ],
  [
    1
  ],
  0,
  1
]

graza commented 2 years ago

Perhaps the others also did what I did and treated the .. and what follows as two separate steps? The descendant part only visits each of the nodes leaving it to the second part to process what's there. This allows the same code to be used as for the non-descendant version of the same selector. (It also means that my descendant selector allows list selectors because I removed the other types of selectors in favour of that.)

If instead ..* is treated as a single step and takes care of both jobs, descending and wildcard, then I could see how the following alternate result could be returned:

[
    [ 0 ],
    0
    [ 1 ],
    1
]

The definitions at the top of this JSONPath document mentions a "transitive closure over the chidlren relation" but that was too mathematical for me.

I had a look at XPath's descendant-or-self but the spec didn't tell me exactly what to expect. I tried the following XML using an online XPath tester:

<arr>
  <el>
    <arr>
      <el>0</el>
    </arr>
  </el>
  <el>
    <arr>
      <el>1</el>
    </arr>
  </el>
</arr>

An XPath of //el produced something like the "alternate result" I've given above:

Element='<el>
   <arr>
      <el>0</el>
   </arr>
</el>'
Element='<el>0</el>'
Element='<el>
   <arr>
      <el>1</el>
   </arr>
</el>'
Element='<el>1</el>'

gregsdennis commented 2 years ago

Perhaps the others also did what I did and treated the .. and what follows as two separate steps?

Yes, that's what I've done in my implementation as well.

But in terms of the specification, .. must always be coupled with another selector, whether that's a name, wildcard, or bracket-syntax. As such it makes sense to define it in conjunction with the "second" selector.

There's some discussion in the PR as well.

cabo commented 2 years ago

On 6. Jul 2022, at 08:32, Glyn Normington @.***> wrote:

The spec currently says that in the resultant nodelist from the descendant selector:

• nodes occur before their children, and • nodes of an array occur in array order. It probably should say that this is the node order of the node list that describes the entire input JSON value.

Hmmm. That's a bit tricky because there isn't necessarily a unique node order (because objects are not ordered). So, at best, it describes a collection of ordered nodelists and then the resultant nodelist is a subset of any one of those nodelists, preserving order.

All the spec needs to require is that there is an input order that the output nodelist is a subsequence of.

A node list is a subset of that novelist, preserving order.

I think the reworded condition above captures the constraint more simply.

After you have banged it into shape to actually do this, I’m not so sure…

Grüße, Carsten

glyn commented 2 years ago

All the spec needs to require is that there is an input order that the output nodelist is a subsequence of.

That would work, but we'd need to define "input order" carefully as it's not necessarily unique because of objects.

glyn commented 2 years ago

@cabo I had a go at the "input order" approach. See the additional commit in PR 218.

cabo commented 2 years ago

Great. Maybe we should give orders that satisfy the criteria a name; with your text, maybe "visit order"?

glyn commented 2 years ago

Great. Maybe we should give orders that satisfy the criteria a name; with your text, maybe "visit order"?

Happy to attempt something once https://github.com/ietf-wg-jsonpath/draft-ietf-jsonpath-base/pull/220 lands. I feel that "visit order" is a little too generic. I would prefer a more specific term such as "array preorder". WDYT?

(Happy to continue this thread here, but feel free to move to a fresh editorial issue if you prefer.)

ietf-wg-jsonpath / draft-ietf-jsonpath-base

descendant selector non-determinism #215