Closed michaelhkay closed 8 months ago
This is an interesting proposal. A few questions/observations:
Note 1: MarkLogic made their JSON types instances of node() so that they had defined parent, etc. accessors. Given that XPath/XQuery made arrays/maps functions then that is not possible, so this is potentially a valid workaround. -- Effectively, this provides a way to make arrays/maps/etc. used in JSON and JSON-like objects into pseudo-nodes.
Note 2: This is similar/related to the proposal/idea of making the parts of atomic types (year, day, month, authority, uri-path, etc.) available to users.
@rhdunn
I agree, that’s a creative and promising approach.
My immediate thoughts revolve around performance: When dealing with small or medium-sized JSON files, there will be no reason to think about performance. However, I’m aware of use cases in which millions of map & array entries are processed, and it would be significant overhead to create additional properties and attach them to each traversed item in main memory. That’s particularly relevant if singletons are used for representing common atomic items.
Some years ago, we simplified our data model for similar reasons: We stored (and eventually dropped) transient score values that resulted from XQuery Full Text queries. In some cases, it’s easy to skip the data generation if it will never be needed. In other cases, it can be pretty hard.
But I should definitely spend more time on the proposal before giving a final judgement.
Yes, I can think of various way of making the behaviour optional (so you don't get the properties unless you're going to need them), but I thought I would try to explore whether we can optimize the costs away first.
In the prototype Saxon implementation (see https://www.saxonica.com/documentation12/index.html#!functions/saxon/with-pedigree) you start by marking a map or array as being "with pedigree", and the "pedigree" (=transient properties) is only maintained by lookup operations that start from such a map or array.
This proposal feels liberating, and natural: just yesterday to retain key info, I pushed copies into a new map entry within the map in each one's value. Felt very hacky. And made me wonder about performance, because of possible ballooning effects.
It may be worth thinking about a mechanism that permits selective transient properties. There will be cases with very large maps where one wants to retain ¶key but not ¶parent.
I like the look and semantics of the pilcrow, but it will be a nuisance to find it. What about @@ or @ followed by some other punctuation character?
Finally, an opportunity to construct elegant predicate filter expressions.
So, I promised that part 2 of the proposal would address deep update.
Let's start with an example of what it should look like to users. The following increases the prices of all products, at any depth, by 10% (returning a new value that's the same as the original in all other respects):
update(select := ??product?price,
change := ->($p){$p * 1.1})
How do we make this work? I'll start with a very informal explanation, and then sketch a more formal definition.
Firstly, there's a third argument $root
which I've defaulted to the context item.
The select
argument selects a sequence of items; you can use any expression you like in principle, but the returned items must all have the transient property ¶root equal to $root - that means in practice (a) the expression must select only within the subtree rooted at $root, and (b) it must only use operations that set or preserve the ¶root property on their result.
The change
argument is a function that is applied to each selected item; the result of the function replaces the original in the corresponding position of the result tree. Internally, of course, the tree is immutable, so this means making a copy of the parent in which the relevant children have been modified, and so on recursively until you get back to the root.
If you want to add things to the tree, or delete things, then you do that by making a change to the parent. For example, to add a product, you can do update(select:=??products, change:=array:append(?, $newProduct))
.
There's a complication if the selection includes a node that is an ancestor of another selected node. There are various ways we could handle this: I would propose that it an ancestor is selected and changed, then neither its old children nor its new children are further processed.
Now, how to describe this more formally? I've glossed over a number of issues that need to be addressed. For example, I said that the ¶root property in all selected values must be equal to $root -- but what does "equal" mean here? Maps and arrays, remember, have no identity.
I think the best way to tackle this is probably to give values a transient identity for the duration of the operation. So we can sketch a formal description as follows:
Of course, an actual implementation will work differently. Many implementations will use underlying data structures (such as Java immutable maps) where (a) the nodes already have a perfectly usable ID, and (b) virtual copies can be made cheaply, reusing parts of the tree that haven't changed. But we don't need to talk about that.
An alternative syntax would be the rather COBOL-like
update $root changing ??product?price to (. * 1.1)
Using custom syntax rather than a function gives us more freedom in defining the semantics, but of course there are downsides as well.
One construct that has been proven to be particularly successful in BaseX is the update
keyword (https://docs.basex.org/wiki/XQuery_Update#update) for main-memory node updates:
(<a/>, <b/>) update {
rename node . as 'x'
} update {
insert node "y" into .
}
(:result :)
<x>y</x>
<x>y</x>
A simpler variant (for single nodes, without chaining) had been adopted to the XQuery Update 3 Facility as transform with
.
Unfortunately, due to grammar restrictions, two keywords are required for XQuery Update operations (insert node
instead of insert
, etc.). On the other hand, many users have already become accustomed to the syntax. A similar syntax could be used for maps and arrays:
(: existing syntax for nodes :)
document {
<root><product><price>2</price></product></root>
} update {
//product/price ! (replace value of node . with . * 1.1)
}
(: syntax for maps and arrays? :)
[
map { 'product': map { 'price': 2 } },
map { 'product': map { 'price': 3 } }
] update {
??product?price ! (replace value of . to . * 1.1)
}
Closing this as it was essentially implemented in PR #988.
After exploring many alternatives, I have come to the conclusion that we can't solve the problem of deep navigation and transformation of JSON structures without a data model change.
Most of the problems boil down to this: JSON trees do not have parent pointers, therefore after navigating down to a leaf node of the tree, we cannot get any information from higher up the tree. The solution to this (the "zipper" model) is to retain transient information about how a particular node in the tree was reached, so that we can retrace our steps and revisit nodes that were passed en route.
The change I propose is quite minor, but powerful: Any XDM value can be augmented with a set of transient properties represented as a set of key-value pairs. These properties are ignored (and typically dropped) by all operations on a value, except where otherwise specified. For the purpose of exposition, I'll use the syntax
$value¶name
to refer to the transientname
property of$value
.We'll change the semantics of
map:get()
andarray:get()
, and the associated lookup operators, so that the resulting values have transient properties indicating how they were selected. For example, givenlet $name := $person?firstName
the resulting value (perhaps the string "Michael") will be augmented with transient properties
and derived properties:
We can also define other "downward selection" operations such as
map:find
, andarray:foot
to retain these transient properties. So for examplemap:find($json, 'firstname')[.='Michael']¶parent?surname
now finds the surnames of anyone named 'Michael', at any depth of the tree.If we turn back to the use cases in my 2016 paper on transforming JSON
https://www.saxonica.com/papers/xmlprague-2016mhk.pdf
The first use case (bulk update) relied on matching items expressed in XML as
match="map[array[@key='tags']/string='ice']/number[@key='price']/text()"
which couldn't be done in JSON because of the inability to match based on ancestor context. With the new transient properties we can match this as
match="type(xs:integer)[¶key = 'price'][¶parent?tags?* = 'ice']"
In the second use case (hierarchic inversion), we can again get properties of parent or ancestor maps
$students ! map:put("course", ¶parent?name)
I think we can also use this to define deep update operations. But I'll leave that investigation until later.
Note: transient properties potentially have many other applications, for example we might use them to solve our problems with
document-uri()
. But exploring that would be a distraction here. The nice thing about transient properties is that they give a lot of potential for augmenting existing functionality with full backwards compatibility, because we can define existing operations to return results with additional transient properties that all existing operations will ignore. If we were so minded, for example, we could have different functions/operators return "quiet NaN" and "signalling NaN" by adding a transient property to the NaN value returned.