qt4cg / qtspecs

QT4 specifications
https://qt4cg.org/
Other
28 stars 15 forks source link

Transient properties: a new approach to deep selection and update in maps and arrays #334

Closed michaelhkay closed 8 months ago

michaelhkay commented 1 year ago

After exploring many alternatives, I have come to the conclusion that we can't solve the problem of deep navigation and transformation of JSON structures without a data model change.

Most of the problems boil down to this: JSON trees do not have parent pointers, therefore after navigating down to a leaf node of the tree, we cannot get any information from higher up the tree. The solution to this (the "zipper" model) is to retain transient information about how a particular node in the tree was reached, so that we can retrace our steps and revisit nodes that were passed en route.

The change I propose is quite minor, but powerful: Any XDM value can be augmented with a set of transient properties represented as a set of key-value pairs. These properties are ignored (and typically dropped) by all operations on a value, except where otherwise specified. For the purpose of exposition, I'll use the syntax $value¶name to refer to the transient name property of $value.

We'll change the semantics of map:get() and array:get(), and the associated lookup operators, so that the resulting values have transient properties indicating how they were selected. For example, given

let $name := $person?firstName

the resulting value (perhaps the string "Michael") will be augmented with transient properties

and derived properties:

We can also define other "downward selection" operations such as map:find, and array:foot to retain these transient properties. So for example map:find($json, 'firstname')[.='Michael']¶parent?surname now finds the surnames of anyone named 'Michael', at any depth of the tree.

If we turn back to the use cases in my 2016 paper on transforming JSON

https://www.saxonica.com/papers/xmlprague-2016mhk.pdf

The first use case (bulk update) relied on matching items expressed in XML as

match="map[array[@key='tags']/string='ice']/number[@key='price']/text()"

which couldn't be done in JSON because of the inability to match based on ancestor context. With the new transient properties we can match this as

match="type(xs:integer)[¶key = 'price'][¶parent?tags?* = 'ice']"

In the second use case (hierarchic inversion), we can again get properties of parent or ancestor maps

$students ! map:put("course", ¶parent?name)

I think we can also use this to define deep update operations. But I'll leave that investigation until later.

Note: transient properties potentially have many other applications, for example we might use them to solve our problems with document-uri(). But exploring that would be a distraction here. The nice thing about transient properties is that they give a lot of potential for augmenting existing functionality with full backwards compatibility, because we can define existing operations to return results with additional transient properties that all existing operations will ignore. If we were so minded, for example, we could have different functions/operators return "quiet NaN" and "signalling NaN" by adding a transient property to the NaN value returned.

rhdunn commented 1 year ago

This is an interesting proposal. A few questions/observations:

  1. I assume we would want to extend the axis operators to take advantage of these transient properties.
  2. If we are doing (1) then we should update the data model accessors (dm:parent, etc.) to refer to transient properties for the types that support them.
  3. We would need to define the semantics of the transient properties such that if you add a map/array/etc. to two or more places that each of those have their own set of transient properties.
  4. Likewise, we need to define the semantics (or at least call out as a possible implementation issue) so that parallel/joint operations that can both modify the transient properties do not interfere with each other.

Note 1: MarkLogic made their JSON types instances of node() so that they had defined parent, etc. accessors. Given that XPath/XQuery made arrays/maps functions then that is not possible, so this is potentially a valid workaround. -- Effectively, this provides a way to make arrays/maps/etc. used in JSON and JSON-like objects into pseudo-nodes.

Note 2: This is similar/related to the proposal/idea of making the parts of atomic types (year, day, month, authority, uri-path, etc.) available to users.

michaelhkay commented 1 year ago

@rhdunn

  1. No, I don't see any need to change the way axes operate with nodes. Nodes have persistent identity and parentage, so the transient properties aren't needed.
  2. The transient properties appear only on the result of a navigation. My thinking is that as soon as you add a value to a map or array, its transient properties are lost (that's what make them transient).
  3. What the get() function returns is effectively a copy of the relevant value, with added transient properties. The original value isn't changed. In fact, everything remains immutable. This isn't a problem because maps and arrays don't have identity. In implementation terms the way I see it working is that map:get() returns an "augmented value" which is a package containing the transient properties and a pointer to the actual value; multiple retrievals of the same value return different "augmented value" objects that point to (and delegate to) the shared underlying value.
ChristianGruen commented 1 year ago

I agree, that’s a creative and promising approach.

My immediate thoughts revolve around performance: When dealing with small or medium-sized JSON files, there will be no reason to think about performance. However, I’m aware of use cases in which millions of map & array entries are processed, and it would be significant overhead to create additional properties and attach them to each traversed item in main memory. That’s particularly relevant if singletons are used for representing common atomic items.

Some years ago, we simplified our data model for similar reasons: We stored (and eventually dropped) transient score values that resulted from XQuery Full Text queries. In some cases, it’s easy to skip the data generation if it will never be needed. In other cases, it can be pretty hard.

But I should definitely spend more time on the proposal before giving a final judgement.

michaelhkay commented 1 year ago

Yes, I can think of various way of making the behaviour optional (so you don't get the properties unless you're going to need them), but I thought I would try to explore whether we can optimize the costs away first.

In the prototype Saxon implementation (see https://www.saxonica.com/documentation12/index.html#!functions/saxon/with-pedigree) you start by marking a map or array as being "with pedigree", and the "pedigree" (=transient properties) is only maintained by lookup operations that start from such a map or array.

Arithmeticus commented 1 year ago

This proposal feels liberating, and natural: just yesterday to retain key info, I pushed copies into a new map entry within the map in each one's value. Felt very hacky. And made me wonder about performance, because of possible ballooning effects.

It may be worth thinking about a mechanism that permits selective transient properties. There will be cases with very large maps where one wants to retain ¶key but not ¶parent.

I like the look and semantics of the pilcrow, but it will be a nuisance to find it. What about @@ or @ followed by some other punctuation character?

Finally, an opportunity to construct elegant predicate filter expressions.

michaelhkay commented 1 year ago

So, I promised that part 2 of the proposal would address deep update.

Let's start with an example of what it should look like to users. The following increases the prices of all products, at any depth, by 10% (returning a new value that's the same as the original in all other respects):

update(select := ??product?price, 
             change := ->($p){$p * 1.1})

How do we make this work? I'll start with a very informal explanation, and then sketch a more formal definition.

Firstly, there's a third argument $root which I've defaulted to the context item.

The select argument selects a sequence of items; you can use any expression you like in principle, but the returned items must all have the transient property ¶root equal to $root - that means in practice (a) the expression must select only within the subtree rooted at $root, and (b) it must only use operations that set or preserve the ¶root property on their result.

The change argument is a function that is applied to each selected item; the result of the function replaces the original in the corresponding position of the result tree. Internally, of course, the tree is immutable, so this means making a copy of the parent in which the relevant children have been modified, and so on recursively until you get back to the root.

If you want to add things to the tree, or delete things, then you do that by making a change to the parent. For example, to add a product, you can do update(select:=??products, change:=array:append(?, $newProduct)).

There's a complication if the selection includes a node that is an ancestor of another selected node. There are various ways we could handle this: I would propose that it an ancestor is selected and changed, then neither its old children nor its new children are further processed.

Now, how to describe this more formally? I've glossed over a number of issues that need to be addressed. For example, I said that the ¶root property in all selected values must be equal to $root -- but what does "equal" mean here? Maps and arrays, remember, have no identity.

I think the best way to tackle this is probably to give values a transient identity for the duration of the operation. So we can sketch a formal description as follows:

  1. Make a deep copy of $root in which all "nodes" are tagged with a unique but transient ¶id property, as well as other properties ¶root, ¶parent, ¶key.
  2. Evaluate the select expression on this copy (hey, we're supposed to evaluate the arguments of a function call before doing anything else - is this cheating? Perhaps step 1 has to be done as part of step 2. Or perhaps we cheat by saying that update is a pseudo-function...). Construct the list of ¶id's of the selected values.
  3. Check that the ¶root property of every selected value is the same as the ¶id property of the (copy of) $root.
  4. Perform a depth-first tree walk. If a value is in the selected list of ¶id's, then apply the $change function and copy the result to the result tree; otherwise do a shallow-copy of the value, processing its children in the same way.
  5. On completion, make a deep copy of the result tree in which all transient properties are stripped.

Of course, an actual implementation will work differently. Many implementations will use underlying data structures (such as Java immutable maps) where (a) the nodes already have a perfectly usable ID, and (b) virtual copies can be made cheaply, reusing parts of the tree that haven't changed. But we don't need to talk about that.

michaelhkay commented 1 year ago

An alternative syntax would be the rather COBOL-like

update $root changing ??product?price to (. * 1.1)

Using custom syntax rather than a function gives us more freedom in defining the semantics, but of course there are downsides as well.

ChristianGruen commented 1 year ago

One construct that has been proven to be particularly successful in BaseX is the update keyword (https://docs.basex.org/wiki/XQuery_Update#update) for main-memory node updates:

(<a/>, <b/>) update {
  rename node . as 'x'
} update {
  insert node "y" into .  
}

(:result :)
<x>y</x>
<x>y</x>

A simpler variant (for single nodes, without chaining) had been adopted to the XQuery Update 3 Facility as transform with.

Unfortunately, due to grammar restrictions, two keywords are required for XQuery Update operations (insert node instead of insert, etc.). On the other hand, many users have already become accustomed to the syntax. A similar syntax could be used for maps and arrays:

(: existing syntax for nodes :)
document {
  <root><product><price>2</price></product></root>
} update {
  //product/price ! (replace value of node . with . * 1.1)
}

(: syntax for maps and arrays? :)
[
  map { 'product': map { 'price': 2 } },
  map { 'product': map { 'price': 3 } }
] update {
  ??product?price ! (replace value of . to . * 1.1)
}
michaelhkay commented 8 months ago

Closing this as it was essentially implemented in PR #988.