Filtering by type in lookup expressions

michaelhkay commented 1 month ago

We have dropped the syntax ??type(T) for filtering the results of lookup expressions, because of problems with syntax ambiguity. This issue seeks an alternative.

Although selection by type also makes sense with shallow lookup, it is most relevant with deep lookup. The main need arises with intermediate steps of a path such as ?? X ?? Y which gives a dynamic error if X selects something that is not a map or array. This is consistent at one level with // X // Y, except that // X can never select something that isn't a node.
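To make the failure mode concrete, here is a rough Python model (not XPath) of a two-step deep lookup. It assumes maps are modelled as `dict`, arrays as `list`, and every value is a single item. The names `deep_lookup` and `only_containers` are hypothetical helpers invented for this sketch, and the traversal is a simplification of whatever the spec finally defines.

```python
def deep_lookup(item, key):
    """Toy model of `item ?? key`: collect values stored under `key` at any depth.

    Assumption: applying a lookup to something that is neither a map (dict)
    nor an array (list) is an error, mirroring the dynamic error described above.
    """
    if not isinstance(item, (dict, list)):
        raise TypeError(f"lookup applied to non-map/array value: {item!r}")

    results = []

    def walk(node):
        if isinstance(node, dict):
            for k, v in node.items():
                if k == key:
                    results.append(v)
                walk(v)
        elif isinstance(node, list):
            for v in node:
                walk(v)
        # atomic leaves inside the tree are simply skipped

    walk(item)
    return results


def only_containers(items):
    """Toy stand-in for a type filter between the two steps: keep maps/arrays only."""
    return [i for i in items if isinstance(i, (dict, list))]


# "X" selects an atomic value (1) in one place and a map in another.
data = {"X": 1, "inner": {"X": {"Y": 2}}}
```

With this data, applying `deep_lookup(x, "Y")` to every result of `deep_lookup(data, "X")` fails on the atomic `1`, whereas filtering the intermediate results with `only_containers` first lets the second step run.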

The main problem with filtering using an [. instance of record(p, q)] predicate is that it's very long-winded. For example, if we want to select only those members of a selected array that are sequences of a particular record type, without flattening everything else, we have to write something like ?? values::* ?[. instance of record(p, q)+] ? *, which is a bit of a nightmare.

Starting from the end goal, I would like to be able to write something close to ??record(first, last) to select all the items of this record type at any depth. We know that syntax doesn't work, because ??NCName is already taken. That's also true for ??items::record(first, last), unless we change the rules for what can appear after ::.
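As a sketch of the intended effect only (again in Python rather than XPath), selecting every item of a given record type at any depth could be modelled like this. It assumes that record(first, last) matches a map with exactly the keys first and last, and that the starting item itself is not a candidate; both are assumptions of this sketch, and `descendants`, `is_record` and `deep_select_record` are hypothetical names.

```python
def descendants(item):
    """Yield every value nested inside a map (dict) or array (list), depth-first."""
    if isinstance(item, dict):
        children = item.values()
    elif isinstance(item, list):
        children = item
    else:
        return
    for child in children:
        yield child
        yield from descendants(child)


def is_record(item, fields):
    """Assumed matching rule: a map whose key set is exactly `fields`."""
    return isinstance(item, dict) and set(item) == set(fields)


def deep_select_record(item, fields):
    """Toy model of the goal: all nested items matching the record type."""
    return [d for d in descendants(item) if is_record(d, fields)]


data = {
    "people": [{"first": "A", "last": "B"}, {"first": "C"}],
    "boss": {"first": "D", "last": "E"},
}
matches = deep_select_record(data, ["first", "last"])
```

Here `matches` holds the two complete name records; the map with only a `first` key is left out, as are all atomic values.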

Also, there's another syntax hazard: what we want here is a SequenceType, not an ItemType, and that means that it can contain a trailing ? occurrence indicator, which is easily confused with the next lookup operator in a path.

Looking at it from all angles, I do feel the best solution is to prefix the record(first, last) with a marker character so that we know we've got a type filter here. Characters that might do the job include @, #, $, %, ^, ~. Of these, my preference remains ~, for three reasons:

(a) it's currently unused: overloading a different symbol is more likely to cause visual confusion;
(b) one of the traditional uses of ~ is to indicate a "matches" or "is kind of like" relationship;
(c) there's a mnemonic association between "tilde" and "type" (compare "at" and "attribute").

johnlumley commented 1 month ago

For named types would we use a construct ??~type(FOO) and for atomics ??~xs:integer or even ??~integer given xs: default?

michaelhkay commented 1 month ago

I'm working backwards from the common case of selection using a record type to the more general case (just as path expressions focus on having convenient syntax for the common cases).

But I think we could achieve something like

KeySpecifier ::= .... | "~" SequenceType

but allowing the SequenceType to be in parentheses, or perhaps requiring it to be in parentheses if there is an occurrence indicator, which would make it "~" ( ItemType | "(" SequenceType ")")
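A toy recognizer can show why the parenthesized form removes the ambiguity with a trailing ?. This is a Python sketch under strong simplifying assumptions: no whitespace handling, an ItemType is taken to be a name optionally followed by a balanced argument list, and `split_type_filter` is a hypothetical helper, not part of any real parser.

```python
def _match_paren(text, start):
    """Return the index just past the ')' matching the '(' at text[start]."""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return i + 1
    raise ValueError("unbalanced parentheses")


def split_type_filter(text):
    """Split '~...' into (type, rest) following "~" ( ItemType | "(" SequenceType ")" )."""
    if not text.startswith("~"):
        raise ValueError("type filter must start with '~'")
    if text[1:2] == "(":
        # Parenthesized SequenceType: any occurrence indicator lives inside the parens.
        end = _match_paren(text, 1)
        return text[2:end - 1], text[end:]
    # Bare ItemType: a name, then an optional balanced argument list.
    i = 1
    while i < len(text) and (text[i].isalnum() or text[i] in ":_-."):
        i += 1
    if i == 1:
        raise ValueError("expected a type name after '~'")
    if text[i:i + 1] == "(":
        i = _match_paren(text, i)
    return text[1:i], text[i:]
```

Under this rule, a ? that follows a bare item type can only be the next lookup operator, while an occurrence indicator has to be written inside the parentheses.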

ChristianGruen commented 1 month ago

Let’s assume we have XML encoded either in a document or in a “structured item” (which is what we occasionally call maps/arrays internally). Are the following two expressions comparable to some extent / would they both return the element <a/>?

let $doc := document { <a/> }
return $doc / element()

let $struct := [ <a/> ]
return $struct ?~ element()

If we wanted to try to make the syntax accessible to non-experts, would it be fair to present / and ?~ as somewhat equivalent?

michaelhkay commented 1 month ago

It would be great if we could agree on a collective term for "maps and arrays". "Structured item" feels too generic to me. I've toyed with terms like "tabulation", "tabula", "composition", "dataset", "compendium", "aggregate".

Perhaps "combo"? It's best to have a word that stands out from the crowd if we can't find one whose meaning is self-explanatory.

With "/", the RHS is always selecting nodes, and we are primarily selecting nodes by nodekind and name, occasionally by type. So we can write a/element(*, xs:integer) but we rarely need to, because element names usually provide the handle that we need. With JSON, we don't have element names, so selecting by type becomes a much more common requirement.

The syntax a/element() works only because element is reserved as a function name. We don't have the luxury of reserving any names after "?" in the same way. Logically we could think of a/element() as an abbreviation for a/~element(), where the ~ can be omitted because element is a reserved name.

johnlumley commented 1 month ago

Is there any restriction on using something like element as an ItemType name? (I can only see restrictions against using atomic type names). If there are none, then a/~element would be legal (assuming suitable declaration), but somewhat confusing!

michaelhkay commented 1 month ago

There's no restriction on using bare NCNames as atomic type names or declared item type names. It's quite legal today to do a/element(element, element).

dnovatchev commented 1 month ago

My first reaction is to use syntax like:

?? X ?? Y::map

or

?? X ?? Y[isMap(.)]

or

? X ?? maps(Y)

or

?? X ?? Y[hasKeys(.)]

Or why not:

?? X ?? map::Y

I am against introducing new, unreadable symbols into the already quite messy symbol set we are using at present.

Readability must have much higher priority in our design than introducing new, fancy (cryptic) symbols.

dnovatchev commented 1 month ago

And of course, if the proposal for Total Maps is accepted, then any constant non-map value can be represented as / coerced to a map:

map {
  '\' : ()
}      (: produces the empty sequence for any lookup :)

qt4cg / qtspecs

Filtering by type in lookup expressions #1456