qt4cg / qtspecs

QT4 specifications
https://qt4cg.org/

Support sequence, array, and map destructuring declarations #37

Open rhdunn opened 3 years ago

rhdunn commented 3 years ago

Given a function that returns a sequence, array, or map of a fixed length or structure, it would be useful to extract those values in a destructuring declaration, as can be done in other languages (such as JavaScript, Kotlin, C++, and Python). For example:

let $(sin, cos) := sincos(math:pi()) (: sequence :)
let $[x, y, z] := camera-angle() (: array :)
let ${r, i} := complex(1, 2) (: map :)

These would be equivalent to:

let $ret := sincos(math:pi()), $sin := $ret[1], $cos := $ret[2] (: sequence :)
let $ret := camera-angle(), $x := $ret?(1), $y := $ret?(2), $z := $ret?(3) (: array :)
let $ret := complex(1, 2), $r := $ret?r, $i := $ret?i (: map :)

It should be possible to define the type of a component and/or the whole construct:

let $(sin as xs:float, cos) as xs:float* := sincos(math:pi()) (: sequence :)

For maps, it would also be useful to rename the components, such as:

let ${re := r, im as xs:double := i} := complex(1, 2) (: map :)
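
For comparison, here is a minimal sketch of what the renaming form might expand to in plain XQuery 3.1. The map literal is only a hypothetical stand-in for the result of complex(1, 2), and the expansion itself is an assumption about the intended semantics, not settled syntax:

```xquery
xquery version "3.1";

(: Hypothetical stand-in for the result of complex(1, 2). :)
let $ret := map { 'r': 1.0e0, 'i': 2.0e0 }
(: re is bound from key r; im is bound from key i and type-checked. :)
let $re := $ret?r,
    $im as xs:double := $ret?i
return ($re, $im)
```

The type declaration on $im is checked by ordinary SequenceType matching, so the value under key i has to be an xs:double already.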

It should also be possible to capture any left-over items in the sequence/array/map, for example:

let $(headings, rows) := load-csv("test.csv")

A destructuring declaration should be usable anywhere a variable binding can be defined.

It should not be an error to use the same variable name twice. This supports conventions such as using _ for unused values. For example:

let $[_, y, _] := camera-angle()

ChristianGruen commented 3 years ago

I like the approach. Deconstruction is a powerful feature in other programming languages, and I see many use cases for it in XQuery.

If a sequence is deconstructed, I would indeed suggest assigning all remaining items to the last variable:

let $(a, b, c) := 1 to 4
return $c  (: 3, 4 :)
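
A sketch of the same binding written out in XQuery 3.1, assuming the last variable collects the remainder via subsequence():

```xquery
xquery version "3.1";

let $seq := 1 to 4
let $a := $seq[1],
    $b := $seq[2],
    (: the last variable takes everything from position 3 onwards :)
    $c := subsequence($seq, 3)
return $c  (: 3, 4 :)
```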

@rhdunn I assume you also had this in mind in the CSV example?

rhdunn commented 3 years ago

Yes, that's what I had in mind with the CSV example. It should also apply to arrays and maps -- if there is one value left, assign that value to the last variable; otherwise assign an array/map with the remaining values to the variable.

ChristianGruen commented 3 years ago

+1 for arrays. Maybe it’s not really required for maps (and I’m not sure what that’s supposed to look like?). I would guess that unreferenced keys won’t be relevant to the user most of the time. An example:

let $inc-x := function($coords) {
  map:put($coords, 'x', $coords?x + 1)
}
let $coords := map { 'x': 1, 'y': 3, 'z': 0 }
let ${x, y} := $inc-x($coords)
return $x + $y

rhdunn commented 3 years ago

You are right, it would make more sense to ignore unreferenced/unnamed keys for maps.

michaelhkay commented 3 years ago

I have to confess I've never really seen the attraction of this feature in other languages, it's one that I tend not to use. That probably inclines me to the view that it's not needed.

I can see the attraction with maps/records - with the introduction of record types we're starting to introduce the idea that the names of the entries in a map might be known statically and that we can take advantage of this knowledge to make programs more readable and more statically checkable, and there's some attraction in being able to write

let ${x, y} := coordinates($rectangle)

in preference to

let $r := coordinates($rectangle), $x := $r?x, $y := $r?y

In XSLT I guess it could be <xsl:variable name="x, y" select="coordinates($rectangle)"/>

For sequences and arrays I'm less convinced, because I don't particularly want to encourage these data types to be used in a way where the semantics are non-uniform, e.g. where items 1 and 2 in a sequence/array mean different things and are used in different ways. I think the right structure for that kind of job is a map. I would always use a map, for example, for coordinates in three-dimensional space, not a sequence or an array.

rhdunn commented 3 years ago

The benefit of this feature to me is that it makes the intent clear (i.e. I'm only interested in these components), gives meaningful names to things, and takes less code. If the item is being referred to in multiple places, I can avoid repeating the code.

Regarding arrays and sequences, one key use case would be the CSV example above where you want to filter out the headers (first row of the CSV) from the rest of the CSV. There may be other cases where this is useful -- especially for existing functions (which is a stated goal with the variadic argument work) such as MarkLogic's xdmp:tidy which returns the errors/warnings in the first item and the tidied result in the second. For example:

let $(messages, result) := xdmp:tidy($doc)
return if ($messages/tidy:error) then fn:error() else $result

I don't think the language should be constraining a feature to any one type because there is no perceived use for that, or because it is generally discouraged. It creates gaps in the language, resulting in people asking why does one feature support sequences and arrays, another maps only, and a third all three, when there is no technical reason preventing such usage.

michaelhkay commented 3 years ago

It feels to me like bad data modelling to use the first item in a sequence for headers, and subsequent items for data. I don't think we should be providing features to make it easier to manipulate data that's badly designed. Just because it was designed that way before maps/tuples were available doesn't really change that argument.

ChristianGruen commented 3 years ago

One popular use case for using deconstruction is the access to the head and tail of a sequence, which potentially allows you to save one extra variable binding:

(: old :)
let $data := ...
let $head := head($data)
let $tail := tail($data)

(: new :)
let $(head, tail) := ...

But I share your concerns: Deconstruction is also misused as quick’n’dirty replacement for cleaner data structures.

dnovatchev commented 1 year ago

@rhdunn ,

When I proposed the lazy hint, @michaelhkay asked whether that proposal also covered so-called incremental evaluation (for example, evaluating only the head() but not the tail() of a list).

At that time I didn't remember the exact syntax of destructuring that you proposed, so I ventured with:

$(x -> head(), lazy x -> tail())

Now I am re-reading your proposal, and it seems that it will be more like:

let $h :=(head($seq), _ tail($seq))

Or even:

let $h :=(head($seq), lazy _ tail($seq))

Could you please advise us on what you consider the best/recommended destructuring syntax for indicating partial/incremental evaluation?

liamquin commented 1 year ago

let ($a, $b, $c) := matches($input, '(\d+), and (\d+) gives (\d+);')
return if $a + $b eq $c then $c else NaN

dnovatchev commented 1 year ago

A destructuring declaration should be usable anywhere a variable binding can be defined.

It should not be an error to use the same variable name twice. This supports conventions such as using _ for unused values. For example:

let $[_, y, _] := camera-angle()

@rhdunn What would be the syntax for destructuring an array or a sequence into head/tail where we need only the head?

I suppose something like:

let $[h, _] := $myArray

or perhaps:

let $[h, _] := $myArray ! (head(), tail())

I support the use of the _ (underscore) character as a wildcard symbol.

Furthermore, in the spirit of the lazy keyword, I propose that any such wildcard symbol is by definition lazy, thus we don't have to write:

let $[h, lazy _] := $myArray ! (head(), tail())

because by definition it has exactly the same meaning if the lazy keyword is omitted.

ChristianGruen commented 1 year ago

Furthermore, in the spirit of the https://github.com/qt4cg/qtspecs/issues/299 , I propose that any such wildcard symbol is by definition lazy, thus we don't have to write:

Hi @dnovatchev, personally, I think it’s misleading to call an operation lazy if it will never happen. Think of e.g. if(A) then B else C: C is not lazy either, it’s just a branch that is not evaluated.

Next, I believe it’s usually not the assignment of the result that’s expensive, but the evaluation of the expression itself. There are many cases, such as sorting, that require the full value to be computed before an assignment can take place:

let $(head, tail) := sort($data)

If the tail of a value is irrelevant, I would usually expect the optimizer to take care of the advanced work. The following let clause…

let $(head, _) := EXPENSIVE
return $head

…could easily be rewritten by the compiler to:

let $head := head(EXPENSIVE)

On the other hand, this might be done by the user anyway, so I’m not sure if we need a placeholder for unused values? Maybe we could define two or three simple use cases for that possible requirement?

rhdunn commented 1 year ago

A destructuring declaration should be usable anywhere a variable binding can be defined. It should not be an error to use the same variable name twice. This supports conventions such as using _ for unused values. For example:

let $[_, y, _] := camera-angle()

@rhdunn What would be the syntax for destructuring an array or a sequence into head/tail where we need only the head?

I suppose something like:

let $[h, _] := $myArray

or perhaps:

let $[h, _] := $myArray ! (head(), tail())

I support the use of the _ (underscore) character as a wildcard symbol.

_ is not specifically a wildcard symbol as $_ is a valid variable declaration. The use of $_ is a common convention (esp. in MarkLogic XQuery) for "I don't care about the result of this expression, I just want to call the function/perform something that I don't care about the return value of".

There is a bit of bike-shedding around the syntax:

  1. how the sequences/arrays/maps are specified -- e.g. let $[v1, v2, v3] vs let [$v1, $v2, $v3]. I'm now inclined to go with the latter, as the former can be confusing.
  2. how "the rest of the sequence/array/map" is specified -- i.e. if the last item collates the remaining entries, or if you need to declare that (e.g. via something like $rest ...). I'm currently inclined to keep the former, as it is easier w.r.t. error handling; however, that opens it up to confusion, as let [$a] will select the entire array and not match a single item in the array.

dnovatchev commented 1 year ago

On the other hand, this might be done by the user anyway, so I’m not sure if we need a placeholder for unused values? Maybe we could define 2, 3 simple use cases for that possible requirement?

@ChristianGruen Seems we just need to have a look at how destructuring is defined in other languages. I believe at least some (if not all) of them have a wild-card mechanism.

ChristianGruen commented 1 year ago

Seems we just need to have a look at how destructuring is defined in other languages. I believe at least some (if not all) of them have a wild-card mechanism.

That would certainly be helpful. Which languages do you know that provide placeholders for unused values?

rhdunn commented 1 year ago

Seems we just need to have a look at how destructuring is defined in other languages. I believe at least some (if not all) of them have a wild-card mechanism.

That would certainly be helpful. Which languages do you know that provide placeholders for unused values?

From Kotlin: https://kotlinlang.org/docs/destructuring-declarations.html#underscore-for-unused-variables.

dnovatchev commented 1 year ago

Seems we just need to have a look at how destructuring is defined in other languages. I believe at least some (if not all) of them have a wild-card mechanism.

That would certainly be helpful. Which languages do you know that provide placeholders for unused values?

From Kotlin: https://kotlinlang.org/docs/destructuring-declarations.html#underscore-for-unused-variables.

For C#:

Tuple elements with discards

User-defined type with discards

michaelhkay commented 1 year ago

So let's try to summarise what's being proposed.

  1. For sequences (that is, anything):

let $(a, b, c) := $sequence

is equivalent to let $a := $sequence[1], $b := $sequence[2], $c := subsequence($sequence, 3)

and using "_" as a variable name causes no binding to take place for that position.

  2. For arrays:

let $[a, b, c] := $array

is equivalent to let $a := $array?1, $b := $array?2, $c := array:subarray($array, 3)

and using "_" as a variable name causes no binding to take place for that position.

(or do we want to do the equivalent without bound-checking?)

Note that a and b are of type item() while c is of type array(item()). But that doesn't seem to work for the case where you know the array has length 3 and you want to bind $a, $b, and $c to its three members.

  3. For maps:

let ${x, y} := $map

is equivalent to let $x := $map?x, $y := $map?y

and in this case x and y must be NCNames (because using QNames gets too complicated).

Have I got that right?
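
To make the array and map cases concrete, here is a sketch in XQuery 3.1 of the expansions as summarised above. The sample data is invented for illustration, and the namespace declaration is included only so the sketch does not rely on a predeclared array prefix:

```xquery
xquery version "3.1";

declare namespace array = "http://www.w3.org/2005/xpath-functions/array";

(: Invented sample data, not taken from any real function. :)
let $array := [ 1, (2, 3), 4, 5 ]
let $map := map { 'x': 10, 'y': 20, 'z': 30 }
(: Array case: positional lookups, then the remaining members as an array. :)
let $a := $array?1,
    $b := $array?2,
    $c := array:subarray($array, 3)
(: Map case: lookups by name; the unreferenced key z is simply ignored. :)
let $x := $map?x,
    $y := $map?y
return ($a, $b, array:size($c), $x + $y)
```

With this expansion the positional lookups are bound-checked: $array?2 raises err:FOAY0001 if the array has fewer than two members.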

ndw commented 1 year ago

I often construct maps with QName keys. This is based on the practice introduced, I think, in XSLT many years ago, that values in no namespace are reserved for the specification and any extension values have to be in a namespace. That's used throughout XProc, for example, where maps are used for things like output options.

I've used destructuring in other languages, like Scala. I confess I don't recall ever writing an XPath expression and thinking "gee, I wish I could use destructuring here" but that's as likely because I've learned XPath without it as anything else. I'm not opposed to adding destructuring.

But I think I am opposed to adding it if it's limited in ad hoc ways. If I can destructure a map with string keys, I should be able to destructure a map with QName keys. Or perhaps destructuring simply shouldn't apply to maps.

ChristianGruen commented 1 year ago

and using "_" as a variable name causes no binding to take place for that position.

It feels restrictive to me to reserve a string that is a legal variable name. What about using the dash character -?

let $[-, y, -] := camera-angle()

(or do we want to do the equivalent without bound-checking?)

+1 for bound checks.

Note that a and b are of type item() while c is of type array(item()). But that doesn't seem to work for the case where you know the array has length 3 and you want to bind $a, $b, and $c to its three members.

I think we should avoid using different item types for the variables (the cardinality may be different, though). We could wrap all members into arrays:

let $[x, y] := [ 1, (2, 3), 4 ]
return (
  $x  (: [ 1 ] :),
  $y  (: [ 2, 3 ], [ 4 ] :)
)

This way, we could write:

(: moves the first member of an array to the end :)
let $[x, y] := array { ... }
return array:join(($y, $x))

Array entries could be another option (#314). I hope my interpretation of the syntax is correct?

let $[x, y] := [ 1, (2, 3), 4 ]
return (
  $x  (: map { 'value': 1 } :),
  $y  (: map { 'value': (2, 3) }, map { 'value': 4 } :)
)

As sequences and arrays are handled differently, and as we have bound checks, we could be strict and raise an error if the number of variables and members differs at runtime:

let $[x, y] := [ 1, 2, 3 ]  (: error :)

We could also treat maps and arrays similarly and ignore all members for which no variables are declared:

let $[x, y] := [ 1, 2, 3, 4, 5 ]
return $y  (: 2 :)

rhdunn commented 1 year ago

@ChristianGruen Note that there is nothing special about the use of _ in my examples. You could just as easily write:

let $[unused, y, unused] := camera-angle()

I used _ as that is a common variable name for a value that is not used. It is up to the processor to determine that the variable is not used and optimize as needed.

I'm happy to come up with an unused specifier if you (or other processor implementors) think that the current proposal does not give you enough information to elide the expansion. -- I think something in this regard would make sense as the processor then does not need to emit a corresponding let expression, figure out that it is not used, and then remove the let expression it just added.

michaelhkay commented 1 year ago

I think the problem with that is that it requires a complicated rule saying you can only include the same variable name twice in a declaration if the variable is never used. Without that rule, we'd be allowing you to bind the same variable to two different things. Alternatively, I guess we could just say "last one wins", as it does when you say "let $x := 3, $x := 4".

rhdunn commented 1 year ago

I'm advocating a "last one wins" approach -- i.e. this proposal is just syntactic sugar for a multi-let assignment to parts of a sequence/array/map.
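
As a sketch of what "last one wins" means for the camera-angle() example in XQuery 3.1 today (the array literal is a hypothetical stand-in for the function result):

```xquery
xquery version "3.1";

(: Hypothetical stand-in for the result of camera-angle(). :)
let $ret := [ 10, 20, 30 ]
let $_ := $ret?1,
    $y := $ret?2,
    (: this second binding of $_ shadows the first :)
    $_ := $ret?3
return ($y, $_)  (: 20, 30 :)
```

Rebinding $_ within one let clause is already legal, so no extra rule about unused variables would be needed under this reading.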

rhdunn commented 7 months ago

An alternate syntax discussed in #31 is:

let ($sin, $cos) := sincos(math:pi()) (: sequence :)
let [$x, $y, $z] := camera-angle() (: array :)
let {$r, $i} := complex(1, 2) (: map :)

and the equivalent syntax for the for clause.

This would require adding for and let to the reserved function names to avoid let(...) and for(...) being interpreted as functions.