qt4cg / qtspecs

QT4 specifications
https://qt4cg.org/
Other
29 stars 15 forks source link

Generators in XPath #716

Open dnovatchev opened 1 year ago

dnovatchev commented 1 year ago

What is a generator?

Generators are well known and provided out of the box in many programming languages. Per Wikipedia:

“In computer science, a generator is a routine that can be used to control the iteration behaviour of a loop. All generators are also iterators.[1] A generator is very similar to a function that returns an array, in that a generator has parameters, can be called, and generates a sequence of values. However, instead of building an array containing all the values and returning them all at once, a generator yields the values one at a time, which requires less memory and allows the caller to get started processing the first few values immediately. In short, a generator looks like a function but behaves like an iterator.”


The goal of this proposal (major use-cases)

A generator in XPath should be a tool to easily implement the solutions to the following use-cases:

  1. Processing a huge collection whose members may not all be needed.
    A generator will produce only the next member of the collection and only on demand basis.

  2. Handling a collection containing unknown or infinite number of members. When requested the next member of the collection the generator will always produce it, if the collection still contains any members. It is the responsibility of the caller to issue only the necessary number of requests for the really needed next members.

What is achieved in both cases:

A good problem that is based on these use-cases is to generate a collection of the first N members that have some wanted properties, and are generated from other collection(s), when it is not known what the size of the original input collections would be in order for the desired number of N members to be discovered.

For example: Produce the first 1 000 000 (1M) prime numbers.

Sometimes we may not even know if N such wanted members actually exist, for example: _Produce the first 2 sequences of 28 prime numbers where the primes in each of the sequences form an arithmetic progression._


The Proposal

A generator is defined as (and synonym for):

let $generator as record
                   (initialized as xs:boolean,
                    endReached as xs:boolean,
                    getCurrent as function(..) as item()*,
                    moveNext as function(..) as .. ,
                    *  ) 

A generator is an extensible record .

It has four fixed-named keys, and any other map-keys, as required to hold the internal state of that specific generator.

Here is the meaning of the four fixed/named keys:

Examples of operations on generators

The following examples are written in pseudo-code as at the time of writing there was no available implementation of records. And also, the code for recursion in pure XPath makes any such example longer than necessary for grasping its meaning.

The Empty Generator

emptyGenerator() {
             map{
                 initialized : true(),
                 endReached: true(),
                 getCurrent: function($this as map(*)) {error()},
                 moveNext: function($this as map(*)) {error()}
                }
              }

Take the first N members of the collection

take($gen as generator, $n as xs:integer) as generator
{
  let $gen := if(not($gen?initialized)) then $gen?moveNext()
                      else $gen,
      return
         if( $gen?endReached or $n eq 0) then emptyGenerator()
            else map:put($gen, "moveNext", take($gen?moveNext($gen) ), $n -1  )
}

Skip the first N members from the collection

skip($gen as generator, $n as xs:integer) as generator
{
  if($n eq 0) then $gen
     else
     {
         let $gen := if(not($gen?initialized)) then $gen?moveNext()
                       else $gen
           return
             if(not($gen?endReached) then skip($gen?moveNext(), $n -1)
               else $gen
     }             
}

Subrange of size N starting at the M-th member

subrange($gen as generator, $m as xs:integer, $n as xs:integer) as generator
{
  take(skip($gen, $m -1), $n)
}

Head of a generator

head($gen as generator)
{
  take($gen, 1)?getCurrent()
}

Tail of a generator

tail($gen as generator)
{
  skip($gen, 1)
}

At index N

at($ind as xs:integer)
{
  subrange($ind, 1)?getCurrent()
}

For Each

for-each($gen as generator, $fun as function(*))
{
   map:put($gen, "getCurrent", function() { $fun($gen?getCurrent()) }  )                              
}

For Each Pair

for-each-pair($gen1 as generator, $gen2 as generator, $fun as function(*))
{
   let $gen1 := if(not($gen1?initialized)) then $gen1?moveNext()
                  else $gen1,
       $gen2 := if(not($gen2?initialized)) then $gen2?moveNext()
                  else $gen2,
    return
      if($gen1?endReached or $gen2?endReached) then map:put($gen1, "endReached", true())
        else map:put(map:put($gen1, "getCurrent", function() { $fun($gen1?getCurrent(), $gen2?getCurrent()) } ) ,
                     "moveNext", function() { for-each-pair(skip($gen1, 1), skip($gen2, 1), $fun)}
                    )                             
}

Filter

filter($gen as generator, $pred as function(item()*) as xs:boolean)
{
     let $getNextGoodValue := function($gen as map(*), $pred as function(item()*) as xs:boolean)
         {
            let $mapResult := iterate-while(
                                            $gen,
                                            function($gen) { not($pred($gen?getCurrent($gen))) },
                                            function($gen) { $gen?moveNext($gen) }
                                            )   
            return $mapResult?getCurrent($mapResult)                     
         },
       $gen := if($gen?initialized) then $gen 
                      else $gen?moveNext($gen)
        return
          map {
               "initialized": true(),
               "endReached":  $gen?endReached,
               "getCurrent": function($this as map(*)) { $getNextGoodValue($this?inputGen, $pred) },
               "moveNext":   function($this as map(*))
                             {    let $nextGoodValue := $getNextGoodValue($this?inputGen?moveNext($this?inputGen), $pred),
                                      $nextGen := iterate-while(
                                                                $this?inputGen?moveNext($this?inputGen),
                                                                function($gen) { not($pred($gen?getCurrent($gen))) },
                                                                function($gen) { $gen?moveNext($gen) }
                                                                )
                                    return
                                      map {
                                            "initialized": $nextGen?initialized,
                                            "endReached":  $nextGen?endReached,
                                            "getCurrent" : function($x) {$nextGoodValue},
                                            "moveNext" :   $this?moveNext,
                                            "inputGen" :   $nextGen
                                           }
                             },
               "inputGen" : $gen
              }
        }

Here are some other useful functions on generators -- with just their signature and summary:


These and many other useful functions on generators can and should be added to every generator upon construction.

Thus, it would be good to have an explicit constructor function for a generator:

     construct-generator($record as 
                               record( initialized as xs:boolean,
                                       endReached as xs:boolean,
                                       getCurrent as function(..) as item()*,
                                       moveNext as function(..) as .. ,
                                      )
                         ) as generator
michaelhkay commented 1 year ago

Thanks, that looks like a very coherent proposal, and it's easy to understand as a generalisation of the design pattern used by fn:random-number-generator.

Looking at it again, though, I wonder what we're actually proposing adding to the spec? Is it a set of functions like skip() and take() plus a definition of the record type generator that they operate on, plus some kind of user guide on how to implement your own generators?

In my own experience I've found that the easiest way to use fn:random-number-generator() is with xsl:iterate, though presumable fn:iterate-while() will also work well. Examples that show the interaction of these capabilities would be useful.

It would be good to see some examples that show how the concept can be applied to processing of XML and JSON. Perhaps there's some kind of possibility of a function like stream(uri-of-xml-document, simple-pattern) that delivers a generator whose delivered items are subtrees of the XML document that satisfy the simple-pattern, where the simple-pattern is some kind of restricted predicate that can test nodes against local conditions like their name, attributes, and ancestry.

michaelhkay commented 1 year ago

Note also issue #295, which points out the need for extensions to the record-type syntax to allow record types of the kind you are using, with self-reference in function arguments and result.

michaelhkay commented 1 year ago

I guess two more useful functions on generators might be for-each() and filter() that return new generators.

It would seem appealing to make the functions skip(), take(), for-each(), filter() etc be accessible as fields of the generator itself, so we can write $my-generator?take(10). But since the generator is user-written, that requires some kind of inheritance mechanism so someone implementing a generator gets these methods "for free". Perhaps this is a bridge too far... If we provide free-standing functions like generator:take($generator, $count) then it's easy enough for the user implementing a generator to wrap these as methods on the generator if they choose.

Another thought is whether and how one might allow the internal state of a generator to be private.

dnovatchev commented 1 year ago

Thanks, that looks like a very coherent proposal, and it's easy to understand as a generalisation of the design pattern used by fn:random-number-generator.

Looking at it again, though, I wonder what we're actually proposing adding to the spec? Is it a set of functions like skip() and take() plus a definition of the record type generator that they operate on, plus some kind of user guide on how to implement your own generators?

This is the minimum we can do.

We could also think of extending the definition of some of the standard functions that operate on sequences and on arrays, to accept as first argument anything that implements head and tail. If this can be done, then we will not need to add new overloads for these functions at all.

In my own experience I've found that the easiest way to use fn:random-number-generator() is with xsl:iterate, though presumable fn:iterate-while() will also work well. Examples that show the interaction of these capabilities would be useful.

Exactly. A generator goes well as argument to any function that has an accumulator-like argument that will contain the latest generator, produced in the latest iteration (or latest recursive call), such as folds, _iterate-XXX_, _while-XXX_

And generally being passed as one of the arguments of a call to a recursive function, as showed in the examples in the proposal.

It would be good to see some examples that show how the concept can be applied to processing of XML and JSON. Perhaps there's some kind of possibility of a function like stream(uri-of-xml-document, simple-pattern) that delivers a generator whose delivered items are subtrees of the XML document that satisfy the simple-pattern, where the simple-pattern is some kind of restricted predicate that can test nodes against local conditions like their name, attributes, and ancestry.

Yes, I have a few examples in mind, but based on the feedback (including yours) from my previous/other issues, I wanted to keep this one as simple and short as possible.

Do note that a generator can encapsulate any kind of collection - not only a sequence or an array, but also maps, trees, XML documents, infinite or of unknown size collections, ... , you name it. The same goes to the traversal strategy that a generator can use Breadth-First, Dept-First, left-to-right, right-to-left, oscillating, ... , again you name it.

dnovatchev commented 1 year ago

I guess two more useful functions on generators might be for-each() and filter() that return new generators.

It would seem appealing to make the functions skip(), take(), for-each(), filter() etc be accessible as fields of the generator itself, so we can write $my-generator?take(10). But since the generator is user-written, that requires some kind of inheritance mechanism so someone implementing a generator gets these methods "for free". Perhaps this is a bridge too far... If we provide free-standing functions like generator:take($generator, $count) then it's easy enough for the user implementing a generator to wrap these as methods on the generator if they choose.

Yes. These can easily be implemented as standard (universally available) decorators (or just a single decorator adding all of these functions to the result-generator) on a generator, thus the author of a generator doesn't have to implement them at all.

Such a decorator takes a generator and returns another generator that is the result of adding to the original one the missing functions.

Another thought is whether and how one might allow the internal state of a generator to be private.

Right now there is no concept of private in XPath.

dnovatchev commented 1 year ago

I guess two more useful functions on generators might be for-each() and filter() that return new generators.

Here are the two additional requested functions on generators: for-each and filter, plus two additional: for-each-pair and at.

I am also placing them into the main proposal:

At index N

at($ind as xs:integer)
{
  subrange($ind, 1)
}

For Each

for-each($gen as generator, $fun as function(*))
{
   map:put($gen, "getCurrent", function() { $fun($gen?getCurrent()) }  )                              
}

For Each Pair

for-each-pair($gen1 as generator, $gen2 as generator, $fun as function(*))
{
   let $gen1 := if(not($gen1?initialized)) then $gen1?moveNext()
                  else $gen1,
       $gen2 := if(not($gen2?initialized)) then $gen2?moveNext()
                  else $gen2,
    return
      if($gen1?endReached or $gen2?endReached) then map:put($gen1, "endReached", true())
        else map:put(map:put($gen1, "getCurrent", function() { $fun($gen1?getCurrent(), $gen2?getCurrent()) } ) ,
                     "moveNext", function() { for-each-pair(skip($gen1, 1), skip($gen2, 1), $fun)}
                    )                             
}

Filter

filter($gen as generator, $pred as function(item()*) as xs:boolean)
{
  let $gen := if(not($gen?initialized)) then $gen?moveNext()
                else $gen,
      $moveNext := function() { filter($gen?moveNext(), $pred) }
      return if($gen?endReached or $pred($gen?getCurrent())) 
               then map:put($gen, "moveNext", $moveNext)
               else filter($gen?moveNext(), $pred)
}
dnovatchev commented 1 year ago

Update:

Here is a more precise, non-(explicitly)recursive code for filter acting on a generator.

Do note:

Filter

filter($gen as generator, $pred as function(item()*) as xs:boolean)
{
     let $getNextGoodValue := function($gen as map(*), $pred as function(item()*) as xs:boolean)
         {
            let $mapResult := iterate-while(
                                            $gen,
                                            function($gen) { not($pred($gen?getCurrent($gen))) },
                                            function($gen) { $gen?moveNext($gen) }
                                            )   
            return $mapResult?getCurrent($mapResult)                     
         },
       $gen := if($gen?initialized) then $gen 
                      else $gen?moveNext($gen)
        return
          map {
               "initialized": true(),
               "endReached":  $gen?endReached,
               "getCurrent": function($this as map(*)) { $getNextGoodValue($this?inputGen, $pred) },
               "moveNext":   function($this as map(*))
                             {    let $nextGoodValue := $getNextGoodValue($this?inputGen?moveNext($this?inputGen), $pred),
                                      $nextGen := iterate-while(
                                                                $this?inputGen?moveNext($this?inputGen),
                                                                function($gen) { not($pred($gen?getCurrent($gen))) },
                                                                function($gen) { $gen?moveNext($gen) }
                                                                )
                                    return
                                      map {
                                            "initialized": $nextGen?initialized,
                                            "endReached":  $nextGen?endReached,
                                            "getCurrent" : function($x) {$nextGoodValue},
                                            "moveNext" :   $this?moveNext,
                                            "inputGen" :   $nextGen
                                           }
                             },
               "inputGen" : $gen
              }
        }

Below is a complete XPath 4.0 expression which, besides the above definitions, produces a specific filtered generator, whose input is the generator of integers from 2 to Infinity, and the predicate "approves" any odd number. The first 999 members of this generator are materialized as the result of this expression:

let $gen2ToInf :=
       map{
            "initialized": true(),
            "endReached": false(),
            "getCurrent": function($this as map(*)) { $this?last + 1},
            "moveNext":  function($this as map(*))
                         {
                           if(not($this?initialized)) then map:put($this, "initialized", true())
                             else map:put($this, "last", $this?last +1)
                         },
            "last" : 1 
           },           
   $getNextGoodValue := function($gen as map(*), $pred as function(item()*) as xs:boolean)
                        {
                            let $mapResult := iterate-while(
                                                            $gen,
                                                            function($gen) { not($pred($gen?getCurrent($gen))) },
                                                            function($gen) { $gen?moveNext($gen) }
                                                            )   
                            return $mapResult?getCurrent($mapResult)                     
                        },           
   $filter := function($gen as map(*), $pred as function(item()*) as xs:boolean)
              {
                     let $gen := if($gen?initialized) then $gen 
                                    else $gen?moveNext($gen)
                      return
                        map {
                             "initialized": true(),
                             "endReached":  $gen?endReached,
                             "getCurrent": function($this as map(*)) { $getNextGoodValue($this?inputGen, $pred) },
                             "moveNext":   function($this as map(*))
                           {    let $nextGoodValue := $getNextGoodValue($this?inputGen?moveNext($this?inputGen), $pred),
                                    $nextGen := iterate-while(
                                                              $this?inputGen?moveNext($this?inputGen),
                                                              function($gen) { not($pred($gen?getCurrent($gen))) },
                                                              function($gen) { $gen?moveNext($gen) }
                                                              )
                                  return
                                    map {
                                          "initialized": $nextGen?initialized,
                                          "endReached":  $nextGen?endReached,
                                          "getCurrent" : function($x) {$nextGoodValue},
                                          "moveNext" :   $this?moveNext,
                                          "inputGen" :   $nextGen
                                         }
                           },
                             "inputGen" : $gen
                            }
                      }
   return
      fold-left(1 to 999, [$filter($gen2ToInf, fn($x) {$x mod 2 eq 1} ), []], 
                fn($accum, $mem) 
                {
                  let $nextGen := $accum(1)?moveNext($accum(1))
                   return  [$nextGen, array:append($accum(2), $nextGen?getCurrent($nextGen))]
                }
               )
               (2)

Anyone can execute this code with BaseX ver. 10.8 and produce the result:

image

dnovatchev commented 1 year ago

This is an open question to the knowledgeable reader:

Can we extend the type of arguments of any currently existing standard XPath functions (and operators) that are (the argument types) sequence, array or map, to a more generalized type that is the union type of these types and the generator type?

This will result in immediately having these standard functions become much more powerful and encompassing, without impacting backwards compatibility in any existing pre- XPath 4.0 code, and no need for additional overloads.

michaelhkay commented 1 year ago

Can we extend...

Not if the argument type is sequence, because there is no type that i more general than sequence, so the union of sequence with any other type is sequence.

When we introduced arrays, we spent at least a year on trying to devise a solution in which sequences and arrays were subtypes of something more general. Sadly, it proved too difficult.

michaelhkay commented 1 year ago

One way forward might be to use annotations as a sort of interface/prototype mechanism. We could say, for example, that if $X is annotated %prototype=map{"count" : array:size#1, "head" : array:head#1, ...) then '%count($X)returnsarray:size($X)and%head($X)returnsarray:head($X)`. But working out all the details would be challenging, and the final result might be quite confusing to users.

dnovatchev commented 1 year ago

One way forward might be to use annotations as a sort of interface/prototype mechanism. We could say, for example, that if $X is annotated %prototype=map{"count" : array:size#1, "head" : array:head#1, ...) then '%count($X)returnsarray:size($X)and%head($X)returnsarray:head($X)`. But working out all the details would be challenging, and the final result might be quite confusing to users.

I find annotations rather messy.

A better and well-tested in practice approach is to say that everything is an object. An object has a value and some other properties, like collection-type whose values specify what type of collection the object?value is, and this can be one of : map, array, generator, sequence .

This is in full accordance with https://github.com/qt4cg/qtspecs/issues/105 following which every value can be represented as a map that has no regular keys and only a default key whose value is the value. We can add a special key named object-properties to this map, which could be something that is not allowed as a regular key until XPath 4, and have all object properties (among them collection-type ) as properties of the map that is the value of the key object-properties.

Any object has a set of standard properties, and can have its own-properties.

Any value (item(*)) can give us its containing object as the result of evaluating a new function:

object($value as item()*)

I know this sounds rather complicated.

For the purpose of the current discussion, we could have a simpler solution, where we have type-hints such as:

no-sequence(ExprProducingSingleItem)

telling the processor that the value of ExprProducingSingleItem must not be regarded as a sequence.

In this way, legacy code that isn't aware of the type-hints will continue to treat ExprProducingSingleItem generally as a sequence of one item.

But if an argument is passed with the no-sequence hint, its collection-type would be one of the remaining known collection-types (array, map, generator)

dnovatchev commented 1 year ago

For the purpose of the current discussion, we could have a simpler solution, where we have type-hints such as:

no-sequence(ExprProducingSingleItem)

telling the processor that the value of ExprProducingSingleItem must not be regarded as a sequence.

In this way, legacy code that isn't aware of the type-hints will continue to treat ExprProducingSingleItem generally as a sequence of one item.

But if an argument is passed with the no-sequence hint, its collection-type would be one of the remaining known collection-types (array, map, generator)

We could simply use as a hint surrounding an expression (that evaluates to a single, non-sequence-collectionItem (array, map or generator) by back-slashes:

\$myGenerator\

This means:

Dear processor, use $myGenerator not as a sequence of one item, but as the generator of the input collection on which we want to apply the function.

Thus,

tail(\[1, 2, 3]\)

will produce:

[2, 3]

and will not produce ()

In this way we can have and use only one single set of collection-processing functions, regardless of the type of the collection - be it a sequence, an array, a map or a generator.

michaelhkay commented 1 year ago

It seems that the proposed backslash-notation is not defining \[1, 2, 3]\ as an expression, but rather as a modifier of the syntax of a static function call which somehow changes the semantics of the function call, in effect causing the function name to bind to a different function with a different signature? This all seems very lacking in orthogonality.

If annotations are messy, this is downright disheveled.

dnovatchev commented 1 year ago

It seems that the proposed backslash-notation is not defining \[1, 2, 3]\ as an expression, but rather as a modifier of the syntax of a static function call which somehow changes the semantics of the function call, in effect causing the function name to bind to a different function with a different signature? This all seems very lacking in orthogonality.

If annotations are messy, this is downright disheveled.

No, not different functions that have different signature but just one function.

Exactly the opposite: this is a single function that accepts a generator. If using \ is visually displeasing, let us have:

We can think of the three types: sequence-generator, array-generator and map-generator as distinct (non-overlapping) subtypes of the generator type.

Updated with Conclusion:

Thus, we can have just a single set of collection-processing functions, that can take as input any type of collection: be it a sequence, a singleton-array, a singleton-map, or any other collection represented by a generator.

ChristianGruen commented 1 year ago

Closely related (both issues should be discussed together): #708.

dnovatchev commented 1 year ago

Closely related (both issues should be discussed together): #708.

These are not "both issues". This is the same issue, which in #708 is just mentioned and touched on the surface, while here we have done a deep analysis and a full design and proposal.

So, to be clear: there is one issue and there is one complete proposal.

ChristianGruen commented 1 year ago

So, to be clear: there is one issue and there is one complete proposal.

This wasn’t meant to be offensive, Dimitre.

dnovatchev commented 1 year ago

Thanks to all for today's discussion.

To address @rhdunn 's statement that the boolean initialized member is not necessary, and that no such property is there in the IEnumerator interface in .NET, here is a quote of the Microsoft's documentaion:

"Current is undefined under any of the following conditions:

Thus, it is clearly stated that immediately after the enumerator is created an (in fact initialization) action is needed (calling the MoveNext method) so that the enumerator will be able to expose a correct value of its Current property.

In the current proposal the initialized member serves exactly the purpose of indicating whether or not a call to moveNext is necessary in order to ensure that a call to getCurrent can produce a correct result.

Thus, rather than mentioning this important fact as something minor in the Remarks, here we choose to have a property that explicitly shows the state of the generator -- being either initialized or non-initialized. Operating on a uninitialized generator can no longer "go unnoticed" or be blamed on the caller. Instead, initialization (calling moveNext) can always be done whenever initialized is false.

rhdunn commented 1 year ago

That just means that in order to traverse an IEnumerator, you need to write:

    IEnumerator<T> enumerator = createGenerator();
    while (enumerator.MoveNext()) {
        T current = enumerator.Current;
        // ...
    }

This is similar to how Java Iterator works, where you write:

    Iterator<T> iterator = createGenerator();
    while (iterator.hasNext()) {
        T next = iterator.next();
        // ...
    }

Using Kotlin, you can implement a NodeList Java Iterator as:

class NodeListIterator(val nodes: NodeList) : Iterator<Node> {
    private var current: Int = 0

    // Returns true if there are more elements. Does not advance the iterator.
    override fun hasNext(): Boolean = current != nodes.length

    // Returns the next element. Advances the iterator.
    override fun next(): Node = nodes.item(current++)
}

In C# -- because of IEnumerable semantics -- this would be something like:

class NodeListIterator : IEnumerable {
    NodeList nodes;
    int current = -1; // Positioned before the first element.

    public NodeListIterator(NodeList nodes) {
        this.nodes = nodes;
    }

    // Move to the next item. Adances the iterator.
    public bool MoveNext() {
        return position++ < nodes.Length;
    }

    // Get the current item. Does not advance the iterator.
    public Node Current {
        get { return nodes.Item(current); }
    }
}

C# has different semantics to Java because the Current item is a property and is named Current. This allows a user to access Current multiple times in the function without moving the iterator. In Java you would need to place the result of next() in a temporary variable to avoid issues with multiple access.

michaelhkay commented 1 year ago

C# has different semantics to Java because the Current item is a property and is named Current. This allows a user to access Current multiple times in the function without moving the iterator. In Java you would need to place the result of next() in a temporary variable to avoid issues with multiple access.

The differences run deeper than this (which makes transpiling Java to C# painful!). In C#, MoveNext changes the state, while in Java, hasNext() must be stateless - you can call it multiple times before a call on next().

dnovatchev commented 1 year ago

Thanks Mike and Reece!

Anyway, having an explicit initialization-state property is something helpful for all developers here and this is its intended goal.

rhdunn commented 1 year ago

I think it would be useful to try and write different generator functions that cover the different expected uses/patterns so we can work out what properties and associated semantics are needed. That way, we would know if an initialized property is useful or if it is not needed.

michaelhkay commented 1 year ago

In short, a generator looks like a function but behaves like an iterator.”

Looking at this thread again, what's being described here looks much more like an iterator than a generator. To me the essential characteristic of a generator is that it's written like a coroutine; a function that yields result items one at a time.

dnovatchev commented 1 year ago

In short, a generator looks like a function but behaves like an iterator.”

Looking at this thread again, what's being described here looks much more like an iterator than a generator. To me the essential characteristic of a generator is that it's written like a coroutine; a function that yields result items one at a time.

Yes, but not every kind of iterator.

We could think of a better name, still the main characteristics of the proposed record type is that it only produces a current value one at a time and doesn't need any materialized backing collection.

This is essentially a generator without requiring the traditional language support for yield return and/or coroutines.

If a specific language exposed its class implementing a generator, this class would greatly resemble what is proposed here.

rhdunn commented 1 year ago

@michaelhkay Yes. The record interface defined here is an iterator. What we (Dimitre and I) are proposing is in effect lazy or dynamic sequences where the (possibly infinite) values are computed on demand.

If we just define the iterator-style record then these won't be immediately useful -- they would be something that could be implemented as an external library. For them to be useful, they need to be usable within XPath, XSLT, and XQuery expressions as sequences.

Thus, with my proposed fn:sequence function in #708, they work like they also have a Java Iterable or Kotlin Sequence (which is used as the basis of Kotlin's Sequences machinery which is similar to XPath sequences).

michaelhkay commented 1 year ago

But I wonder also whether we could find a way of writing iterators in generator/coroutine programming style, as a function which "yields" items one at a time? There's a relationship here with the work we are doing on asynchronous functions (for SaxonJS, primarily).

Of course you can write a function that just returns a sequence, and rely on lazy evaluation to deliver it incrementally. But there's a difference because we're talking here about sequences that only support incremental forwards processing.

We've got a lot of internal machinery for this kind of thing in Saxon. For example it's a convention that system functions only read supplied sequences once, in a forwards direction, which means the caller can supply an iterator; user-written functions however can do anything they like with supplied sequences, which means they have to be re-readable. (Except when we can see that's not needed, which only happens if a function is inlined.)

We also have the ability to execute functions in either pull or push mode, which means they are very like coroutines already.

ChristianGruen commented 1 year ago

I assume the usual way to work with generators will be to feed them into recursive functions, folds or iterate-while loops:

declare function local:iterate($gen, $n) {
  if($n > 0) then (
    local:iterate($gen?next(), $n - 1)
  ) else (
    $gen
  )
};
local:iterate($generator, $n)?value()

fold-left(
  1 to 5,
  $generator,
  fn($gen) { $gen?next() }
)?value()

iterate-while(
  $generator,
  fn($gen, $pos) { $pos < 5 },
  fn($gen) { $gen?next() }
)?value()

If we don’t expect the generator to collect the result, or if we are not only interested in the value that has been generated last, it could be:

declare function local:iterate($gen, $n) {
  if($n > 0) then {
    $gen?value(),
    let $next-gen = $gen?next()
    where exists($next-gen)
    return local:iterate($next-gen, $n - 1)
  }
};
local:iterate($generator, $n)

…which doesn’t look very intuitive, though.

Due to the immutable nature of generators (creating new instances with every next() invocation), I can’t imagine how they could be applicable to classical FLWOR expressions, or user-defined functions accepting arbitrary sequences, even if an implementation processes its arguments lazily.

rhdunn commented 1 year ago

I think it would make sense to keep this issue about the record-based interface. The discussion on coroutines, yield, etc. should be a separate issue.

With my fn:sequence proposal, you could write, e.g.:

    let $gen := fn:random-number-generator() (: or other lazy sequence :)
    for $value in fn:sequence($gen)[1 to 10]
    return math:sin($value)
dnovatchev commented 1 year ago

To add to my last comment:

If we don't like the name "generator" then a more precise name would be enumerator.

It has exactly the same functionality as a .NET enumerator, and doesn't need to have a storage-backed collection, it just produces the next value of the enumeration on each call.

Do note that "enumerable" is applicable not only to finite collections, but also to infinite ones such as 1 to Infinity - a well-known fact from Math that the set of natural numbers (or integers) is "enumerable".

This is why in .NET an IEnumerable has a method ToList and ToArray)) (and many other ToXXX methods) if we want to get the entire collection in-memory.

This is why Visual Studio warns us when we use a method like Count on an IEnumerable that "Calling this method will enumerate the entire collection".

It must be obvious by now that the concept of Enumeration has an immense value and is at the heart of .NET and LINQ.

We can have the same Enumeration here in XPath, as specified in the current proposal.

Let's just do our job.

michaelhkay commented 1 year ago

Certainly Java and C# get great value from the Iterator/Enumerator concept.

However, they are procedural stateful languages, and therefore have limited scope for lazy evaluation. Many of the benefits of iterators/enumerators (specifically, not having to materialise a collection in memory) can be achieved in XPath by using lazy evaluation. That doesn't mean there are no benefits in adding iterators or enumerators to XPath, it just means that they are needed much less often.

dnovatchev commented 1 year ago

Certainly Java and C# get great value from the Iterator/Enumerator concept.

However, they are procedural stateful languages, and therefore have limited scope for lazy evaluation. Many of the benefits of iterators/enumerators (specifically, not having to materialise a collection in memory) can be achieved in XPath by using lazy evaluation. That doesn't mean there are no benefits in adding iterators or enumerators to XPath, it just means that they are needed much less often.

Definitely!

Thus we do agree that there are certain cases (though not so numerous as in C#/Java) where the Enumerator is needed in XPath.

Especially due to the fact that "using lazy evaluation in XPath" cannot be turned on or off by the language and is entirely dependent on a particular implementation.

As for whether "the benefits of iterators/enumerators (specifically, not having to materialise a collection in memory) " ... "are needed much less often", this is a very subjective judgement and can vary from one XPath user to the next.

With the major use-cases identified in this proposal, let's do our job in designing and providing in the Spec the feature that millions of programmers in other languages have and enjoy in their everyday experience.

ChristianGruen commented 1 year ago

@dnovatchev Generators are an interesting concept, no doubt. I still need to understand now how you believe they could be applied in practice. Would it be similar to what I drafted in my comment above, https://github.com/qt4cg/qtspecs/issues/716#issuecomment-1812137554 ?

dnovatchev commented 1 year ago

@dnovatchev Generators are an interesting concept, no doubt. I still need to understand now how you believe they could be applied in practice. Would it be similar to what I drafted in my comment above, #716 (comment) ?

@ChristianGruen Yes, generators will be used with folds, iterate-while, recursive functions and other functions on generators.

As for the code supplied in your comment, I (as often happens with such code) don't understand its purpose and meaning. so a verbal explanation would be helpful.

Included in the opening comment of this proposal there are a big number of examples of useful functions on generators, so we could focus on them as needed and well understood.

Any additional questions are welcome 😄

ChristianGruen commented 1 year ago

As for the code supplied in your comment, I (as often happens with such code) don't understand its purpose and meaning. so a verbal explanation would be helpful. Any additional questions are welcome 😄

The 3 code snippets in my comment were supposed to return the first $n values of a generator.

But maybe I’ll ask differently to get closer to your proposal: How would the following code need to be rewritten if $seq was to be replaced by a generator?

for $item in $seq[position() < 5]
return '• ' || $item
dnovatchev commented 1 year ago

But maybe I’ll ask differently to get closer to your proposal: How would the following code need to be rewritten if $seq was to be replaced by a generator?

for $item in $seq[position() < 5]
return '• ' || $item

The take function on generators as specified in the proposal, returns an array. thus:

gen:take($gen, 4) => array:for-each($x -> '* ' || $x)
dnovatchev commented 1 year ago

But maybe I’ll ask differently to get closer to your proposal: How would the following code need to be rewritten if $seq was to be replaced by a generator?

for $item in $seq[position() < 5]
return '• ' || $item

The take function on generators as specified in the proposal, returns an array. thus:

gen:take($gen, 4) => array:for-each($x -> '* ' || $x)

If we define the take function on generators to return another generator (as done in .NET), instead of an array, then the code will be:

gen:take($gen, 4)  => gen:for-each($x -> '* ' || $x)

We can do any operation within generators only, if we want.

Very smooth, short and intuitive, same as with .NET LINQ.

ChristianGruen commented 1 year ago

Thanks, that was helpful. So I believe I understand what you're proposing as an addition to the spec would be:

How would a generator be constructed?

let $gen := (:
  what would need to be placed here,
  e.g. for a generator that returns odd numbers?
:)
return gen:take($gen, 4) 
    => gen:for-each(fn($x) { '* ' || $x })
rhdunn commented 1 year ago

@ChristianGruen I wrote a generator/enumerator/lazy list for a random number generator in https://github.com/qt4cg/qtspecs/issues/708#issuecomment-1727179443 using value/next properties. Adapting to Dimitre's record properties, it could be something like (which I haven't tested):

xquery version "3.1";

(: https://en.wikipedia.org/wiki/Linear_congruential_generator :)
declare function local:lcg($seed, $a, $b, $m) {
    let $value := ($a * $seed + $b) mod $m
    return map {
        "initialized": true(),
        "endReached": false(),
        "getCurrent": function () { $value },
        "moveNext": function () {
            local:lcg($value, $a, $b, $m)
        }
    }
};

(: ZX81 random number generator LCG values :)
let $gen := local:lcg(100, 75, 74, math:pow(2, 16) + 1)
return gen:take($gen, 4) 
    => gen:for-each(fn($x) { '* ' || $x })
dnovatchev commented 1 year ago

Thanks, that was helpful. So I believe I understand what you're proposing as an addition to the spec would be:

  • a new generator record type and
  • a set of functions (possibly in a gen-prefixed namespace) that could basically be used as an alternative to sequences, arrays and maps.

How would a generator be constructed?

let $gen := (:
  what would need to be placed here,
  e.g. for a generator that returns odd numbers?
:)
return gen:take($gen, 4) 
    => gen:for-each(fn($x) { '* ' || $x })

The wanted code:

let $gen := generator{
                   initialized: true(), 
                   endReached: false(), 
                   getCurrent: fn() {$this.currentValue}, 
                   moveNext: fn($this)  {map:put($this, "currentValue", $this.currentValue + 2)},
                   currentValue: 1
}
 return 
     gen:for-each(fn($x) { '* ' || $x) => gen:subrange(1, 4)

We have a good choice for the return expression:

 `gen:for-each(fn($x) { '* ' || $x) => gen:subrange(1, 4)`

or

   `gen:for-each(fn($x) { '* ' || $x) => gen:take(4)`

or

 `gen:take(4) => gen:for-each(fn($x) { "* " || $x)` 

or

 `gen:subrange(1, 4) => gen:for-each(fn($x) { '* ' || $x)` 
ChristianGruen commented 1 year ago

ZX81

Nice, the one I started with… @rhdunn Thanks for the adapted generator example.

And thanks @dnovatchev. So in addition the proposal would include…

Which of the generator record keys are mandatory, and which are optional?

dnovatchev commented 1 year ago

ZX81

Nice, the one I started with… @rhdunn Thanks for the adapted generator example.

And thanks @dnovatchev. So in addition the proposal would include…

  • the definition of a generator constructor, which contains an initial value, and

The proposal will not include any additional information - users are free to provide whatever generators they need. We may have useful examples, though

Yes, I think the current record definition allows accessing other members from the record

Which of the generator record keys are mandatory, and which are optional?

The 4 key values that define a map as a generator-record are clearly specified in the proposal. One can just read:

let $generator as record
                   (initialized as xs:boolean,
                    endReached as xs:boolean,
                    getCurrent as function(..) as item()*,
                    moveNext as function(..) as .. ,
                    *  ) 
ChristianGruen commented 1 year ago

The proposal will not include any additional information

I got it. So instead of generator { ... } one would probably use the existing map constructor?

map {
  initialized: true(), 
  endReached: false(), 
  getCurrent: fn() { $this.currentValue }, 
  moveNext: fn($this) { map:put($this, "currentValue", $this.currentValue + 2) },
  currentValue: 1
}

Yes, I think the current record definition allows accessing other members from the record

I may have missed that. Do you have a link?

The 4 key values that define a map as a record are clearly specified in the proposal. One can just read:

I saw that indeed, but I was e.g. surprised by this:

initialized is a boolean. When a generator $gen is initially instantiated, $gen?initialized is false()

In your example, it was true(), so isn't it up to the user after all (as you said that “users are free to provide whatever generators they need”)?

Sorry for all the questions, it's not to criticize the proposal, I just want understand more of the nifty details.

dnovatchev commented 1 year ago

In your example, it was true(), so in't it up to the user after all (as you said that “users are free to provide whatever generators they need”)?

Yes, but when a function is passed a generator as argument, it shouldn't rely on it already being initialized.

In certain cases the user may create and initialize a generator, but in other cases they may decide not to do the initialization immediately.

dnovatchev commented 1 year ago

Yes, I think the current record definition allows accessing other members from the record

I may have missed that. Do you have a link?

I cannot find this immediately. I think this was in an issue proposed by @michaelhkay in which members of a record can call each other recursively.

rhdunn commented 1 year ago

@ChristianGruen @dnovatchev The https://github.com/qt4cg/qtspecs/issues/720 proposal suggests a $this implicit variable bound to the record instance in a function call on the record instance.

dnovatchev commented 1 year ago

@ChristianGruen @dnovatchev The #720 proposal suggests a $this implicit variable bound to the record instance in a function call on the record instance.

I think that we need to finalize #720, as it would provide the user with great power and convenience.

dnovatchev commented 1 year ago

Update:

The main change in the examples is that now the take function produces a generator and not an array as before.

Also introducing for convenience an emptyGenerator:

let $emptyGenerator := map{
                    initialized : true(),
                    endReached: true(),
                    getCurrent: function($this as map(*)) {error()},
                    moveNext: function($this as map(*)) {error()}
                   }

Here is the take function:

take($gen as generator, $n as xs:integer) as generator
{
  let $gen := if(not($gen?initialized)) then $gen?moveNext()
                      else $gen,
      return
         if( $gen?endReached or $n eq 0) then emptyGenerator()
            else map:put($gen, "moveNext", take($gen?moveNext($gen) ), $n -1  )
}

Also updated the opening proposal with the above two functions.

dnovatchev commented 1 year ago

Update:

Here are some other useful functions on generators -- also added to the originating proposal -- with just their signature and summary:


These and many other useful functions on generators can and should be added to every generator upon construction.

Thus, it would be good to have an explicit constructor function for a generator:

     construct-generator($record as 
                               record( initialized as xs:boolean,
                                       endReached as xs:boolean,
                                       getCurrent as function(..) as item()*,
                                       moveNext as function(..) as .. ,
                                      )
                         ) as generator
MarkNicholls commented 11 months ago

I like this proposal, but I'm not sure if the interface needs to be as complex as it is (as others have mentioned).

I prefer interfaces to be total, without errors, if they can be avoided.

My suspicion in the IEnumerable may be the way it is for historical reasons when the world was imperative and may even go back to VB6.

My knowledge of writing anything but trivial XPath is a curse, but also a liberation.

But doesnt effectively the list functional "pattern match" like function suffice?

in psuedo code.

let (maybeHead,tail) = headTail(iterator) i.e. the (haskell/ML/OCaml/F#) functional type of the method (you could in fact implement it as a single partial function)

headTail : iterator a -> (maybe a, iterator a) or in C#

(Option<A>,Iterator<A>) HeadTail<A>(Iterator<A> iterator) it has the advantage of

whether that sits well in an XPath context, I can't comment.

To compare this construct directly with IEnumerable and IEnumerator in dotnet I knocked this up in F# (where object expressions are quite similar to the Map constructs used in the XPath code, I'm hoping if you squint it makes sense to non ML speakers).

So this is the object interface I'm suggesting may be good enough,

type Iterator<'a> = 
    abstract member headTail : unit -> ('a option * Iterator<'a>)

this code makes an Iterator out of a List

module List = 
    let rec makeIterator : List<'a> -> Iterator<'a> = 
        fun xs -> 
            {
                new Iterator<'a> with
                    member _.headTail () = 
                        match xs with
                        | [] -> 
                            (None, makeIterator [])
                        | head :: tail ->
                            (Some head, makeIterator tail)
            }

this makes an iterator out of an IEnumerable (known as 'seq' in F#)

module Seq = 
    open System.Collections.Generic

    let makeIterator : 'a seq -> Iterator<'a> = 
        let rec makeIteratorFromIterator : IEnumerator<'a> -> Iterator<'a> = 
            fun xs -> 
                {
                    new Iterator<'a> with
                        member _.headTail () = 
                            match xs.MoveNext() with
                            | true -> 
                                (Some xs.Current,makeIteratorFromIterator xs)
                            | _ ->
                                (None,makeIteratorFromIterator xs)
                }
        fun xs -> 
            makeIteratorFromIterator (xs.GetEnumerator())

this demonstrates how to write a loop (i.e. fold on it)

module Iterator = 
    let rec fold : ('state -> 'a -> 'state) -> 'state -> Iterator<'a> -> 'state = 
        fun folder seed iterator ->
            let (headMaybe,tail) = iterator.headTail ()
            match headMaybe with
            | Some a ->
                fold folder (folder seed a) tail
            | None -> 
                seed

this uses an iterator to fold a list

let 6 = Iterator.fold (+) 0 (List.makeIterator [1;2;3])

this uses an iterator to fold an IEnumerable

let 6 = Iterator.fold (+) 0 (Seq.makeIterator (seq { 1;2;3 }))

so we can see that basically I can turn an IEnumerable into an Iterator

but I can also change it back (but note the awful pain...nasty mutable vars doing things I barely understand, basically a bit of a mess).

    let toEnumerator : Iterator<'a> -> IEnumerator<'a>  = 
        fun iterator ->
            let (maybeHeadI,tailI) = iterator.headTail ()
            let mutable maybeHead = ref maybeHeadI
            let mutable tail = ref tailI
            {
                new IEnumerator<'a> with
                    member _.Dispose() = 
                        ()
                    member _.Current 
                        with get () = 
                        (!maybeHead).Value : obj
                    member _.Current 
                        with get () = 
                        (!maybeHead).Value
                    member _.MoveNext () =
                        let (maybeHeadI,tailI) = iterator.headTail ()
                        maybeHead := maybeHeadI
                        tail := tailI
                        not(!maybeHead = None)
                    member _.Reset () =
                        let (maybeHeadI,tailI) = iterator.headTail ()
                        maybeHead := maybeHeadI
                        tail := tailI
                        ()
            }

and here is the IEnumerable factory method

    let toSeq : Iterator<'a> -> IEnumerable<'a> = 
        fun iterator ->
            {
                new IEnumerable<'a> with
                    member _.GetEnumerator () : IEnumerator<'a> =
                        toEnumerator iterator 
                    member _.GetEnumerator () : System.Collections.IEnumerator =
                        toEnumerator iterator 
            }

I now we can use the implementation of fold in F# defined on IEnumerable

let iterator = List.toSeq [1;2;3]
let 6 = Seq.fold (+) 0 iterator

So at least in F# these two constructs seem to be isomorphic, but the Iterator<'a> construct seems much much simpler and easy to use...in an F# context

whether that simplicity translates in XPath, I can't comment, bit to me it seems simpler.

even simpler it can be defined as a single function, but I think it it may obscure the correlation to the Map constructs in the XPath code.

type IteratorFunction<'a> = IteratorFunction of (unit -> ('a option * IteratorFunction<'a>)) with
    member this.run () =
        let (IteratorFunction f) = this
        f()

module IteratorFunction = 
    let rec fromList : List<'a> -> IteratorFunction<'a> =
        fun xs ->
            let next () = 
                match xs with
                | [] -> 
                    (None, fromList [])
                | head :: tail ->
                    (Some head, fromList tail)
            IteratorFunction next                    

    let rec fold : ('state -> 'a -> 'state) -> 'state -> IteratorFunction<'a> -> 'state = 
        fun folder seed iterator ->
            let (headMaybe,tail) = iterator.run ()
            match headMaybe with
            | Some a ->
                fold folder (folder seed a) tail
            | None -> 
                seed

let 6 = IteratorFunction.fold (+) 0 (IteratorFunction.fromList [1;2;3])

If you wish to give programmers an interface like IEnumerable (i.e. the one proposed) you CAN do so, you simply include these functions in the set of "useful" Iterator functions (rather than methods) and use the code laid out to turn an Iterator into an IEnumerable, you then only need to implement a single function per collection, to use this and any other function defined on it.

dnovatchev commented 11 months ago

If you wish to give programmers an interface like IEnumerable (i.e. the one proposed) you CAN do so, you simply include these functions in the set of "useful" Iterator functions (rather than methods) and use the code laid out to turn an Iterator into an IEnumerable, you then only need to implement a single function per collection, to use this and any other function defined on it.

Yes this is the idea.

Sorry, I don't know F# and cannot understand what the provided F# code does.

Also, we don't have methods in XPath - just functions. But we can have a set of functions all of which belong to the same (XML) namespace.