shardulc / scala

Scala 2 compiler and standard library. Bugs at https://github.com/scala/bug; Scala 3 at https://github.com/lampepfl/dotty
https://www.scala-lang.org/
Apache License 2.0
0 stars 0 forks source link

`reduce` unspecified when operator not associative #1

Open shardulc opened 11 months ago

shardulc commented 11 months ago

Summary

fold in StringOps and ArrayOps, and fold, reduce, and reduceOption in IterableOnce, require that their operator argument be associative—at least according to their documentation. Their behavior is unspecified if the operator is not associative. On the other hand, methods like reduceLeft specify their behavior (“order of operations may be nondeterministic”) if associativity or other conditions are not satisfied. I would like to propose a similar clarification to fold et al.

Details

Documentation of reduce excerpted for reference:

Reduces the elements of this collection using the specified associative binary
operator.

The order in which operations are performed on elements is unspecified and may
be nondeterministic.

Value parameters
  `op`        A binary operator that must be associative.

Attributes
  Returns     The result of applying reduce operator `op` between all the
              elements if the collection is nonempty.

Note that this leaves the behavior unspecified if the operator is not associative.

As for reduceLeft:

Applies a binary operator to all elements of this collection, going left to
right.

Note: will not terminate for infinite-sized collections.

Note: might return different results for different runs, unless the underlying
collection type is ordered or the operator is associative and commutative.

Value parameters
  `op`          the binary operator.

Attributes
  Returns       the result of inserting op between consecutive elements of this
                collection, going left to right:
                    `op( op( ... op(x1, x2) ..., xn-1), xn)`
                where `x1, ..., xn` are the elements of this collection.

A careful reading of the above reveals that even if the operator is not associative or commutative, and the collection is not ordered, the output of this method will be the result of applying op to some two elements of the collection, and then to that and some third, then that and a fourth, and so on.

I would like to propose a similar clarification to the documentation of reduce and reduceOption. The output will always be the result of applying the operator to the elements in arbitrary order with arbitrary association. This result may be different for different runs unless:

fold is similar, with the additional condition that the initial value is a unit for the operator. Again, if it is not a unit, the result may be different for different runs, but it will always be the result of applying the operator to the elements or the initial value in arbitrary order with arbitrary association where the initial value may be used an arbitrary number of times.

Proposed documentation

For reduce:

Reduces the elements of this collection using the specified binary operator.

The order in which operations are performed on elements is unspecified and may
be nondeterministic. Thus, the result may be different for different runs,
unless: (1) `op` is associative and commutative; or (2) `op` is associative and
the collection is ordered.

Value parameters
  `op`        A binary operator.

Attributes
  Returns     The result of applying reduce operator `op` between all the
              elements if the collection is nonempty.

For fold:

Folds the elements of this collection using the specified binary operator. The
default implementation in `IterableOnce` is equivalent to `foldLeft` but may be
overridden for more efficient traversal orders.

The order in which operations are performed on elements is unspecified and may
be nondeterministic. The initial element `z` may be used as an operand an
arbitrary number of times. Thus, the result may be different for different runs,
unless: (1)(i) `op` is associative and commutative; or (ii) `op` is associative 
and the collection is ordered; and (2) `z` is a neutral element, a.k.a. unit,
for `op` (e.g. `Nil` for list concatenation, `0` for addition, or `1` for
multiplication).

Note: will not terminate for infinite-sized collections.

Value parameters
  `op`      a binary operator.
  `z`       an initial element for the fold operation; may be used as an operand
            an arbitrary number of times.

Attributes
  Returns   the result of applying the fold operator `op` between all the
            elements and an arbitrary number of instances of `z`, or `z` if this
            collection is empty.

Why might we want this?

A use case that actually came up recently for some colleagues and me was the following. We have an inductive binary tree type:

enum Tree[T]:
  case Leaf(value: T)
  case Node(left: Tree[T], right: Tree[T])

And we have methods for insertion, balancing, etc. But to quickly construct a tree from a Seq[T] of elements, we wanted to do:

val tree: Tree[T] = elems.map(Leaf).reduce(Node)

But we realized that the behavior of reduce is unspecified since Node is not associative. However, we explicitly don't care about the association of Node. It doesn't matter if it creates a line tree (as it would with reduceLeft) or a balanced tree (as it would if the collection were recursively halved) as we can address that afterwards if needed. All that matters is that we have the result of applying Node to all the Leafs in some order.

History

I did some git digging to see where the “must be associative” docstrings came from. As far as I can tell, they were introduced in 2010 in e67f560766 “Adding parallel collections to trunk.” Sequential collections at that point had only reduceLeft et al., not reduce. Subsequent commits (e.g. 3de96153e5, 7237732306f, 4b21b76435) created common abstractions for collections, preserving the parallel collections’ documentation for reduce that sequential collections were now inheriting. The commit messages do not seem to discuss this underspecification issue.

It makes sense to me that this specification was associated with parallelism, because in the presence of parallelism, one might want to give up the exact order in which the reduction operator is applied in exchange for the collection being processed faster in parallel. It also makes sense that the laxer specification was kept as-is for sequential collections instead of special-casing and trying to nail down the sequential behavior. But I think that the original specification was too lax even for parallel collections.

shardulc commented 11 months ago

Summary

I would like to propose some changes to the documentation of fold{,Left,Right} and reduce{,Option,Left,Right} that clarifies their behavior when the collection is not ordered or when the operator is not associative or commutative. Currently, these methods warn that the result may be different for different runs, unless some conditions are met. fold, reduce, and reduceOption even require that their operator argument be associative. If their conditions are unmet, these three methods do not specify their behavior at all, and the other four (foldLeft et al.) specify it only weakly. I think they should all specify their baseline behavior and give stronger guarantees when their conditions are met.

I am not proposing any change to the behavior of these methods, only to their documentation.

Details

fold, reduce, and reduceOption

Documentation of reduce excerpted as a reference example:

Reduces the elements of this collection using the specified associative binary
operator.

The order in which operations are performed on elements is unspecified and may
be nondeterministic.

Value parameters
  `op`        A binary operator that must be associative.

Attributes
  Returns     The result of applying reduce operator `op` between all the
              elements if the collection is nonempty.

This leaves the behavior unspecified if the operator is not associative, and is not very clear about the order of operations when the collection is ordered. Instead, my proposed specification of reduce:

fold is similar, with additional conditions about the initial value. My proposal:

foldLeft, foldRight, reduceLeft, and reduceRight

Documentation of reduceLeft excerpted as a reference example:

Applies a binary operator to all elements of this collection, going left to
right.

Note: will not terminate for infinite-sized collections.

Note: might return different results for different runs, unless the underlying
collection type is ordered or the operator is associative and commutative.

Value parameters
  `op`          the binary operator.

Attributes
  Returns       the result of inserting op between consecutive elements of this
                collection, going left to right:
                    `op( op( ... op(x1, x2) ..., xn-1), xn)`
                where `x1, ..., xn` are the elements of this collection.

First, we make the same observation as for reduce, that the behavior is underspecified if the operator is not associative or commutative and the collection is not ordered. We start with the same proposed specification as for reduce.

Second, the Returns attribute points at the key additional guarantee reduceLeft provides over reduce, but does not explicitly put it into words. This guarantee is that the right operand will always be an element of the collection. Then, foldLeft adds yet another guarantee, which is that the initial value will be used exactly once as the leftmost operand. This specification is symmetric for the -Right methods.

Third, associativity is no longer needed for a deterministic result if the collection is ordered since the order of operations is fully specified. For the fold- methods, it is no longer needed for the initial value to be a unit since it will always be used exactly once in a fully specified position.

Why might we want this?

A use case that actually came up recently for some colleagues and me was the following. We have an inductive binary tree type:

enum Tree[T]:
  case Leaf(value: T)
  case Node(left: Tree[T], right: Tree[T])

And we have methods for insertion, balancing, etc. But to quickly construct a tree from a Seq[T] of elements, we wanted to do:

val tree: Tree[T] = elems.map(Leaf).reduce(Node)

But we realized that the behavior of reduce is unspecified since Node is not associative. However, we explicitly don't care about the association of Node. It doesn't matter if it creates a line tree (as it would with reduceLeft) or a balanced tree (as it would if the collection were recursively halved) as we can address that afterwards if needed. All that matters is that we have the result of applying Node to all the Leafs in some order.

History

I did some git digging to see where the “must be associative” docstrings came from. As far as I can tell, they were introduced in 2010 in e67f560766 “Adding parallel collections to trunk.” Sequential collections at that point had only reduceLeft et al., not reduce. Subsequent commits (e.g. 3de96153e5, 7237732306f, 4b21b76435) created common abstractions for collections, preserving the parallel collections’ documentation for reduce that sequential collections were now inheriting. The commit messages do not seem to discuss this underspecification issue for either reduce or reduceLeft.

It makes sense to me that this specification was associated with parallelism, because in the presence of parallelism, one might want to give up the exact order in which the reduction operator is applied in exchange for the collection being processed faster in parallel. It also makes sense that the laxer specification was kept as-is for sequential collections instead of special-casing and trying to nail down the sequential behavior. But I think that the original specification was too lax even for parallel collections, and that the pre-existing sequential specification was also too lax to a lesser degree.