Open shardulc opened 11 months ago
I would like to propose some changes to the documentation of fold{,Left,Right}
and reduce{,Option,Left,Right}
that clarifies their behavior when the collection is not ordered or when the operator is not associative or commutative. Currently, these methods warn that the result may be different for different runs, unless some conditions are met. fold
, reduce
, and reduceOption
even require that their operator argument be associative. If their conditions are unmet, these three methods do not specify their behavior at all, and the other four (foldLeft
et al.) specify it only weakly. I think they should all specify their baseline behavior and give stronger guarantees when their conditions are met.
I am not proposing any change to the behavior of these methods, only to their documentation.
fold
, reduce
, and reduceOption
Documentation of reduce
excerpted as a reference example:
Reduces the elements of this collection using the specified associative binary
operator.
The order in which operations are performed on elements is unspecified and may
be nondeterministic.
Value parameters
`op` A binary operator that must be associative.
Attributes
Returns The result of applying reduce operator `op` between all the
elements if the collection is nonempty.
This leaves the behavior unspecified if the operator is not associative, and is not very clear about the order of operations when the collection is ordered. Instead, my proposed specification of reduce
:
reduce
returns the result of repeatedly applying the operator, where each operand is either an element of the collection or another such application of the operator, such that each element appears exactly once in the computation. The order of applications of the operator is unspecified and may be nondeterministic.fold
is similar, with additional conditions about the initial value. My proposal:
fold
returns the result of repeatedly applying the operator, where each operand is either an element of the collection, the initial value, or another such application of the operator, such that each element appears exactly once in the computation. The order of applications of the operator is unspecified and may be nondeterministic. The initial value may be used an arbitrary number of times.foldLeft
, foldRight
, reduceLeft
, and reduceRight
Documentation of reduceLeft
excerpted as a reference example:
Applies a binary operator to all elements of this collection, going left to
right.
Note: will not terminate for infinite-sized collections.
Note: might return different results for different runs, unless the underlying
collection type is ordered or the operator is associative and commutative.
Value parameters
`op` the binary operator.
Attributes
Returns the result of inserting op between consecutive elements of this
collection, going left to right:
`op( op( ... op(x1, x2) ..., xn-1), xn)`
where `x1, ..., xn` are the elements of this collection.
First, we make the same observation as for reduce
, that the behavior is underspecified if the operator is not associative or commutative and the collection is not ordered. We start with the same proposed specification as for reduce
.
Second, the Returns
attribute points at the key additional guarantee reduceLeft
provides over reduce
, but does not explicitly put it into words. This guarantee is that the right operand will always be an element of the collection. Then, foldLeft
adds yet another guarantee, which is that the initial value will be used exactly once as the leftmost operand. This specification is symmetric for the -Right
methods.
Third, associativity is no longer needed for a deterministic result if the collection is ordered since the order of operations is fully specified. For the fold
- methods, it is no longer needed for the initial value to be a unit since it will always be used exactly once in a fully specified position.
A use case that actually came up recently for some colleagues and me was the following. We have an inductive binary tree type:
enum Tree[T]:
case Leaf(value: T)
case Node(left: Tree[T], right: Tree[T])
And we have methods for insertion, balancing, etc. But to quickly construct a tree from a Seq[T]
of elements, we wanted to do:
val tree: Tree[T] = elems.map(Leaf).reduce(Node)
But we realized that the behavior of reduce
is unspecified since Node
is not associative. However, we explicitly don't care about the association of Node
. It doesn't matter if it creates a line tree (as it would with reduceLeft
) or a balanced tree (as it would if the collection were recursively halved) as we can address that afterwards if needed. All that matters is that we have the result of applying Node
to all the Leaf
s in some order.
I did some git digging to see where the “must be associative” docstrings came from. As far as I can tell, they were introduced in 2010 in e67f560766 “Adding parallel collections to trunk.” Sequential collections at that point had only reduceLeft
et al., not reduce
. Subsequent commits (e.g. 3de96153e5, 7237732306f, 4b21b76435) created common abstractions for collections, preserving the parallel collections’ documentation for reduce
that sequential collections were now inheriting. The commit messages do not seem to discuss this underspecification issue for either reduce
or reduceLeft
.
It makes sense to me that this specification was associated with parallelism, because in the presence of parallelism, one might want to give up the exact order in which the reduction operator is applied in exchange for the collection being processed faster in parallel. It also makes sense that the laxer specification was kept as-is for sequential collections instead of special-casing and trying to nail down the sequential behavior. But I think that the original specification was too lax even for parallel collections, and that the pre-existing sequential specification was also too lax to a lesser degree.
Summary
fold
inStringOps
andArrayOps
, andfold
,reduce
, andreduceOption
inIterableOnce
, require that their operator argument be associative—at least according to their documentation. Their behavior is unspecified if the operator is not associative. On the other hand, methods likereduceLeft
specify their behavior (“order of operations may be nondeterministic”) if associativity or other conditions are not satisfied. I would like to propose a similar clarification tofold
et al.Details
Documentation of
reduce
excerpted for reference:Note that this leaves the behavior unspecified if the operator is not associative.
As for
reduceLeft
:A careful reading of the above reveals that even if the operator is not associative or commutative, and the collection is not ordered, the output of this method will be the result of applying
op
to some two elements of the collection, and then to that and some third, then that and a fourth, and so on.I would like to propose a similar clarification to the documentation of
reduce
andreduceOption
. The output will always be the result of applying the operator to the elements in arbitrary order with arbitrary association. This result may be different for different runs unless:fold
is similar, with the additional condition that the initial value is a unit for the operator. Again, if it is not a unit, the result may be different for different runs, but it will always be the result of applying the operator to the elements or the initial value in arbitrary order with arbitrary association where the initial value may be used an arbitrary number of times.Proposed documentation
For
reduce
:For
fold
:Why might we want this?
A use case that actually came up recently for some colleagues and me was the following. We have an inductive binary tree type:
And we have methods for insertion, balancing, etc. But to quickly construct a tree from a
Seq[T]
of elements, we wanted to do:But we realized that the behavior of
reduce
is unspecified sinceNode
is not associative. However, we explicitly don't care about the association ofNode
. It doesn't matter if it creates a line tree (as it would withreduceLeft
) or a balanced tree (as it would if the collection were recursively halved) as we can address that afterwards if needed. All that matters is that we have the result of applyingNode
to all theLeaf
s in some order.History
I did some git digging to see where the “must be associative” docstrings came from. As far as I can tell, they were introduced in 2010 in e67f560766 “Adding parallel collections to trunk.” Sequential collections at that point had only
reduceLeft
et al., notreduce
. Subsequent commits (e.g. 3de96153e5, 7237732306f, 4b21b76435) created common abstractions for collections, preserving the parallel collections’ documentation forreduce
that sequential collections were now inheriting. The commit messages do not seem to discuss this underspecification issue.It makes sense to me that this specification was associated with parallelism, because in the presence of parallelism, one might want to give up the exact order in which the reduction operator is applied in exchange for the collection being processed faster in parallel. It also makes sense that the laxer specification was kept as-is for sequential collections instead of special-casing and trying to nail down the sequential behavior. But I think that the original specification was too lax even for parallel collections.