Closed: stellingsimon closed this issue 7 years ago.
I think a better design would be something like `StreamEx#collapse`.
`StreamEx#collapse` is something very different. In the order-line example, you'd lose all but the first order line of each order, which makes it completely unsuitable...
@stellingsimon take a look at the docs for `groupRuns` and give it a try. It keeps all the records, not just the first of each order.
Thank you very much for your suggestion. I think we already cover this functionality through the various `Seq.grouped()` overloads, right?

https://www.jooq.org/products/jOO%CE%BB/javadoc/latest/org/jooq/lambda/Seq.html#grouped-java.util.function.Function-

They transform a `Seq<T>` into a `Seq<Tuple2<T, Seq<T>>>`.
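To illustrate the shape of that transformation, here is a plain-Java sketch, not jOOL's implementation: the `grouped` method below is a stand-in I wrote, using standard stream collectors to group a stream by key while preserving the first-encounter order of the keys, analogous to what `Seq.grouped()` emits.

```java
import java.util.*;
import java.util.stream.*;

public class GroupedShape {
    // Plain-Java illustration of the Seq<T> -> Seq<Tuple2<K, Seq<T>>> shape:
    // grouping elements by a key, preserving encounter order of the keys.
    static <T, K> Map<K, List<T>> grouped(Stream<T> input,
                                          java.util.function.Function<T, K> classifier) {
        // LinkedHashMap keeps the groups in first-encounter order,
        // mirroring the tuple stream that Seq.grouped() produces.
        return input.collect(Collectors.groupingBy(
                classifier, LinkedHashMap::new, Collectors.toList()));
    }

    public static void main(String[] args) {
        Map<Long, List<String>> groups = grouped(
                Stream.of("a:1", "b:1", "c:2", "d:2", "e:3"),
                line -> Long.parseLong(line.split(":")[1]));
        System.out.println(groups); // {1=[a:1, b:1], 2=[c:2, d:2], 3=[e:3]}
    }
}
```

Note that, like `groupBy`, this collector materializes every group before the first one can be consumed.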
TL;DR: After reading your latest comment in #296, I'm fine with marking this as a duplicate of #296 and sticking with the name `chunked`.
Elaborated:
Thanks for getting back to me. I saw `grouped` before, and it does produce the output that I am looking for. However, my concern with `grouped` is that it will fully materialize (almost) all of its inputs before the iterator for the first group reports that it has no more elements. As you said yourself in #296:
> In other words:
>
> - In order to determine the size of an individual group, we have to run through the entire `Seq`
> - In order to determine the size of an individual chunk, we only have to encounter the next delimiter
Since the consumer will (usually) process the tuples in order, `grouped` will necessarily buffer all elements from the second group through the last [1]. This is problematic in large streams of sizable objects. It is also avoidable if we know the inputs to be pre-sorted, hence the idea to provide explicit support for this in `Seq`'s API. In relation to #296, I think that
```java
Seq.seq(items)
   .chunked(i -> i.getOrderId(), BEFORE)
   .forEach( ... );
```
is not self-explanatory at all, due to the `Enum` argument. This issue was an attempt to provide an alternative solution that expresses the same operation using the client's vocabulary:
```java
Seq.seq(items)
   .groupBySorted(i -> i.getOrderId())
   .forEach( ... );
```
[1]: `grouped` implementation: https://github.com/jOOQ/jOOL/blob/master/src/main/java/org/jooq/lambda/Seq.java#L9463
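For what it's worth, here is a minimal plain-Java sketch of the kind of laziness I have in mind (the name `groupBySorted` and all types here are mine, not jOOL API): grouping a pre-sorted iterator while only ever buffering the current group.

```java
import java.util.*;
import java.util.function.Function;

public class GroupBySorted {
    /**
     * Groups a pre-sorted iterator by key, buffering only one group at a
     * time. If the input is NOT actually sorted by the key, a key's
     * elements are split into several runs instead of one group.
     */
    static <T, K> Iterator<Map.Entry<K, List<T>>> groupBySorted(
            Iterator<T> input, Function<T, K> key) {
        return new Iterator<>() {
            // One-element look-ahead; later groups stay un-consumed.
            T pending = input.hasNext() ? input.next() : null;

            @Override public boolean hasNext() {
                return pending != null;
            }

            @Override public Map.Entry<K, List<T>> next() {
                K k = key.apply(pending);
                List<T> run = new ArrayList<>();
                // Collect the current run of equal keys only.
                while (pending != null && key.apply(pending).equals(k)) {
                    run.add(pending);
                    pending = input.hasNext() ? input.next() : null;
                }
                return Map.entry(k, run);
            }
        };
    }

    public static void main(String[] args) {
        Iterator<Map.Entry<Integer, List<String>>> groups = groupBySorted(
                List.of("a1", "a2", "b1", "c1", "c2").iterator(),
                s -> (int) s.charAt(0));
        while (groups.hasNext())
            System.out.println(groups.next());
    }
}
```

The point is the memory profile: at any time, only the current run plus one look-ahead element is held.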
Let me stress what I said in #296 a bit differently:
> In order to determine the size of an individual group, we have to run through the entire `Seq`
But this doesn't mean that we have to traverse the entire `Seq` to start consuming a group. Quite possibly, the current implementation is not optimal / lazy enough; I think there's currently no test that checks for the laziness of this operation.
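Such a laziness check could look roughly like this. This is a sketch with hypothetical helpers, not an existing jOOL test: wrap the source iterator so it counts consumption, extract only the first group, and assert that later elements were never pulled.

```java
import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;

public class LazinessCheck {
    /** Wraps an iterator and counts how many elements are actually pulled. */
    static <T> Iterator<T> counting(Iterator<T> source, AtomicInteger pulls) {
        return new Iterator<>() {
            @Override public boolean hasNext() { return source.hasNext(); }
            @Override public T next() { pulls.incrementAndGet(); return source.next(); }
        };
    }

    /** Consumes only the first run of equal keys from a sorted iterator. */
    static List<String> firstGroup(Iterator<String> it) {
        List<String> group = new ArrayList<>();
        String first = it.next();
        group.add(first);
        while (it.hasNext()) {
            String next = it.next();
            if (next.charAt(0) != first.charAt(0)) break; // stop at the delimiter
            group.add(next);
        }
        return group;
    }

    public static void main(String[] args) {
        AtomicInteger pulls = new AtomicInteger();
        Iterator<String> source =
                counting(List.of("a1", "a2", "b1", "b2", "c1").iterator(), pulls);
        List<String> group = firstGroup(source);
        // A lazy grouping pulls the group plus one look-ahead element,
        // never the whole input: 2 group elements + 1 look-ahead = 3 pulls.
        System.out.println(group + " after " + pulls.get() + " pulls");
    }
}
```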
I'm fine with re-discussing the naming of `BEFORE` in the relevant issue, although do note that neither is this name set in stone (or far beyond draft status), nor is it the fault of the designer if a user uses a static import ;) The fully qualified enum name could be something along the lines of `IncludeDelimiterInChunks.BEFORE`. Or whatever.
Also, I don't think a `groupBySorted` operation is meaningful. We should not add an operation that depends on a hopefully correct assumption by the developer, and which, in case that assumption is incorrect, fails completely (or rather: falls back to `chunked` semantics). Besides, I would read your suggestion more like a SQL hint, indicating to the API that the grouping operation must be performed by sorting the `Seq` rather than by using a hash map, not as a hint about the predisposition of the stream's state.
Having said so, I'm now convinced as well that this is a duplicate of either the existing `grouped()` operation (which might be optimised) or the newly proposed `chunked()` operation.
This is a feature request for a variant of `groupBy` that doesn't materialize the entire `Seq` at once, by exploiting pre-sortedness of the inputs.

Consider the following scenario that I experienced in the past. From a relational DB, we stream a large number of order-lines. Because we've got a DB index already, sorting them by `order_id` on the DB is cheap and we do it there. The order-lines need to be processed in groups of `order_id`, however. Using `groupBy`, we unnecessarily materialize the entire result set before starting to process the first group of order-lines. Exploiting the existing grouping, this could be avoided.

An example: a new `groupBySorted` should return the following `Seq<Tuple2<Long, OrderLine>>`: …

Or, alternatively, a new `splitGroupsBy` should return the following `Seq<Seq<OrderLine>>`: …

This same operation can be used to implement `chunked(long)` as proposed in #320 without resorting to a stateful `CountingPredicate`: … or …

Do you think either of these would be a useful addition? Which version do you prefer?
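To make the `chunked(long)` connection concrete, here is a rough plain-Java sketch (class and method names are mine, not jOOL API): chunking is just grouping by the key `elementIndex / chunkSize`, and since that key sequence is monotonic by construction, the pre-sortedness assumption of a `groupBySorted`-style operation holds trivially.

```java
import java.util.*;
import java.util.stream.*;

public class ChunkedViaGrouping {
    // chunked(n) expressed as grouping by index / n. The key sequence
    // 0,0,...,1,1,... is monotonic, i.e. "pre-sorted" by construction,
    // so no stateful predicate is needed. Requires size > 0.
    static <T> Collection<List<T>> chunked(List<T> input, int size) {
        Map<Integer, List<T>> byChunk = IntStream.range(0, input.size()).boxed()
                .collect(Collectors.groupingBy(
                        i -> i / size,                      // monotonic chunk key
                        TreeMap::new,                       // keep chunks in order
                        Collectors.mapping(input::get, Collectors.toList())));
        return byChunk.values();
    }

    public static void main(String[] args) {
        System.out.println(chunked(List.of(1, 2, 3, 4, 5), 2));
        // -> [[1, 2], [3, 4], [5]]
    }
}
```

Of course this sketch is eager; the point is only the equivalence of the two operations, not the laziness.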
Edge cases: Given …, `groupBySorted` should produce …, or respectively, `splitGroupsBy` should produce ….