Aggregation or partition structure added to Iterables

GoogleCodeExporter commented 9 years ago

In ref to https://groups.google.com/d/topic/guava-discuss/NfE09gFfPjU/discussion

This issue discusses the addition of the following three methods to Iterables:

Iterable<List<T>> partition(Iterable<T> iterable, Comparator<T> comparator)
Iterable<List<T>> parition(Iterable<T> iterable, Equivalence<T> equivalence)
Iterable<List<T>> parition(Iterable<T extends Comparable<? super T>> iterable)

The proposed function of these methods would be to group or chunk like items 
together without disrupting the order of the elements or their occurrence in 
the underlying iterable or boundaries in the underlying iterable.

As a side-effect of this it would also be prudent to add something to 
Equivalences like Equivalence<T> from(Comparator<T> comparator) for 
completeness.

Original issue reported on code.google.com by emily@soldal.org on 23 May 2012 at 4:42

GoogleCodeExporter commented 9 years ago

This sounds very similar to Python's itertools.groupby [1]. +1 from me :)

[1] http://docs.python.org/library/itertools.html#itertools.groupby

Original comment by stephan...@gmail.com on 23 May 2012 at 10:11

GoogleCodeExporter commented 9 years ago

Jet-lagged brainwave:
These signatures would also naturally fit into FluentIterable, but in that 
instance groupBy() would seem to give more meaning.

Original comment by emily@soldal.org on 26 May 2012 at 1:04

GoogleCodeExporter commented 9 years ago

I would appreciate an aggregation method in Guava. I made my own utility class 
years ago, before Guava era. Let me share some observation from its usage and 
from reading the 
https://groups.google.com/forum/#!topic/guava-discuss/NfE09gFfPjU/discussion :

1. Multimaps.index is pretty close to aggregation method. Kevin's first entry 
in discussion perfectly fits to my usecases, however it always returns new 
multimap. Sometimes, for example when streaming large file, lightweight 
approach is more desirable. E.g. pass through iterable once and generate events 
like "open group" and "close group" to some builder, like for example during 
SAX xml parsing.

2. To use Equivalence for group boundary recognition is unnecessarily 
restrictive. The problem to be solved is just to decide if two neighboring 
elements are in the same group. The aggregation method would then be usable 
e.g. for finding continuous subsequences in list of integers [1,2,3,4,7,10,11] 
-> 1-4,7,10-11 or for some simple analysis. Hence Equivalence would be abused 
here since symmetry and transitivity would be violated. Predicate<Pair<T>> (if 
Guava had Pair) would IMHO be more appropriate. 

3. Aggregation is often multilevel and the function on which data is aggregated 
is usually the same as the function on which data is sorted. The API should 
facilitate both in some convenient way.

Example of my API (real use case, original entities from telco domain):

class Person {String surname; String name; int number; ...}
List<Person> persons = Lists.newArrayList(
        new Person("Doe","Jack",1), 
        new Person("Doe","Jack",2),
        new Person("Doe","Joe" ,1), 
        new Person("Doe","Joe" ,2),
        new Person("Doe","Joe" ,3),
        new Person("Doe","Joe" ,5),
        new Person("Doe","Joe" ,6),
        new Person("Doe","Joe" ,7),
        new Person("Doe","Joe" ,8),
        new Person("Roe","Jane",1),
        new Person("Roe","Jane",2),
        new Person("Roe","Jane",3),
        new Person("Roe","Jane",5)
);
Builder<Person,String> bySurname = new Builder<Person,String>() {
    public String key(Person current) {return current.surname;}
    public void openGroup(Person first) {System.out.println(first.surname);}
};
Builder<Person,String> byName = new Builder<Person,String>() {
    public String key(Person current) {return current.name;}
    public void openGroup(Person first) {System.out.println("+" + first.name);}
};
Builder<Person,Integer> byNumber = new Builder<Person,Integer>() {
    private int firstNumber;
    public Integer key(Person current) {return current.number;}
    public boolean isSameGroup(Integer previousExpression, Integer currentExpression) {return previousExpression + 1 == currentExpression;}
    public void openGroup(Person first) {System.out.print("++" + (firstNumber = first.number));}
    public void closeGroup(Person last) {System.out.println(firstNumber == last.number ? "" : "-" + last.number);}
};
Grouping.group(persons).by(bySurname,byName,byNumber);

result:

Doe
+Jack
++1-2
+Joe
++1-3
++5-8
Roe
+Jane
++1-3
++5

My dream Guava API for this would then look like

FluentIterable
    .from(persons)
    .groupBy(Person.GET_SURNAME)
    .groupBy(Person.GET_NAME)
    .groupBy(Person.GET_NUMBER,EQUIVALENCE_FOR_DIFFERENCE_1)

(Just a shot, I didn't researched it deeply yet.)

Original comment by tomas.za...@gmail.com on 29 May 2012 at 10:29

GoogleCodeExporter commented 9 years ago

@tomas, I think there are two different things going on here.

"Group boundary recognition" is a totally different problem than the one I 
think Emily's addressing -- in particular, there's the case when we want to 
group *contiguous* elements, and the case when we want to group 
not-necessarily-contiguous elements together (e.g. by an Equivalence).

It's not clear to me that a "One True Aggregation API" could even exist, but I 
think there are definite possibilities for improvement.

The idea playing around in my head at the moment is some sort of fluent 
"Grouping" API that takes an Iterable<E> and returns an Iterable<Iterable<E>>.  
Groupings might be contiguous, they might be non-contiguous, they might be 
based on a function or an ordering or an equivalence...am I making sense?

Original comment by wasserman.louis on 29 May 2012 at 10:43

GoogleCodeExporter commented 9 years ago

That is, let me clarify: I think groupings are too complicated to just get 
tacked on to e.g. FluentIterable; I think there are too many different ways to 
"group" to just have a couple of heavily overloaded methods.

Original comment by wasserman.louis on 29 May 2012 at 10:44

GoogleCodeExporter commented 9 years ago

A full API built as opposed to a single method on FluentIterable? Me gusta. 

Now that you mention it Louis, this does seem to make more sense. Since nested 
grouping using FluentIterable would result in Iterable<Iterable<Iterable...ad 
nausium, or at least it would risk it. Perhaps it could be solved by being able 
to combine Equivalences, when using a Comparator you could use 
Ordering.compound() to do this nesting for you. 

Sorting the data is actually not a prerequisit for grouping, even though the 
two are often together.

Original comment by emily@soldal.org on 29 May 2012 at 10:47

GoogleCodeExporter commented 9 years ago

+1 for Equivalences.intersect (if not the name, at least the functionality).

My concern with "aggregation" has always been that it's such an open-ended, 
complicated thing that encompasses so many different ideas that designing an 
API to deal with them all is *difficult.*  Just from this issue and off the top 
of my head, we have

* grouping by Equivalence (contiguous or not)
* grouping by HashFunction (?)
* grouping by Comparator (contiguous or not)
* grouping into chunks of K
* grouping into K chunks
* combinations of the above

There are a lot of different ways to take these ideas, and I'm still not sure 
we can fit them all into one unified API, but I think that a fluent "Grouping" 
API is the only way to even get close.

Original comment by wasserman.louis on 29 May 2012 at 11:14

GoogleCodeExporter commented 9 years ago

* grouping by Equivalence (contiguous or not) / grouping by Comparator 
(contiguous or not)
    - I feel like we should only concern outselves with contiguous variants (and I hope I am using that word correctly...), any preliminary grouping of elements should be done by the user and it would feel presumptious to simply assume that the data should be structured. We could sort it beforehand but then you loose a lot of performance if you present yourself as a view and then the user iterates multiple times.

* grouping by HashFunction (?)
    - Isn't this the same as a equivalence with a hashing function infront of it? Interesting idea though

* grouping into chunks of K / grouping into K chunks
    - K chunks in N is tricky as you will have to judge the size of the iterable before figuring out how big a chunk should be

Original comment by emily@soldal.org on 29 May 2012 at 11:30

GoogleCodeExporter commented 9 years ago

"I feel like we should only concern outselves with contiguous variants (and I 
hope I am using that word correctly...), any preliminary grouping of elements 
should be done by the user"

I'm not sure I buy this?  I would say that Multimaps.index does "aggregation," 
but it simply doesn't care what order the values come in, and I feel like 
that's a use case worth addressing -- especially for e.g. Equivalences where 
there isn't an obvious way to "sort."

That is, I think it's totally legit for a user to say, "I have these totally 
arbitrarily ordered values, and I want them grouped according to an 
Equivalence."

"grouping by HashFunction (?)
    - Isn't this the same as a equivalence with a hashing function infront of it? Interesting idea though"

Mostly I think of this as being like Multimaps.index, except using a custom 
hash function that might be more efficient/effective/have fewer collisions

"- K chunks in N is tricky as you will have to judge the size of the iterable 
before figuring out how big a chunk should be"
Indeed, that version would be strict -- and if we want to deal with 
noncontiguous groupings, those would be strict too.

Original comment by wasserman.louis on 29 May 2012 at 11:38

GoogleCodeExporter commented 9 years ago

Sorting by equivalence feels iffy to me, although I guess theres nothing 
strictly speaking wrong with it if its made strict? o_O 
You'd have to do some sort of intelligent sorting algorithm though, since you 
don't know if A is above or below B only that it is different. How would you 
sort a list where no elements are equal? Wouldn't it go into a loop?

Original comment by emily@soldal.org on 29 May 2012 at 11:45

GoogleCodeExporter commented 9 years ago

Most of what we've discussed so far seems to be related to the way you define a 
grouping rather than the operation of grouping itself. Even if we wish to do 
ordering, it occurs outside of the actual operation of grouping.

What if instead of a Grouping API we define a builder pattern for a grouping 
rule:

FluentIterable<List<Person>> groupedPeople = 
FluentIterable.from(people).grouping().by(Person.GET_SURNAME).by(Person.GET_NAME
).group();

or

FluentIterable<List<Person>> groupedPeople = 
FluentIterable.from(people).grouping().by(Person.GET_SURNAME).by(Person.GET_NAME
).sortedBy(Person.GET_SURNAME).group();

where sortedBy takes either a Comparator<T> or a Function<T,Comparable> which 
we can use with Ordering.natural().onResultOf()

Original comment by emily@soldal.org on 30 May 2012 at 8:38

GoogleCodeExporter commented 9 years ago

Hmmmm.  I guess my thinking is that having actual Grouping objects would be 
nice -- like, a static final constant Grouping that you reuse.

"Sorting by equivalence feels iffy to me, although I guess theres nothing 
strictly speaking wrong with it if its made strict?"

I'm not suggesting sorting at all -- the ordering would be by "first seen," so 
the first element in a new equivalence class would determine the position of 
that equivalence class in the ordering.

Original comment by wasserman.louis on 30 May 2012 at 8:44

GoogleCodeExporter commented 9 years ago

Or, heck, ordering by Equivalence could just give you an arbitrary ordering of 
the equivalence classes.

Original comment by wasserman.louis on 30 May 2012 at 8:45

GoogleCodeExporter commented 9 years ago

So Iterable<T> -> Iterable<Grouping<T>> where Grouping holds a Predicate<T> 
composed of the Equivalence.equivalentTo(T) for the first instance of T now in 
another Grouping ?

Original comment by emily@soldal.org on 30 May 2012 at 8:59

GoogleCodeExporter commented 9 years ago

Wait, what?

I was thinking that Grouping would represent more of a "grouping *rule*," and 
then you'd call Grouping.group(iterable) and get back an Iterable<Iterable<T>>. 
 So, something like

Grouping<PhoneBookRecord> SAME_PHONE_NUMBERS =
  Grouping.onEquivalence(Equivalences.equals()
    .onResultOf(GET_PHONE_NUMBER_FUNCTION));

Iterable<Iterable<PhoneBookRecord>>
  peopleUsingSamePhone = SAME_PHONE_NUMBERS.group(phoneBook);

Original comment by wasserman.louis on 30 May 2012 at 9:10

GoogleCodeExporter commented 9 years ago

Ah ok, gotcha.

Tangentially I found that Iterable<Iterable<T>> isn't really feasable if you 
wish to do it as a view of an Iterable, so you end up either producing a copy 
of the entier iterable or a copy of the particular grouping within the iterable.

Is the idea in the above example that Groupiing.group(Iterable<T>) produces a 
copy? Can we work this into FluentIterable somehow?

Original comment by emily@soldal.org on 30 May 2012 at 9:16

GoogleCodeExporter commented 9 years ago

Hmmmm.  I'm not sure how I really want to deal with the strict/lazy thing; we 
could leave that issue as "dependent on the precise grouping," or perhaps we 
could split it up into different classes?  I'm not sure. =/

Original comment by wasserman.louis on 30 May 2012 at 9:19

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 19 Jun 2012 at 5:34

Changed state: Acknowledged
Added labels: Package-Collect, Type-Addition

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 22 Jun 2012 at 6:16

Changed state: Research

GoogleCodeExporter commented 9 years ago

I think we can get 95% of the way there by handling two cases:

1) unsorted data with a key (already supported by Multimaps.index)
2) sorted data with a Comparator (to be supported)

I'm tempted to view (2) as related to issue 477, the head/tail / skip/limit / 
drop/take method.  Both operate on sorted data, and that data is quite possibly 
keyed by some field.  Building off some ideas by Tomas and Louis, a List 
version might look something like this:

  class KeySortedList<E, K> {
    static <E, K> KeySortedList<E, K> createFromSorted(
        List<E> elements, Function<E, K> indexer, Comparator<K> comparator);
    static <E, K> KeySortedList<E, K> sortAndCreate(
        Iterable<E> elements, Function<E, K> indexer, Comparator<K> comparator);

    KeySortedList<E, K> head(K toKey, BoundType boundType);
    KeySortedList<E, K> tail(K fromKey, BoundType boundType);
    KeySortedList<E, K> sub(
        K fromKey, BoundType fromBoundType, K toKey, BoundType toBoundType);

    ImmutableMultimap<K, E> index(); // unnecessary given Multimaps.index?

    // potentially higher, lower, ceiling, floor
    // access to underlying List
    // potentially insertion and removal
    // potentially unique indexing
    // potentially containsKey
  }

Iterator is basically the same except that the input must be pre-sorted and 
that the grouping operation is a little stranger:

  class KeySortedIterator<E, K> {
    static <E, K> KeySortedIterator<E, K> createFromSorted(
        Iterator<E> elements, Function<E, K> indexer, Comparator<K> comparator);

    KeySortedIterator<E, K> head(K toKey, BoundType boundType);
    KeySortedIterator<E, K> tail(K fromKey, BoundType boundType);
    KeySortedIterator<E, K> sub(
        K fromKey, BoundType fromBoundType, K toKey, BoundType toBoundType);

    UnmodifiableIterator<ImmutableList<E>> asGroupIterator();

    // potentially higher, lower, ceiling, floor
    // access to underlying Iterator
  }

Prior art:

- We have a package-private SortedLists class that performs some operations 
like KeySortedList's, including its use of key-based access.

- As I noted on issue 477, we had an "OrderedIterator" class before inside 
Google, and it was barely used.  However, it offered none of these methods, nor 
did it have key-based access.

Questions:

- Do these interfaces look useful?

- Do we need both List and Iterator versions?  If so, would a generic 
KeySorted<C, K> type, where C is List<E> or Iterator<E>, be better or worse 
than two separate types?

- Is the Iterator version useful only for the heavyweight kind of iterator I 
don't like, the kind that's pulling data from a RPC or something?  Someone 
mentioned XML somewhere; I could at least pretend that the whole input string 
has been read and that we're merely generating nodes/events on the fly, though 
the argument that we need to avoid allocating O(n) memory when we're already 
using O(n) is less compelling.  Someone mentioned the example of infinite 
iterators; that's more appealing but probably not the most common case.  I can 
probably set my reservations here aside, but if someone has a slam-dunk example 
of a "normal" Iterator that benefits significantly from partitioning on the 
fly, it would make me feel better.

- Louis, do you think that KeySortedList would cover most of use cases provided 
by SortedLists's KeyAbsentBehavior and KeyPresentBehavior?  Well, obviously 
I've left important things out, like insertion and removal, so a fairer 
question is "How much more would we need to add?" or "Are there use cases that 
KeySortedList could never cover?"  It's not really fair to ask you to do this 
when I'm the one with access to the internal SortedLists users, but since you 
introduced the Behavior enums, you probably had a better idea than I did of 
where they would be used.

Original comment by cpov...@google.com on 29 Aug 2012 at 12:26

GoogleCodeExporter commented 9 years ago

I think we should move away from the idea of grouping, grouping implies 
greedyness which is what we don't want. We want to retain the order of the 
underlying Iterable and simply partition it into chunks based on some 
condition, sorted or not.

Original comment by e...@google.com on 29 Aug 2012 at 10:55

GoogleCodeExporter commented 9 years ago

Do we have evidence that anyone is interested in an unsorted version?  As for 
"grouping" vs. "partition," is this exclusively a naming question, or does it 
go deeper than that?

Original comment by cpov...@google.com on 29 Aug 2012 at 1:17

GoogleCodeExporter commented 9 years ago

I feel that partitioning is essentially separate from ordering. To impose an 
ordering would seem presumptuous to me, even though there are legitimate cases 
for when you wish to partition over sorted data. The problem is the ordering of 
elements might be difficult to represent in a meaningful way with an ordering. 
While its origin might understand how the data is ordered, we might not, nor 
should we be required to know it.

Perhaps a bit of a contrived example but its one that captures the idea pretty 
well; Say you have a stream of genetic base pairs, its ordering is sort of 
arbitrary but relevant to the functioning of the genes. You know that in many 
cases you will have repeated entries (stuff like; AAAAAAAAAATTTTTTTT) and you 
want to handle these in a single operation, partitioning on when you go from 
one base pair to another would be the simplest way.

I know of some legitimate use-cases for this that are less contrived, most of 
the time it boils down to; "The order of these elements is magical, I can 
process them one at a time, but I'd rather do it in bulk."

Original comment by e...@google.com on 30 Aug 2012 at 7:58

GoogleCodeExporter commented 9 years ago

My fear with the package-private SortedLists class before and with the proposed 
methods now is the same -- that we're encouraging people to store their data in 
the "wrong" data structures.  Sorted data often belongs in a SortedSet; grouped 
data often belongs in a Multimap.  When possible, I'd prefer to provide 
utilities that make it easy to convert data into the more natural form (for 
those examples, ImmutableSortedSet.copyOf and Multimaps.index).

The question in my mind is "When is it impractical to store data in the 'right' 
data structure?"  I'm still trying to pin down the answer.

Original comment by cpov...@google.com on 30 Aug 2012 at 3:28

GoogleCodeExporter commented 9 years ago

To elaborate on my last comment: So far, we haven't added 
Iterators.head(Comparator, T), Iterators.uniqueKeys(Function), or 
Iterators.lastIndexOf(T).  Why?  Usually, the easiest way to handle 
sorted/grouped/indexed data is to put it in a sorted/grouped/indexed data 
structure.  The alternative is to add one-off methods with non-obvious 
preconditions (or to name the methods 
"partitionAlreadyPartitionedButNotTHATKindOfPartitioned").

To solve the preconditions problem, I'm proposing to segregate the methods that 
operate on sorted iterators into a separate type.  (I'm still kind of down on 
the iterator support in general, but I can probably live with it.)  Another 
advantage is that we can more easily support "keyed" data, which still sounds 
like the most common case.

Original comment by cpov...@google.com on 6 Sep 2012 at 9:40

GoogleCodeExporter commented 9 years ago

This issue has been migrated to GitHub.

It can be found at https://github.com/google/guava/issues/<id>

Original comment by cgdecker@google.com on 1 Nov 2014 at 4:14

Added labels: MigratedToGitHub

GoogleCodeExporter commented 9 years ago

Original comment by cgdecker@google.com on 1 Nov 2014 at 4:18

Changed state: Migrated

GoogleCodeExporter commented 9 years ago

Original comment by cgdecker@google.com on 3 Nov 2014 at 9:08

Added labels: Restrict-AddIssueComment-Commit

mkodekar / guava-libraries

Aggregation or partition structure added to Iterables #1010