mengdiwang / guava-libraries

Automatically exported from code.google.com/p/guava-libraries
Apache License 2.0

Implement List.partition alternative that accepts number of partitions rather than their size #451

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
The current Lists.partition() guarantees that all but the last partition are of 
the specified size and that the last one may be smaller but not empty. It would 
be nice to have a version of Lists.partition that guarantees that it returns a 
specified number of roughly equally sized partitions (their sizes may differ 
by at most 1).
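For example, splitting [1, 2, 3, 4, 5, 6, 7] into 3 partitions could yield 
[ [1, 2, 3], [4, 5], [6, 7] ] -- any split whose sizes differ by at most 1 
would satisfy the invariant.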

The attached class does exactly that. It's lacking proper argument checking. I 
also attached a minimal  test case to illustrate the invariant.

Original issue reported on code.google.com by Hannes.S...@gmail.com on 16 Oct 2010 at 10:44

Attachments:

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 18 Jul 2011 at 3:41

GoogleCodeExporter commented 9 years ago
Any ideas for how we'd disambiguate this from Lists.partition?

Original comment by wasserman.louis on 16 Oct 2011 at 11:04

GoogleCodeExporter commented 9 years ago
Good question. Come to think of it, I typically use this functionality for 
distributing workloads equally over n workers. So I would probably call it 
Lists.distribute(). Since the strings 'partition' and 'distribute' are 
alphabetically far from each other (think auto-complete in IDEs), I'd also 
prominently cross-link their Javadocs.

Original comment by Hannes.S...@gmail.com on 17 Oct 2011 at 12:12

GoogleCodeExporter commented 9 years ago

Original comment by fry@google.com on 10 Dec 2011 at 3:51

GoogleCodeExporter commented 9 years ago
For reference, this would basically be implemented as a one-liner:

static <E> List<List<E>> distribute(List<E> list, int partitions) {
  return partition(list, IntMath.divide(list.size(), partitions, RoundingMode.CEILING));
}

Original comment by wasserman.louis on 31 Dec 2011 at 5:24

GoogleCodeExporter commented 9 years ago
Sure? With your implementation (pseudo-code):

distribute( [1, 2, 3, 4], 3 )

I get 

[ [1,2], [3,4] ]

whereas it should give

[ [ 1, 2 ], [ 3 ], [ 4 ] ]

Note that distribute() should be guaranteed to return the specified number of 
sublists and that the sublists should be sized equally with a tolerance of 1. 
Ceiling division is simply too greedy for this application. 

Original comment by Hannes.S...@gmail.com on 3 Jan 2012 at 6:37

GoogleCodeExporter commented 9 years ago
Hrrrrrrm.  I see what you're driving at, but how would you go about documenting 
those semantics exactly?

Original comment by wasserman.louis on 3 Jan 2012 at 7:41

GoogleCodeExporter commented 9 years ago
The pre- and post-conditions on Lists.distribute() are of similar complexity to 
those of Lists.partition(). The distribute() method is not just some sugar on 
top of partition(); it complements partition() by letting the user constrain 
the number of partitions rather than their size. A good Javadoc preamble with a 
few examples would sufficiently explain this.

There is one edge case that isn't covered by the patch. If the size of the 
input list is less than the requested number of partitions, distribute() should 
probably return empty sublists in order to satisfy distribute(l,n).size() == n.
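For example, distribute([1, 2], 4) would then yield [ [1], [2], [], [] ].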

Original comment by Hannes.S...@gmail.com on 4 Jan 2012 at 2:38

GoogleCodeExporter commented 9 years ago
Wait.  Describe how you would document the semantics for which partitions are 
of what size.

Original comment by wasserman.louis on 4 Jan 2012 at 3:57

GoogleCodeExporter commented 9 years ago
The partition() Javadocs do not explicitly mention the number of returned 
sublists. Similarly, distribute() shouldn't go into detail about what exact 
size they are. It is simply not necessary to document this aspect because it is 
irrelevant for the use case. Only two constraints are of concern here: the 
number of sublists always matches the requested number and the sublists are of 
approximately equal size. Your proposed implementation doesn't satisfy the 
former constraint and is thus not applicable to the use case.

Maybe I should elaborate on the use case. Say you have a list of 100 tasks and 
you want to process them with three workers (threads, cluster nodes, etc.). If 
you use partition( tasks, 100/3 ) you get four sublists (three of size 33 and 
one of size 1), which is inconvenient for this case because you need to handle 
the last list specially. If you 'round' the quotient to the ceiling as you 
suggest, you get three lists, which is fine. Now, let's say you have 99 
workers. This is where the ceiling method fails: it yields only 50 sublists, 
not enough to utilize all workers. The correct implementation for the use case 
should return a fixed number of sublists (such that each worker is utilized) of 
roughly the same size (such that all workers get similar load). HTH
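
Spelling the arithmetic out (illustrative; distribute() here stands for the 
requested fixed-count behavior):

partition( 100 tasks, 100/3 = 33 )        ->  sizes 33, 33, 33, 1  (4 sublists)
partition( 100 tasks, ceil(100/3) = 34 )  ->  sizes 34, 34, 32     (3 sublists)
partition( 100 tasks, ceil(100/99) = 2 )  ->  50 sublists of size 2
distribute( 100 tasks, 99 )               ->  1 sublist of size 2, 98 of size 1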

Original comment by Hannes.S...@gmail.com on 4 Jan 2012 at 5:04

GoogleCodeExporter commented 9 years ago
Lucky you. You are given all the tasks upfront, their count, and can do random 
access on them (you could be missing any and all of those). Isn't this simple 
enough already? :-) 

Original comment by jim.andreou on 4 Jan 2012 at 5:47

GoogleCodeExporter commented 9 years ago
"Returns consecutive sublists of a list, each of the same size (the final list 
may be smaller)."

This does give me enough information to deduce the size of each sublist.  I 
still don't have enough information to do it in your use case.

Original comment by wasserman.louis on 4 Jan 2012 at 3:52

GoogleCodeExporter commented 9 years ago
As I said, a user of distribute() simply does not need to know the size of each 
sublist. It is irrelevant. Not every pre- and post-condition of a library 
method needs to or even should be specified. Sometimes it is better to leave 
certain aspects unspecified so that you are free to change them in a later 
release. I think the exact sublist sizes are such an aspect. 

"This method will always return the requested number of sublists. The sizes of 
the returned sublists differ by at most 1." That's all one needs to know.

Also, remember that partition() lets the user request the sublist size. Because 
of that, it is safe to assume that the user cares about their sizes and it is 
important to mention that at most one sublist may deviate from the specified 
size. With distribute() the user cares about the number of sublists, not their 
sizes so it is OK to leave that unspecified, just like partition() doesn't 
mention the number of sublists.

Original comment by Hannes.S...@gmail.com on 4 Jan 2012 at 7:02

GoogleCodeExporter commented 9 years ago
If you want to leave it unspecified in the documentation, that's fine, but the 
source code needs to have exact semantics, which I still don't understand.

I'm leaning towards jim.andreou's perspective.

Original comment by wasserman.louis on 4 Jan 2012 at 7:20

GoogleCodeExporter commented 9 years ago
Hmmmm.  I think I'm okay with this now.  Let me double-check that these 
semantics are as you expect, even if they're undocumented: the first 
(list.size() % parts) partitions will have length (list.size() / parts + 1), 
and the rest will all have length list.size() / parts.  Fair?
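(With list.size() = 10 and parts = 4, for instance, that gives 10 % 4 = 2 
partitions of length 3 followed by 2 partitions of length 2.)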

Original comment by wasserman.louis on 4 Jan 2012 at 7:53

GoogleCodeExporter commented 9 years ago
I didn't get what jim.andreou was trying to say, so I don't know what his 
perspective is. Wasn't he just making a joke? Can you elaborate on which way 
you're leaning?

I have implemented similar functionality for Iterables (if that's what Jim is 
getting at) but haven't gotten around to sharing it. The pace at which this 
issue has been progressing hasn't exactly encouraged me to share more.

To answer your question, it's actually quite simple: In 

ls = distribute(l,n) 
q = l.size() / n
r = l.size() % n

The first r sublists of ls are of size q + 1, the rest (that is n - r sublists) 
are of size q. Basically, the division remainder is distributed over the first 
r sublists. 
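
A rough sketch of that in code (illustrative only; eager, without the view and 
RandomAccess handling of the attached class, and java.util imports assumed):

static <T> List<List<T>> distribute(List<T> list, int n) {
  int q = list.size() / n;
  int r = list.size() % n;
  List<List<T>> result = new ArrayList<List<T>>(n);
  int index = 0;
  for (int i = 0; i < n; i++) {
    int size = (i < r) ? q + 1 : q;  // the first r sublists get one extra element
    result.add(list.subList(index, index + size));
    index += size;
  }
  return result;
}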

As mentioned earlier, the attached patch still violates the post-condition 
ls.size() == n if l.size() < n. Unlike partition(), the patch also fails to 
propagate the RandomAccess marker interface to the result. So there is a little 
more work to do.

Original comment by Hannes.S...@gmail.com on 4 Jan 2012 at 8:14

GoogleCodeExporter commented 9 years ago
Also, the patch isn't really a patch, just code. Sorry about that.

Original comment by Hannes.S...@gmail.com on 4 Jan 2012 at 8:16

GoogleCodeExporter commented 9 years ago
Our posts overlapped, and you got it right.

Original comment by Hannes.S...@gmail.com on 4 Jan 2012 at 8:18

GoogleCodeExporter commented 9 years ago
How does this patch look to you? 
http://code.google.com/r/wassermanlouis-guava/source/detail?r=f04d62ee1bfb547edd1aca267dc20b1937396b51&name=lists-distribute

Original comment by wasserman.louis on 4 Jan 2012 at 8:54

GoogleCodeExporter commented 9 years ago
I was trying to say, just commenting on the use case, that this is the simplest 
possible version of the problem, so simple that you even consider using List 
(or even Iterable/Iterator) to do concurrent programming. Not common.

For example, consider that list. Why do you add the tasks to the list, instead 
off handing them to your threads to start working immediately? You could be 
using a {,Blocking}Queue instead, or, typically, submitting to a 
ExecutorService, or even doing an #invokeAll if you really do get all your 
tasks in one big, undivisible step.

There are so many ways to express solutions to similar concurrency problems, 
and we should be very, very selective about the repertoire of constructs we 
build (the state of affairs is already bad enough).

Original comment by jim.andreou on 4 Jan 2012 at 9:35

GoogleCodeExporter commented 9 years ago
Looks great. More readable than what I did. 

Two minor things: In the JavaDocs I would explicitly mention that the result is 
always of the requested size. 

The other thing is that there might be a case for optionally omitting empty 
sublists in the result. In fact, whenever I needed distribute() functionality I 
never wanted the empty sublists at the end. But this is of course anecdotal 
evidence.  What I did was equivalent to 

public static <T> List<List<T>> distributeStrictly(List<T> list, int parts) {
  checkNotNull(list);
  checkArgument(parts > 0);
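  // keep exactly 'parts' sublists, including empty trailing ones when parts > list.size()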
  return (list instanceof RandomAccess)
      ? new RandomAccessDistribution<T>(list, parts)
      : new Distribution<T>(list, parts);
}

public static <T> List<List<T>> distribute(List<T> list, int parts) {
  checkNotNull(list);
  checkArgument(parts > 0);
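  // clamp parts so the result contains no empty trailing sublists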
  parts = Math.min( parts, list.size() );
  return (list instanceof RandomAccess)
      ? new RandomAccessDistribution<T>(list, parts)
      : new Distribution<T>(list, parts);
}

but you'd probably want to factor the return statement into a separate method.
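
For example (assuming the Distribution classes implement the semantics agreed 
on above):

distributeStrictly( asList(1, 2), 4 )  ->  [ [1], [2], [], [] ]
distribute( asList(1, 2), 4 )          ->  [ [1], [2] ]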

Original comment by han...@eyealike.com on 4 Jan 2012 at 9:53

GoogleCodeExporter commented 9 years ago
Kevin, what do you think of comment #20?  I was surprised to see that this bug 
was accepted, but maybe you have some uses in mind that this utility would be 
especially good for...?

Original comment by cpov...@google.com on 2 Feb 2012 at 8:43

GoogleCodeExporter commented 9 years ago
I'm going to throw in my +1 argument: the code is at roughly the same level of 
complexity as Lists.partition, and the first application that I can think of is 
"breaking up tasks to parallelize."

Original comment by wasserman.louis on 2 Feb 2012 at 8:50

GoogleCodeExporter commented 9 years ago
Huh.  I did seem to mark it Accepted with no explanation.  I suppose that 
happened just because this issue has always bothered me.

Say you're breaking up a List to display in a table or whatever.  Right now we 
say "Want to break into N columns?  Great, here you go!  Want to break into N 
rows?  Sorry, you're completely 100% out of luck on that."  It seems a little 
inexplicably inconsistent to me.

Original comment by kevinb@google.com on 2 Feb 2012 at 9:20

GoogleCodeExporter commented 9 years ago
The implementation is nontrivial enough that I'm dubious of comment 20 (that 
it's already super easy), but easy enough that it's not a significant burden 
for us.

Original comment by wasserman.louis on 2 Feb 2012 at 9:26

GoogleCodeExporter commented 9 years ago
I didn't think that comment 20 was "it's already easy" so much as "you probably 
don't want to divide the tasks all up front": Either you don't know how many 
tasks you have, or the tasks are subdivided at multiple levels, or the tasks 
may be of different sizes, or whatever.  If you put them in a queue/executor 
and let the workers claim them as they finish their previous tasks, then you're 
done.  Plus, you may naturally get a handle on the results in a way that you 
don't here.

Original comment by cpov...@google.com on 2 Feb 2012 at 9:29

GoogleCodeExporter commented 9 years ago
Hrrrrrmkay.  I might claim that in an environment where communication is 
expensive -- a distributed application, not multiprocessor-concurrent -- that 
it might still be preferable to divide the tasks up front, rather than pay the 
expense of waiting for each worker to communicate, "Hey, I'm done with that 
task, feed me another."  But Kevin's point is also well-taken.

Original comment by wasserman.louis on 2 Feb 2012 at 9:58

GoogleCodeExporter commented 9 years ago
Adding a few more examples for the broader problem:

An example of distributed programming: map/reduce. A mapper doesn't create a 
huge list of "tasks" (the entries), then divides it to send a piece to each 
reducer. It immediately divides the entries (by hashing, typically), and it 
doesn't know how many entries it will produce.

An example of parallel programming: fork/join
In particular, if a user already has all the tasks upfront, going with 
ParallelArray is better:
http://gee.cs.oswego.edu/dl/jsr166/dist/extra166ydocs/extra166y/ParallelArray.html
e.g. a withMapping(...).all() would produce the results of all tasks, and do so 
in parallel.

The latter also works great with _irregular_ tasks. Note that the suggested 
partition() assumes that all tasks are created equal, though in practice, some 
tasks are more equal than others, making its niche even smaller.

I didn't quite get the row/column comment - clarification?

Original comment by jim.andreou on 3 Feb 2012 at 12:23

GoogleCodeExporter commented 9 years ago
My understanding of the proposed method is that it would group items into a 
certain number of "rows" (# of sublists) in the same way that Lists.partition() 
groups them into a specific number of "columns" (items per sublist).  The two 
seem pretty equally valid in *concept*.

Original comment by kevin...@gmail.com on 3 Feb 2012 at 12:32

GoogleCodeExporter commented 9 years ago
Well, if you embed a 2D array into a 1D list: with row-major order the current 
partition() would divide rows, and with column-major order it would divide 
columns, which is fair.

True, one can't get a row partitioning if he uses a column-major ordering, or 
vice versa. To see if this is important, we would need examples where the user 
is forced to use the wrong ordering, or examples where the user is forced to 
partition in both orderings (ensuring that one of the two is inconvenient)...

Original comment by jim.andreou on 3 Feb 2012 at 1:21

GoogleCodeExporter commented 9 years ago
No, you are missing what I (and the OP) are trying to say.

Say you have items ABCDEFGHIJ.

We let you group into groups of 3:

 ABC
 DEF
 GHI
 J

We let you group into groups of 2:

 AB
 CD
 EF
 GH
 IJ

But if what you care about is not the *size* of each group but the *number* of 
groups -- for example, you want four of them -- you're out of luck.

Worse than being out of luck, when I was faced with this problem, I thought 
"ah, I'll just do this simple math and then call partition and voila!"  Someone 
else had to point out to me that my simple math was completely wrong and there 
is just no way to get this behavior.

The proposed method would do it.

  ABC
  DEF
  GH
  IJ

Simple as that.  You have (n % k) of size ceil(n/k), then the rest of size 
floor(n/k).

Original comment by kevinb@google.com on 3 Feb 2012 at 1:54

GoogleCodeExporter commented 9 years ago
Yep.  The behavior Kevin describes is _not_ possible to do with Lists.partition 
alone.

Original comment by wasserman.louis on 3 Feb 2012 at 1:56

GoogleCodeExporter commented 9 years ago
(This is not to comment on whether we do or don't have enough evidence that 
this is widely-needed to actually add it.)

Original comment by kevinb@google.com on 3 Feb 2012 at 1:57

GoogleCodeExporter commented 9 years ago
Let's forget about my concurrency example for a while, it was just that, an 
example of a possible use case. 

My case is that distribute() is symmetric to partition(). If you provide one 
you should provide the other. The existence of both (with cross-linked 
Javadocs) will alert the library user to the two post-conditions that affect 
the result: the sublist size and the sublist count. Without distribute(), a 
library user wanting a fixed number of sublists is likely to make the innocent 
mistake of using partition( list, list.size() / numSubLists ), which does not 
do what they intend. 

This happened to me and, in a milder form, to Louis (see comment 
http://code.google.com/p/guava-libraries/issues/detail?id=451#c5).

Original comment by Hannes.S...@gmail.com on 3 Feb 2012 at 3:12

GoogleCodeExporter commented 9 years ago
Ah, I see, I had completely misunderstood the point of Kevin's comment.

Btw, one could think of distribute(list, groups) as partition(list, ceil(|list| / 
groups)) and transposing the result of the latter, if viewed as a matrix. (Of 
course, a List<List<T>> would be a poor type to represent a matrix, to begin 
with.)

Original comment by jim.andreou on 3 Feb 2012 at 3:48

GoogleCodeExporter commented 9 years ago
I'm not sure that's correct, though.  The rounding doesn't really work out, 
even with ceil.  Unless I misunderstand the point of the transpose?

Original comment by wasserman.louis on 3 Feb 2012 at 4:07

GoogleCodeExporter commented 9 years ago
Hannes, it is the concurrency angle on which I really disagree with suggesting 
this addition to Lists as the proper solution. If we take the concurrency use 
case off the table, I don't have any opinion on this in either direction. (I 
couldn't; I can hardly recall ever using partition() itself - any user of that 
is already more experienced in the matter.) 

Original comment by jim.andreou on 3 Feb 2012 at 4:08

GoogleCodeExporter commented 9 years ago
Louis, how about transpose(partition(list, groups)) then? If this is wrong too, 
don't make me pull out pen and paper, figure it out yourself! :-) (In the 
meantime, let me join the club of people doing the math wrong on this problem, 
I guess?)

Original comment by jim.andreou on 3 Feb 2012 at 4:16

GoogleCodeExporter commented 9 years ago
I think that transpose(partition(list, groups)) is correct...since, after all, 
it corresponds exactly to Kevin's rows/columns description.

Original comment by wasserman.louis on 3 Feb 2012 at 6:24

GoogleCodeExporter commented 9 years ago
If it is correct, of course it matches Kevin's, yours, Hannes', and all other 
equivalent descriptions of this request, while adding the virtue of 
conciseness. Why say a long story when 4 words are enough? :)

Original comment by jim.andreou on 3 Feb 2012 at 7:57

GoogleCodeExporter commented 9 years ago
Because it's a different story. 

1) it orders elements differently: 

ABCDEFGHIJ

after partition(4):

ABCD
EFGH
IJ

after transpose:

AEI
BFJ
CG
DH

2) I didn't look at how transpose() is implemented but I don't think it returns 
sublist views like distribute() would.

3) It may be significantly more expensive. Why would we want to arbitrarily 
penalize one out of two equally symmetric points of view? But this is the 
least of my headaches with the proposed alternative. The lack of write-through 
and the different ordering are bigger concerns.

Original comment by Hannes.S...@gmail.com on 3 Feb 2012 at 9:14

GoogleCodeExporter commented 9 years ago
This is just a short mathematical description, not an implementation (it is, 
though, O(1) to create a transposed view). Note, though, since you value 
symmetry, that the result of this description is even more symmetrical to 
partition() than the version you suggest (each one would be a transpose() away 
from the other, whereas in your version it would take more than one mirror to 
show the symmetry...), so if you think this is a valid argument, you should 
favor this flavor instead. Not that symmetry alone is enough for inclusion, to 
be sure; there would have to be enough verbose code, code that would be saying 
'the long story' and which would be compressed to a short story if we add this 
new word -- much like how the word 'transpose' simplified the description of 
what a lot of people were describing to each other.

Original comment by andr...@google.com on 3 Feb 2012 at 6:20

GoogleCodeExporter commented 9 years ago
For the record, can we all agree that transpose(partition(list,n)) is not 
equivalent to distribute(list,n) because the ordering is different?

If you want to ditch this effort at this stage and start from scratch writing 
an O(1) version of transpose() (O(1) with respect to both time and memory) that 
handles the edge cases (jagged sublists, for example), implements write-through, 
and then also explains the interleaved ordering in your "short story", be my 
guest. I'm afraid I have spent too much time on this already. I salute your 
tenacity in preventing this from happening, though.

Original comment by Hannes.S...@gmail.com on 4 Feb 2012 at 5:13

GoogleCodeExporter commented 9 years ago
Nothing's been prevented.

Original comment by kevinb@google.com on 4 Feb 2012 at 5:15

GoogleCodeExporter commented 9 years ago
I am absolutely pro-having these arguments.  (This is exactly what happens 
internally for just about every change.)

I am -1 on transpose, though, which seems awfully difficult to describe for the 
edge cases.  Distribute is nice, elegant, simple, and, I still believe, useful.

Original comment by wasserman.louis on 4 Feb 2012 at 5:55

GoogleCodeExporter commented 9 years ago
Can we get on with it, then? I understand concerns about library size, Guava 
becoming a kitchen sink and so on. But this is not about adding a whole new 
flurry of classes, it's about adding 60 lines of straightforward code, 
guaranteed not to affect backwards compatibility, with 100% test coverage. 
Sorry for sounding cranky, it's been a long week.

Original comment by Hannes.S...@gmail.com on 4 Feb 2012 at 5:57

GoogleCodeExporter commented 9 years ago
Hannes, you've been very patient with this discussion, and I appreciate that, 
but I think we should all take a break from it and ask you to be patient a bit 
longer.  It is very unlikely we're going to add this method in time for Guava 
12 anyway.

My concern is not with this method looked at as an individual addition.  You 
may believe that that's all this is about, since you talk about how many lines 
of code it has and how good its test coverage is. But these are not major 
concerns.

Rather, each small change means we have to look at the API as a whole in a 
fresh light.  With this we'd now have three versions of partition:  fixed 
sublist count, fixed sublist size, fixed sublist size with padding.  These are 
spread out oddly across Iterables, Iterators and Lists.  That's a partial cross 
product (even more so than we already have), and that's awkward.  How to name 
the new method is also a major concern, and there is a very strong chance that 
we would need to go through the pain of renaming the existing partition() 
method if we add this.  Why: because both forms are *exactly* "partitioning"; 
there's absolutely nothing in the partition name to suggest fixed sublist size.

There's more: if you can partition() an Iterable, shouldn't you be able to 
"distribute" a Collection? Why or why not?

There's more: the behavior of the List<List> returned by Lists.partition() 
stays relatively sane even if the backing list changes.  That might not be true 
with this one, and we may need to think about how to specify it properly or 
whether we have to copy sublists like Iterables.partition does.

These decisions are a lot more intricate than it seems, and we do not yet have 
enough evidence justifying the *broad* applicability of this method to even 
justify the amount of time we've already spent discussing it.

Original comment by kevinb@google.com on 4 Feb 2012 at 6:29

GoogleCodeExporter commented 9 years ago
Hannes, I think you misunderstood my comment. When I say this: "This is just a 
short mathematical description, not an implementation (it is though O(1) to 
create a transposed view)", the _only_ reason I state the complexity is because 
of your point (3) in #41, which implied that matrix transposition is costly (it 
is not). Louis already has an O(1) implementation anyway. Hope it's clear now.

Original comment by jim.andreou on 5 Feb 2012 at 1:02

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Ok, I see where you guys are coming from now. Let me just summarize my point of 
view and then I'll leave you to your decision making. Of course, there is no 
rush to get this out. I was more concerned about the amount of discussion that 
had already gone into this issue but I see now that your concerns as the (I 
assume) maintainers are at a higher level than mine. 

Based on my own anecdotal evidence, distribute() is more useful than 
partition(). Its post-condition is slightly harder to achieve and I think more 
useful in the real world. But again, this is purely anecdotal evidence. I would 
like to see both included, if merely to raise awareness of their symmetry. 

Having a transpose() on lists is also useful, but to say transpose(partition) == 
distribute is just wrong because of the different ordering they produce. This 
is my main concern. After all, we're talking about lists here, whose main 
property is consistent ordering. The interleaved ordering produced by 
transpose(partition) would come as a big surprise to many users.

Louis, how does your transpose() handle jagged sublists?

Kevin, why do you think partition() behaves more sanely in case of changes to 
the backing list? I tend to think that distribute() and partition() behave 
equally bad/well in that case.

Original comment by Hannes.S...@gmail.com on 5 Feb 2012 at 8:38