Closed GoogleCodeExporter closed 9 years ago
Original comment by kevinb@google.com
on 18 Jul 2011 at 3:41
Any ideas for how we'd disambiguate this from Lists.partition?
Original comment by wasserman.louis
on 16 Oct 2011 at 11:04
Good question. Come to think of it, I typically use this functionality for
distributing work loads equally over n workers. So I would probably call it
Lists.distribute(). Since the strings 'partition' and 'distribute' are
alphanumerically far from each other (think auto-complete in IDE's), I'd also
prominently cross-link their Javadocs.
Original comment by Hannes.S...@gmail.com
on 17 Oct 2011 at 12:12
Original comment by fry@google.com
on 10 Dec 2011 at 3:51
For reference, this would basically be implemented as a one-liner:
List<List<E>> distribute(List<E> list, int partitions) {
return partition(list, IntMath.divide(list.size(), partitions, RoundingMode.CEILING));
}
Original comment by wasserman.louis
on 31 Dec 2011 at 5:24
Sure? With your implementation (pseudo-code):
distribute( [1, 2, 3, 4] ), 3 )
I get
[ [1,2], [3,4] ]
whereas it should give
[ [ 1, 2 ], [ 3 ], [ 4 ] ]
Note that distribute() should be guaranteed to return the specified number of
sublists and that the sublists should be sized equally with a tolerance of 1.
Ceiling division is simply too greedy for this application.
Original comment by Hannes.S...@gmail.com
on 3 Jan 2012 at 6:37
Hrrrrrrm. I see what you're driving at, but how would you go about documenting
those semantics exactly?
Original comment by wasserman.louis
on 3 Jan 2012 at 7:41
The pre- and post-conditions on Lists.distribute() are of similar complexity as
those of Lists.partition(). The distribute() method is not just some sugar on
top of partition(), it complements partition() by letting the user constrain
the number of partitions rather than their size. A good Javadoc preamble with a
few examples would sufficiently explain this.
There is one edge case that isn't covered by the patch. If the size of the
input list is less than the requested number of partitions, distribute() should
probably return empty sublists in order to satisfy distribute(l,n).size() == n.
Original comment by Hannes.S...@gmail.com
on 4 Jan 2012 at 2:38
Wait. Describe how you would document the semantics for which partitions are
of what size.
Original comment by wasserman.louis
on 4 Jan 2012 at 3:57
The partition() Javadocs do not explicitly mention the number of returned
sublists. Similarly, distribute() shouldn't go into detail about what exact
size they are. It is simply not necessary to document this aspect because it is
irrelevant for the use case. Only two constraints are of concern here: the
number of sublists always matches the requested number and the sublists are of
approximately equal size. Your proposed implementation doesn't satisfy the
former constraint and is thus not applicable to the use case.
Maybe I should elaborate on the use case. Say, you have a list of 100 tasks and
you want to process them with
three worker (threads, cluster nodes, etc.). If you use partition( tasks, 100/3
) you get four sublists (three of size 33 one of size 1) which is inconvenient
for this case because you need to handle the last list specially. If you
'round' the quotient to the ceiling as you suggest you get three lists which is
fine. Now, let's say you have 99 workers. This is where the ceiling method
fails because it yields only 50 sublists, not enough to utilize all workers.
The correct implementation for the use case should return a fixed number of
sublists (such that each worker is utilized) of roughly the same size (such
that all workers get similar load). HTH
Original comment by Hannes.S...@gmail.com
on 4 Jan 2012 at 5:04
Lucky you. You are given all the tasks upfront, their count, and can do random
access on them (you could be missing any and all of those). Isn't this simple
enough already? :-)
Original comment by jim.andreou
on 4 Jan 2012 at 5:47
"Returns consecutive sublists of a list, each of the same size (the final list
may be smaller)."
This does give me enough information to deduce the size of each sublist. I
still don't have enough information to do it in your use case.
Original comment by wasserman.louis
on 4 Jan 2012 at 3:52
As I said, a user of distribute() simply does not need to know the size of each
sublist. It is irrelevant. Not every pre- and post-condition of a library
method needs to or even should be specified. Sometimes it is better to leave
certain aspects unspecified such that you are free to change them in a later
release. I think that the sublist exact sizes are such an aspect.
"This method will always return the requested number of sublists. The sizes of
the returned sublists differ by at most 1." That's all one needs to know.
Also, remember that partition() lets the user request the sublist size. Because
of that, it is safe to assume that the user cares about their sizes and it is
important to mention that at most one sublist may deviate from the specified
size. With distribute() the user cares about the number of sublists, not their
sizes so it is OK to leave that unspecified, just like partition() doesn't
mention the number of sublists.
Original comment by Hannes.S...@gmail.com
on 4 Jan 2012 at 7:02
If you want to leave it unspecified in the documentation, that's fine, but the
source code needs to have exact semantics, which I still don't understand.
I'm leaning towards jim.andreou's perspective.
Original comment by wasserman.louis
on 4 Jan 2012 at 7:20
Hmmmm. I think I'm okay with this now. Let me double-check that these
semantics as are you expect, even if they're undocumented: the first
(list.size() % parts) partitions will have length (list.size() / parts + 1),
and the rest will all have list.size() / parts. Fair?
Original comment by wasserman.louis
on 4 Jan 2012 at 7:53
I didn't get what jim.andreou was trying to say so I don't know what his
perspective is. Wasn't he just making a joke? Can you elaborate on where you're
leaning towards?
I have implemented similar functionality for Iterables (if that's what Jim is
getting at) but haven't gotten around to sharing it. The pace at which this
issue has been progressing wasn't exactly encouraging to share more.
To answer your question, it's actually quite simple: In
ls = distribute(l,n)
q = l.size() / n
r = l.size() % n
The first r sublists of ls are of size q + 1, the rest (that is n - r sublists)
are of size q. Basically, the division remainder is distributed over the first
r sublists.
As mentioned earlier, the attached patch still violates the post-condition
ls.size() == n if l.size() < n. Unlike partition(), the patch also fails to
propagate the RandomAccess marker interface to the result. So there is a little
more work to do.
Original comment by Hannes.S...@gmail.com
on 4 Jan 2012 at 8:14
Also, the patch isn't really a patch, just code. Sorry about that.
Original comment by Hannes.S...@gmail.com
on 4 Jan 2012 at 8:16
Our posts overlapped, and you got it right.
Original comment by Hannes.S...@gmail.com
on 4 Jan 2012 at 8:18
How does this patch look to you?
http://code.google.com/r/wassermanlouis-guava/source/detail?r=f04d62ee1bfb547edd
1aca267dc20b1937396b51&name=lists-distribute
Original comment by wasserman.louis
on 4 Jan 2012 at 8:54
I was trying to say, just commenting on the use case, that this is the simplest
possible version of the problem, so simple that you even consider using List
(or even Iterable/Iterator) to do concurrent programming. Not common.
For example, consider that list. Why do you add the tasks to the list, instead
off handing them to your threads to start working immediately? You could be
using a {,Blocking}Queue instead, or, typically, submitting to a
ExecutorService, or even doing an #invokeAll if you really do get all your
tasks in one big, undivisible step.
There are so many ways to express solutions to similar concurrency problems,
and we should be very, very selective about the repertoire of constructs we
build (the state of affairs is already bad enough)
Original comment by jim.andreou
on 4 Jan 2012 at 9:35
Looks great. More readable than what I did.
Two minor things: In the JavaDocs I would explicitly mention that the result is
always of the requested size.
The other thing is that there might be a case for optionally omitting empty
sublists in the result. In fact, whenever I needed distribute() functionality I
never wanted the empty sublists at the end. But this is of course anecdotal
evidence. What I did was equivalent to
public static <T> List<List<T>> distributeStrictly(List<T> list, int parts) {
checkNotNull(list);
checkArgument(parts > 0);
return (list instanceof RandomAccess)
? new RandomAccessDistribution<T>(list, parts)
: new Distribution<T>(list, parts);
}
public static <T> List<List<T>> distribute(List<T> list, int parts) {
checkNotNull(list);
checkArgument(parts > 0);
parts = Math.min( parts, list.size() );
return (list instanceof RandomAccess)
? new RandomAccessDistribution<T>(list, parts)
: new Distribution<T>(list, parts);
}
but you'd probably would want to factor the return statement into a separate
method.
Original comment by han...@eyealike.com
on 4 Jan 2012 at 9:53
Kevin, what do you think of comment #20? I was surprised to see that this bug
was accepted, but maybe you have some uses in mind that this utility would be
especially good for...?
Original comment by cpov...@google.com
on 2 Feb 2012 at 8:43
I'm going to throw in my +1 argument: the code is right about the same level of
complexity as Lists.partition, and the first application that I can think of is
"breaking up tasks to parallelize."
Original comment by wasserman.louis
on 2 Feb 2012 at 8:50
Huh. I did seem to mark it Accepted with no explanation. I suppose that
happened just because this issue has always bothered me.
Say you're breaking up a List to display in a table or whatever. Right now we
say "Want to break into N columns? Great, here you go! Want to break into N
rows? Sorry, you're completely 100% out of luck on that." It seems a little
inexplicably inconsistent to me.
Original comment by kevinb@google.com
on 2 Feb 2012 at 9:20
The implementation is nontrivial enough that I'm dubious of comment 20 (that
it's already super easy), but easy enough that it's not a significant burden
for us.
Original comment by wasserman.louis
on 2 Feb 2012 at 9:26
I didn't think that comment 20 was "it's already easy" so much as "you probably
don't want to divide the tasks all up front": Either you don't know how many
tasks you have, or the tasks are subdivided at multiple levels, or the tasks
may be of different sizes, or whatever. If you put them in a queue/executor
and let the workers claim them as they finish their previous tasks, then you're
done. Plus, you may naturally get a handle on the results in a way that you
don't here.
Original comment by cpov...@google.com
on 2 Feb 2012 at 9:29
Hrrrrrmkay. I might claim that in an environment where communication is
expensive -- a distributed application, not multiprocessor-concurrent -- that
it might still be preferable to divide the tasks up front, rather than pay the
expense of waiting for each worker to communicate, "Hey, I'm done with that
task, feed me another." But Kevin's point is also well-taken.
Original comment by wasserman.louis
on 2 Feb 2012 at 9:58
Adding few more examples for the broader problem:
An example of distributed programming: map/reduce. A mapper doesn't create a
huge list of "tasks" (the entries), then divides it to send a piece to each
reducer. It immediately divides the entries (by hashing, typically), and it
doesn't know how many entries it will produce.
An example of parallel programming: fork/join
In particular, if a user already has all the tasks upfront, going with
ParallelArray is better:
http://gee.cs.oswego.edu/dl/jsr166/dist/extra166ydocs/extra166y/ParallelArray.ht
ml
e.g. a withMapping(...).all() would produce the results of all tasks, and do so
in parallel.
The latter also works great with _irregular_ tasks. Note that the suggested
partition() assumes that all tasks are created equal, though in practice, some
tasks are more equal than others, making its niche even smaller.
I didn't quite get the row/column comment - clarification?
Original comment by jim.andreou
on 3 Feb 2012 at 12:23
My understanding of the proposed method is that it would group items into a
certain number of "rows" (# of sublists) in the same way that Lists.partition()
groups them into a specific number of "columns" (items per sublist). The two
seem pretty equally valid in *concept*.
Original comment by kevin...@gmail.com
on 3 Feb 2012 at 12:32
Well, if you embed a 2D array into a 1D list, if you use a row-major order,
then the current partition() would divide rows, and if you use a column-major
order, then partition() would divide columns, which is fair.
True, one can't get a row partitioning if he uses a column-major ordering, or
vice versa. To see if this is important, we would need examples where the user
is forced to use the wrong ordering, or examples where the user is forced to
partition in both orderings (ensuring that one of the two is inconvenient)...
Original comment by jim.andreou
on 3 Feb 2012 at 1:21
No, you are missing what I (and the OP) are trying to say.
Say you have items ABCDEFGHIJ.
We let you group into groups of 3:
ABC
DEF
GHI
J
We let you group into groups of 2:
AB
CD
EF
GH
IJ
But if what you care about is not the *size* of each group but the *number* of
groups -- for example, you want four of them -- you're out of luck.
Worse than being out of luck, when I was faced with this problem, I thought
"ah, I'll just do this simple math and then call partition and voila!" Someone
else had to point out to me that my simple math was completely wrong and there
is just no way to get this behavior.
The proposed method would do it.
ABC
DEF
GH
IJ
Simple as that. You have (n % k) of size ceil(n/k), then the rest of size
floor(n/k).
Original comment by kevinb@google.com
on 3 Feb 2012 at 1:54
Yep. The behavior Kevin describes is _not_ possible to do with Lists.partition
alone.
Original comment by wasserman.louis
on 3 Feb 2012 at 1:56
(This is not to comment on whether we do or don't have enough evidence that
this is widely-needed to actually add it.)
Original comment by kevinb@google.com
on 3 Feb 2012 at 1:57
Let's forget about my concurrency example for a while, it was just that, an
example of a possible use case.
My case is that distribute() is symmetric to partition(). If you provide one
you should provide the other. The existence of both (with cross-linked
Javadocs) will alert the library user of the two post-conditions that affect
the result: the sublist size and the sublist count. Without distribute(), a
library user wanting a fixed number of sublists is likely to make the innocent
mistake of using partition( list, list.size / numSubLists ) which does not what
they intend.
This happend to me and in a milder form to Louis (see comment
http://code.google.com/p/guava-libraries/issues/detail?id=451#c5).
Original comment by Hannes.S...@gmail.com
on 3 Feb 2012 at 3:12
Ah, I see, I had completely misunderstood the point of Kevin's comment.
Btw, one could think distribute(list, groups) as partition(list, ceil(|list| /
groups) and transposing the result of the latter, if viewed as a matrix. (Of
course, a List<List<T>> would be a poor type to represent a matrix, to begin
with).
Original comment by jim.andreou
on 3 Feb 2012 at 3:48
I'm not sure that's correct, though. The rounding doesn't really work out,
even with ceil. Unless I misunderstand the point of the transpose?
Original comment by wasserman.louis
on 3 Feb 2012 at 4:07
Hannes, it is the concurrency angle for which I really disagree with suggesting
this addition to Lists as the proper solution. If we take the concurrency use
case off the table, I don't have any opinion on this in either direction. (I
couldn't, I can hardly recall ever using partition() itself - any user of that
is already more experienced in the matter).
Original comment by jim.andreou
on 3 Feb 2012 at 4:08
Louis, how about transpose(partition(list, groups)) then. If this is wrong too,
don't make me pull pen and paper, figure it yourself! :-) (in the meantime, let
me join the club of people doing the math wrong with this problem, I guess?)
Original comment by jim.andreou
on 3 Feb 2012 at 4:16
I think that transpose(partition(list, groups)) is correct...since, after all,
it corresponds exactly to Kevin's rows/columns description.
Original comment by wasserman.louis
on 3 Feb 2012 at 6:24
If it is correct, of course it matches Kevin's, yours, Hannes', and all other
equivalent descriptions of this request, while adding the virtue of
conciseness. Why say a long story when 4 words are enough? :)
Original comment by jim.andreou
on 3 Feb 2012 at 7:57
Because it's a different story.
1) it orders elements differently:
ABCDEFGHIJ
after partition(4):
ABCD
EFGH
IJ
after transpose:
AEI
BFJ
CG
DE
2) I didn't look at how transpose() is implemented but I don't think it returns
sublist views like distribute() would.
3) It may be significantly more expensive. Why would we want to arbitrarily
penalize one one out of two equally symmetric points of view? But this is the
least of my headaches with the proposed alternative. The lack of write-through
and the different ordering are bigger concerns.
Original comment by Hannes.S...@gmail.com
on 3 Feb 2012 at 9:14
This is just a short mathematical description, not an implementation (it is
though O(1) to create a transposed view). Note though, since you value
symmetry, that the result of this description is even more symmetrical to
partition() than the version you suggest (each one would be a transpose() away
from each other, whereas in your version it would take more than one mirror to
show the symmetry...), so if you think this is a valid argument, you should
favoring this flavor instead. Not that symmetry alone is enough for inclusion,
to be sure, there would have to be enough verbose code, code that would be
saying 'the long story' and which would be compressed to a short story if we
add this new word -- much like how the word 'transpose' simplified the
description of what a lot of people where describing to each other.
Original comment by andr...@google.com
on 3 Feb 2012 at 6:20
For the record, can we all agree that transpose(partition(list,n)) is not
equivalent to distribute(list,n) because the ordering is different?
If you want to ditch this effort at this stage and start from scratch writing
an O(1) version of transpose() (O(1) wrt to both time and memory) , that
handles the edge cases (jagged sublists, for example), implements write-through
and then also explain the interleaved ordering in your "short story", be my
guest. I'm afraid I have spent too much time on this, already. I salute your
tenacity in preventing this from happening, though.
Original comment by Hannes.S...@gmail.com
on 4 Feb 2012 at 5:13
Nothing's been prevented.
Original comment by kevinb@google.com
on 4 Feb 2012 at 5:15
I am absolutely pro-having these arguments. (This is exactly what happens
internally for just about every change.)
I am -1 on transpose, though, which seems awfully difficult to describe for the
edge cases. Distribute is nice, elegant, simple, and, I still believe, useful.
Original comment by wasserman.louis
on 4 Feb 2012 at 5:55
Can we get on with it, then? I understand concerns about library size, Guava
becoming kitchen-sink and so on. But this is not a about adding a whole new
flurry of classes, it's about adding 60 lines of straight-forward code,
guaranteed to not affect backwards-compatibility, with 100% test coverage.
Sorry for sounding cranky, it's been a long week.
Original comment by Hannes.S...@gmail.com
on 4 Feb 2012 at 5:57
Hannes, you've been very patient with this discussion, and I appreciate that,
but I think we should all take a break from it and ask you to patient a bit
longer. It is very unlikely we're going to add this method in time for Guava
12 anyway.
My concern is not with this method looked at as an individual addition. You
may believe that that's all this is about, since you talk about how many lines
of code it has and how good its test coverage is. But these are not major
concerns.
Rather, each small change means we have to look at the API as a whole in a
fresh light. With this we'd now have three versions of partition: fixed
sublist count, fixed sublist size, fixed sublist size with padding. These are
spread out oddly across Iterables, Iterators and Lists. That's a partial cross
product (even more so than we already have), and that's awkward. How to name
the new method is also a major concern, and there is a very strong chance that
we would need to go through the pain of renaming the existing partition()
method if we add this. Why: because both forms are *exactly* "partitioning";
there's absolutely nothing in the partition name to suggest fixed sublist size.
There's more: if you can partition() an Iterable, shouldn't you be able to
"distribute" a Collection? Why or why not?
There's more: the behavior of the List<List> returned by Lists.partition()
stays relatively sane even if the backing list changes. That might not be true
with this one, and we may need to think about how to specify it properly or
whether we have to copy sublists like Iterables.partition does.
These decisions are a lot more intricate than it seems, and we do not yet have
enough evidence justifying the *broad* applicability of this method to even
justify the amount of time we've already spent discussing it.
Original comment by kevinb@google.com
on 4 Feb 2012 at 6:29
Hannes, I think you misunderstood my comment. When I say this: "This is just a
short mathematical description, not an implementation (it is though O(1) to
create a transposed view)", the _only_ reason I state the complexity is because
your (3) point in #41, which implied that matrix transposition is costly (it is
not). Louis already has an O(1) implementation anyway. Hope it's clear now.
Original comment by jim.andreou
on 5 Feb 2012 at 1:02
[deleted comment]
Ok, I see where you guys are coming from now. Let me just summarize my point of
view and then I'll leave you to your decision making. Of course, there is no
rush to get this out. I was more concerned about the amount of discussion that
had already gone into this issue but I see now that your concerns as the (I
assume) maintainers are at a higher level than mine.
Based on my own anecdotal evidence, distribute() is more useful than
partition(). Its post-condition is slightly harder to achieve and I think more
useful in the real world. But again, this is purely anecdotal evidence. I would
like to see both included, if merely to raise awareness of their symmetry.
Having a transpose() on lists is also useful but to say transpose(partition) ==
distribute is just wrong because of the different ordering they produce. This
is my main concern. After all, were talking about lists here whose main
property is consistent ordering. The interleaved ordering produced by
transpose(partition) would come as a big surprise to many users.
Louis, how does your transpose() handle jagged sublists?
Kevin, why do you think partition() behaves more sanely in case of changes to
the backing list? I tend to think that distribute() and partition() behave
equally bad/well in that case.
Original comment by Hannes.S...@gmail.com
on 5 Feb 2012 at 8:38
Original issue reported on code.google.com by
Hannes.S...@gmail.com
on 16 Oct 2010 at 10:44Attachments: