mengdiwang / guava-libraries

Automatically exported from code.google.com/p/guava-libraries
Apache License 2.0
0 stars 0 forks source link

Add the DisjointSet data structure #521

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
It's a useful data structure that is not so easy to implement by itself, and 
that is very useful in algorithmics, such as in the graph theory, for instance.

Original issue reported on code.google.com by ogregoire on 12 Jan 2011 at 11:58

GoogleCodeExporter commented 9 years ago
We have a few uses for such a thing ourselves.  I started work on it once upon 
a time...

The name has always bugged me, though; "PartitionedSet" would seem to make more 
sense, but perhaps it's not worth going against the grain.

Original comment by kevinb@google.com on 13 Jan 2011 at 1:25

GoogleCodeExporter commented 9 years ago
for the general interest: 
http://en.wikipedia.org/wiki/Disjoint-set_data_structure

Original comment by kevinb@google.com on 13 Jan 2011 at 1:26

GoogleCodeExporter commented 9 years ago
Any thoughts on what the API would look like?

Original comment by wasserman.louis on 2 Feb 2011 at 6:05

GoogleCodeExporter commented 9 years ago
Disturbingly, I can't find my work on it anywhere.

Rough thoughts were:  DisjointSet<E> interface containing the read-only 
methods.  One to view as a Set<Set<E>>, one to view the union Set<E>, one to 
get the Set<E> for a given E, one to view the entire thing as an Equivalence<E>.

Mutable implementation, with or without specialized mutable interface. Method 
to declare e1 and e2 to be equivalent.  This is where the traditional 
union-find algorithm ends up.  Method to remove an E?

Immutable implementation.  Presumably backed by an ImmutableMap<E, 
ImmutableSet<E>>.

Various other methods I forgot.

Louis, if you are up for taking a crack at this, this is something I feel we 
could push through.

Original comment by kevinb@google.com on 3 Feb 2011 at 3:07

GoogleCodeExporter commented 9 years ago
Absolutely up for it -- after my paper due Friday. ;)

Original comment by wasserman.louis on 3 Feb 2011 at 3:45

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
The key thing in this structure is that the *elements* point to the set, 
instead of the opposite way which is what we typically mean when we say "a 
Set". So, please, don't subtype Set, don't try to shoehorn an iterator or a 
remove operation. The result would be negating the strengths of the structure; 
the quintessential union/find is best.

After some digging, here's an implementation of my own, some 4-5 years old.
http://code.google.com/p/flexigraph/source/browse/trunk/src/gr/forth/ics/util/Pa
rtition.java

(It seems I took the time to implement the best find() version, instead of the 
easiest, which is nice; single-pass, instead of two-pass or recursive). 

Let me make a concrete recommendation now though. I'll call the structure 
"Partition", below:

 * Don't make Partition point to a user value. This is completely unnecessary; the user will associate each value to its Partition even if Partition offers the other direction. (I made this mistake)
 * Following from the above: don't introduce a type parameter. This artificially makes it difficult to merge Partition<S> and Partition<T>. (I made this mistake too).
 * No "iterator" or anything that requires the set to point to the elements, which defeats the purpose of the structure.
 * Expose the equivalence relation only. A singleton Equivalence<Partition> is a fine way to do it. Having Equivalence<Partition<?>> would look ugly, for a reason (because the type parameter isn't needed!). Don't expose the representative of an element (btw, remember to do the self-loop trick to denote the root). Other than the equivalence, just singleton() and merge() need to be offered.
 * Finally, consider offering this method: 
public static <E> Equivalence<E> partition(Function<E, Partition> partitioner);
Which maps the equivalence relation of Partition to an equivalence relation of 
referrers of Partition.

I don't care too much about the "Partition" name, but I do care about the rest. 
Make a guy with a weak spot for graph algorithms happy - otherwise I'll go and 
whine to Robert Tarjan himself. Seriously. (Ok, perhaps not that seriously, but 
seriously, I'll be sad, this is a gem of a structure, lets preserve that).

Original comment by jim.andreou on 3 Feb 2011 at 6:20

GoogleCodeExporter commented 9 years ago
Ok, let's make sure we sort out what to do before you dig into this, Louis.

Original comment by kevinb@google.com on 3 Feb 2011 at 2:02

GoogleCodeExporter commented 9 years ago
I swear, I have a weaker spot for graph algorithms than you, and my biggest 
soft spot of all is for data structure gems!

I like some ideas from your approach, Jim, but I think it can be made 
considerably simpler.  Specifically, I think the attached is pretty much your 
approach taken to its logical conclusion -- to remove the elements from the 
picture entirely, and treat the partitions as objects in their own right.  The 
effect of p1.combine(p2) is to guarantee that p1.equals(p2) is true for all 
time, while maintaining that equals() is an equivalence relation.  Elements 
don't come into it at all.

Original comment by wasserman.louis on 5 Feb 2011 at 12:32

Attachments:

GoogleCodeExporter commented 9 years ago
I like the simplicity of equals(), heck, that's supposed to model an 
equivalence class, and Equivalences already play nice with it, so it's fine. 

About find(), I see you implemented the recursive one for simplicity (or 
laziness :)) - but we ought to provide the fastest. 

Now, about the rank... I'm on the fence. I think without the rank optimization, 
one gets polylogarithmic complexity, instead of the inverse Ackerman (together 
with path compression). The rank itself "ought" to increase the memory overhead 
(I'm certain that some people more knowledgeable than me asserted that), but 
I'm not so sure, it must be VM dependant. On the VMs I'm usually on, there 
would be no observable difference to me, due to aligning effects (the existence 
of the Partition pointer would bump the cost of the object from 8 bytes to 16 
bytes, so in this case, the second field is free). 

But without that field, it would be possible to pull off a concurrent, 
non-blocking implementation (just via compare-and-swap on the pointer), which 
would be nice, because this structure does play well with partitioning things 
to parallelize, then merge the sets to compute some global result. 

So, how about having an interface for this, making an sequential implementation 
w/ the rank, and leave the door open for a non-blocking one without the rank.

Btw, I was quick to play down the "Partition" name, but I think it's quite good 
(the other good names would be "Class" and "Set", obviously poor choices in 
this context :)).

(PS: I'm _not_ going to argue with you about who is more addicted to graph 
algorithms :P )

Original comment by jim.andreou on 5 Feb 2011 at 1:01

GoogleCodeExporter commented 9 years ago
Well, yes, I was going for a quick-and-dirty implementation, so I was somewhat 
lazy.

I'm inclined to do the sequential implementation for the moment, with the 
lovely inverse-Ackermann complexity.

Original comment by wasserman.louis on 5 Feb 2011 at 1:08

GoogleCodeExporter commented 9 years ago
I like the sound of the (very different from what I had been thinking!) 
direction you guys are taking this.

Original comment by kevinb@google.com on 5 Feb 2011 at 2:07

GoogleCodeExporter commented 9 years ago
Oh, yeah.  I went through several drafts of my first response that were along 
the lines of "You're on crack!  What a ridiculous data structure!"  Then it was 
like, oh man, that's actually not half bad.  Then it was like, but this could 
be so much simpler!  Bam!  Bam!  .equals()!

So yeah.

Original comment by wasserman.louis on 5 Feb 2011 at 5:06

GoogleCodeExporter commented 9 years ago
Oh boy. How old is this guy? :)

Original comment by jim.andreou on 5 Feb 2011 at 5:19

GoogleCodeExporter commented 9 years ago
More seriously, let me back off that interface idea. Rethinking it, that would 
create ugly problems, like sequential.{merge,equals}(nonSequential); // oops
So lets just have a single concrete class; if we need the other, we create a 
second, unrelated class. 

Original comment by jim.andreou on 5 Feb 2011 at 5:24

GoogleCodeExporter commented 9 years ago
I think that's definitely the way to go, and I think a sequential version is 
the right default.

Original comment by wasserman.louis on 5 Feb 2011 at 5:50

GoogleCodeExporter commented 9 years ago
Just so I don't have this in my mind: Louis? Is this something we should expect 
you give a full contribution of? (I mean, the final product, optimized, 
documented -especially hashCode() and equals()-, some tests). Is it something 
you're already working on? 

Original comment by jim.andreou on 14 Feb 2011 at 6:31

GoogleCodeExporter commented 9 years ago
Yeah, I'm doing the whole shebang.

Original comment by wasserman.louis on 14 Feb 2011 at 6:39

GoogleCodeExporter commented 9 years ago
Ok, great, thanks!

Original comment by jim.andreou on 14 Feb 2011 at 6:43

GoogleCodeExporter commented 9 years ago
Observation: I think this should go in base, not collect, since it's really not 
a collection in any way.

Original comment by wasserman.louis on 14 Feb 2011 at 9:44

GoogleCodeExporter commented 9 years ago
I'm late to this discussion, so please forgive me. If I'm grokking the gist of 
the approach (referring to Wasserman's posted impl):
- the user is responsible for associating user-provided objects with Partition 
instances
- Partitions never disappear, but are merged as "flatly" as possible

I really like the simplicity, but I have some questions about how one would use 
it.

Wouldn't algorithms which start with a state of one object per partition be 
somewhat inefficient? By that, I mean that the user of the library will be 
keeping around a HashMap< E, Partition >, and there will permanently be just as 
many Partitions as there are elements. Although I suppose if the user didn't do 
it, then the Partition impl would have to. So maybe this is just shifting the 
burden, with the possibility that the user won't care and it wouldn't be needed 
at all.

Having run a connected components algorithm, I can easily determine whether any 
two vertices are in the same connected component (via the Equivalence). But how 
can I determine how many components there are? Or, how can I list the vertices 
in each component? Do you need to expose the find() method to allow answering 
questions like this, or is the intent that the user should be keeping track of 
this kind of information? Decrementing the number of components every time a 
combine() is successful is easy enough, and Sets.union() or similar handles the 
latter question (with the user keeping track of all the Sets), but this does 
seem like exactly the kind of thing someone might want to use Partition for.

Original comment by ray.a.co...@gmail.com on 4 May 2011 at 12:10

GoogleCodeExporter commented 9 years ago
"- the user is responsible for associating user-provided objects with Partition 
instances"
Yes. Note that a more efficient alternative to Map<E, Partition>, is a 
Partition field in E itself. 

"- Partitions never disappear, but are merged as "flatly" as possible"
They disappear at the discretion of the garbage collector :)

One object per partition is the minimum you can get.

"But how can I determine how many components there are?"
A good start would be Multimaps#index(Iterable<E>, Function<E, Partition>), 
which gives a Multimap<Partition, E>. But to turn it into a more useful 
Multimap<E, E>, one would require the inverse relation, Map<Partition, E>, 
which leads to rather verbose code (trust me, I tried it).

My original (internal) proposal included a "classify" method which for the same 
arguments returned Multimap<E, E> instead, directly. It's still up in the air 
(seconds ago I re-suggested that, with the return type being Map<E, Set<E>>, 
lets see what happens to that). That would answer all your following questions; 
exposing find() is still unnecessary.

Original comment by jim.andreou on 4 May 2011 at 12:28

GoogleCodeExporter commented 9 years ago
Maybe a contrived example would help, and you can tell me where my thinking has 
gone wrong. I start out with a big graph...

  Map< Node, Partition > map = Maps.newHashMap();
  for( node : nodes ) {
    map.put( node, Partition.newPartition() );
  }

Now, inelegant brute-force connected components:

  int size = nodes.size();
  for( edge : edges ) {
    Partition pTail = map.get( edge.tail );
    Partition pHead = map.get( edge.head );
    if( pTail.combine( pHead ) ) {
      size--;
    }
    if( size == 1 ) {
      break;
    }
  }

The calculation is done, and the Partition Equivalence would work. However, 
there are still 1M Partition instances, one per node. From the information I 
have, I cannot construct a Function<E, Partition> that is "normalized". So 
Multimaps.index() wouldn't be useful.

This edit reduces the number of Partitions in use by one allowing pHead to be 
gc'd, but only if pHead happens to only contain the one node.

    if( pTail.combine( pHead ) ) {
      map.put( edge.head, pTail );
      size--;
    }

As near as I can tell, the only way to keep track of this information is to 
maintain an actual Map<Partition, Set<E>> during the algorithm, in addition to 
the existing Map. A single graph data structure would be simpler, and then I 
would effectively be performing the exact same function as Partition, and I 
don't need the class at all.

What am I missing?

Original comment by ray.a.co...@gmail.com on 4 May 2011 at 6:31

GoogleCodeExporter commented 9 years ago
#index you would get a Multimap<Partition, E>. Just iterate over the
<Partition, E> entries, and nominate the key Partition as the "canonical"
one for the value E. Seems good enough.

Original comment by jim.andreou on 4 May 2011 at 6:59

GoogleCodeExporter commented 9 years ago
Not true!  After a combine() is performed, the only way to tell the
Partition instances apart is object identity, which Multimaps.index()
doesn't use.  Try it yourself: once Partitions have been combined, they're
equal as far as Object.equals() and Object.hashCode() are concerned, which
is all Multimaps.index() cares about.

Louis

Original comment by wasserman.louis on 4 May 2011 at 8:33

GoogleCodeExporter commented 9 years ago
I don't understand who you are responding to and what is "not true", could you 
clarify?

Original comment by jim.andreou on 4 May 2011 at 8:53

GoogleCodeExporter commented 9 years ago
Sorry, I tried to respond in an email instead of in the comment interface.

> The calculation is done, and the Partition Equivalence would work. However, 
there 
> are still 1M Partition instances, one per node. From the information I have, 
I 
> cannot construct a Function<E, Partition> that is "normalized". So 
Multimaps.index()
> wouldn't be useful.

Not true!  After a combine() is performed, the only way to tell the
Partition instances apart is object identity, which Multimaps.index()
doesn't use.  Try it yourself: once Partitions have been combined, they're
equal as far as Object.equals() and Object.hashCode() are concerned, which
is all Multimaps.index() cares about.

Louis

Original comment by wasserman.louis on 4 May 2011 at 9:45

GoogleCodeExporter commented 9 years ago
[slaps forehead, utters "Doh!", reaches for beer...]

Original comment by ray.a.co...@gmail.com on 6 May 2011 at 9:57

GoogleCodeExporter commented 9 years ago
@jim.andreou, does it matter if there is a race between two threads 
comparing/updating a rank?  If I understand the code correctly this would 
result in a suboptimal tree, but still a valid one.

Original comment by fin...@gmail.com on 21 May 2011 at 12:03

GoogleCodeExporter commented 9 years ago
Yes, it matters, eg two threads might try to link a partition to different 
partitions, and one would overwrite the other. I think the only way this could 
be used in a multithreaded environment would be grabbing a common lock. (Even 
locking the partitions themselves, say in Ordering.arbitrary() order, wouldn't 
help). 

Original comment by jim.andreou on 21 May 2011 at 6:20

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
As far as I can see the only way to get a *corrupted* tree (as opposed to a 
suboptimal one) is that two nodes should have a common ancestor but do not 
(e.g. because thread1 sets a.parent = b and thread2 (not seeing thread1's 
change) sets a.parent = c.

There is no danger of spurious merges, because if two partitions are *not* 
meant to be merged then the merge method never sees pointers to both at once.

All you need to ensure is that before a.merge(b) returns, a and b have a common 
ancestor (and that all threads agree on the fact.)  Anything else is just 
optimization.  I cannot see any way in which this constraint could be violated 
that would not be solved by a compareAndSet() loop.

E.g. this should work:

    public ConcurrentPartition merge(ConcurrentPartition other) {
        ConcurrentPartition root1, root2;
        while ((root1 = other.find()) != (root2 = find())) {
            if (root1.rank < root2.rank) {
                if (root1.parentRef.compareAndSet(root1, root2)) {
                    return root2;
                }
            } else if (root1.rank > root2.rank) {
                if (root2.parentRef.compareAndSet(root2, root1)) {
                    return root1;
                }
            } else {
                if (root2.parentRef.compareAndSet(root2, root1)) {
                    root1.rank++;
                    return root1;
                }
            }
        }
        return root2;
    }

The worst that could happen is that each thread only performs only one call to 
merge() and no thread sees any other thread's updates to rank (easily simulated 
by making rank ThreadLocal).  Then you end up with a singly-linked list, but 
that would still give you the right answers for equals() and hashCode() and the 
path compression would still get a chance to reduce the depth.

You might need global synchronization at a higher level, to ensure that all 
merge()s have completed before you start performing comparisons with equals() 
and hashCode(), but that does not affect the implementation of merge().

Feel free to prove me wrong.

Original comment by fin...@gmail.com on 22 May 2011 at 12:14

GoogleCodeExporter commented 9 years ago
A.rank is 1
B.rank is 0
C.rank is 1

Thread1:
merge(A, B) - computes that B.parent should point to A

Thread2:
merge(B, C) - computes that B.parent should point to C

If these ran concurrently, each thread is free to write its own opinion,  
ignoring (perhaps without even seeing, due to lack of visibility guarantees) 
the other thread's update. Eg the 'second' thread would update B thinking it 
the root, instead of updating the new root.

Original comment by jim.andreou on 22 May 2011 at 2:54

GoogleCodeExporter commented 9 years ago
No problem there with the compareAndSet implementation:

Thread1 computes that B.parent should point to A (and the previous root is B)
Thread2 computes that B.parent should point to C (and the previous root is B)

Now the following are executed in some order:

Thread1: B.parentRef.compareAndSet(B, A)
Thread2: B.parentRef.compareAndSet(B, C)

At least one will fail, and will go round the loop again & find the new root
e.g. if Thread2 was attempting merge(B, C) it will now attempt merge(A, C) 
instead.
The result will be
  A
 / \
B   C
which is correct.

What appears to be more problematic is:
A.rank == B.rank

Thread1: merge(A, B)
Thread2: merge(B, A)

Thread 1 determines that A.parent should point to B, while thread 2 determines 
that B.parent should point to A.

Now both threads perform their updates, creating a cycle (compareAndSet does 
not prevent this, since the pointers are not synchronized with each other.)

But even then, the find() loop terminates (thanks to the compression step.)  I 
have tested this with random interleaving (i.e. inserting sleeps with random 
delays) and still always got the right answer so far.

Original comment by fin...@gmail.com on 22 May 2011 at 3:19

GoogleCodeExporter commented 9 years ago
OK I found a way to break that method.

First create a cycle [A->B->C->D->A], e.g. by calling {A.merge(B); B.merge(C); 
C.merge(D); D.merge(A)} in parallel.

Now call A.find() in thread1 and D.find() in thread2.
Here is one possible execution order:

- thread1 observes A->B->C
- thread2 observes D->A->B
- thread1 sets A.parent = C
- thread1 observes B->C->D
- thread1 sets B.parent = D
- thread1 observes C->D->A
- thread1 sets C.parent = A
- thread2 sets D.parent = B
- thread2 observes A->C->A
- thread2 sets A.parent = A and returns A
- thread1 observes D->B->D
- thread1 sets D.parent = D and returns D

Now the cycle has been split into two (C->A and B->D)

Original comment by fin...@gmail.com on 22 May 2011 at 9:22

GoogleCodeExporter commented 9 years ago
By the way, I'd like to share a curious usability issue of this class (well, 
not exactly of *this* class, as you'll see, but a wart nevertheless).

If a user object doesn't share the same lifecycle as its Partition object, then 
the user has to fallback into using a Map<E, Partition>. Now, if the user can 
iterate over all the elements beforehand, he can eagerly initialize that Map 
with Partition objects. But it is seems common to have a list of edges (element 
pairs) instead to be merged, so the user that ends up with Map<E, Partition>, 
will probably end up with lazily initializing that map too. The problem is we 
don't have a concise way to have a default-value'd map - 
MapMaker#makeComputingMap easily takes 5-6 lines. And the return type is not 
that satisfying: ConcurrentMap<E, Partition>. I played with the idea of adding 
a method "newPartitionMap()", returning a (computing) Map<E, Partition>, 
internally using the MapMaker with concurrencyLevel==1. But Map<E, Partition> 
is a poor type for a computing map. Function<E, Partition> (Kevin's suggestion) 
is more correct, in that it doesn't expose a surprizing Map, but then if you 
want to iterate in the end your E's (e.g. to see which partition they ended up 
into), code using a Function<E, Partition> would also be quite verbose, since 
an extra Set<E> would be needed.

I wonder what the community thinks of this issue.

Original comment by jim.andreou on 22 May 2011 at 9:38

GoogleCodeExporter commented 9 years ago
I'm leaning towards being conservative here: I think almost all of this 
discussion is getting ahead of itself.  I think the current version of 
Partition fulfills most users' needs -- I've actually used this version for 
several things myself without any further changes -- and I'm very strongly 
inclined to wait until actual use cases occur in the wild for concurrency 
support, lazy initialization, etc.

Original comment by wasserman.louis on 22 May 2011 at 11:56

GoogleCodeExporter commented 9 years ago
Completely agreed about concurrency. Regarding lazy initialization, well, I've 
seen it in three cases "in the wild" (of googleland), and at least two really 
required it (think an incoming stream of edges, of an unknown set of nodes - 
you don't want to buffer/iterate the edges twice to discover/initialize the 
nodes). 

Some people don't like the verbosity of user code doing a lazy initializing 
Map<E, Partition> either manually or by a computing map. Hard to satisfy this 
crowd. In my opinion, judging Partition negatively on this respect is rather 
unfair, since it isn't its job to offer a proper lazy-initializing map (imagine 
if any class that might be used together with a lazy-initializing map, was 
hard-coding one of these internally to shave few lines of user code - that 
would be solving the same problem again and again, all over the place, instead 
of once). Yet, there is some other internal union/find implementation that does 
just that (basically something like a Partition<E> together with a lazy 
initializing LinkedHashMap<E, Partition<E>>), and people prefer the resulting 
user code, effectively rewarding a bad data structure design choice :) such is 
life some times...

Original comment by jim.andreou on 23 May 2011 at 12:21

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 13 Jul 2011 at 6:18

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 16 Jul 2011 at 8:32

GoogleCodeExporter commented 9 years ago
We never got around to this.  Should it be revived?

Original comment by wasserman.louis on 20 Sep 2011 at 6:50

GoogleCodeExporter commented 9 years ago
Oh. I had a dormant change list, forgot about it. The main problem in user code 
is not related to Partition - it's that it's just Too Hard (TM) to make a 
default value map in java (even with guava, unless that default value happens 
to be a collection). 

To take some verbosity out, basically, I meant to introduce a 
Partition#newPartitionMap(), returning a Map<E, Partition>, where the user 
wouldn't do the little "is the key in the map? if not, add this entry" dance. 

Kevin had proposed a Function<E, Partition> instead of default-valued map, but 
this wouldn't do since a common operation turned out to be accessing the 
keySet() of said map.

It was a complicated discussion, taking more time than it was worth it, so I 
put it in the backburner.

Original comment by andr...@google.com on 20 Sep 2011 at 6:57

GoogleCodeExporter commented 9 years ago
I actually have some thoughts on this I will try to pull together soon.

Original comment by kevinb@google.com on 21 Sep 2011 at 1:35

GoogleCodeExporter commented 9 years ago

Original comment by fry@google.com on 10 Dec 2011 at 4:01

GoogleCodeExporter commented 9 years ago
Proposal:

Introduce a Partitioning<E> class that manages the "default-valued map" details 
internally.  If we want to support key-set iteration, we can do that, but 
manage it internally.

Partitioning<E> supports Partition getPartition(E e), which returns an object 
of the Partition type we'd been playing with earlier.  Additionally supports 
inSamePart(E, e), and all of that.

For the moment, Partitioning locks the whole thing when doing any mutations.  
If we design a more effective concurrent version of the structure later, then 
that's great, but we guarantee thread-safety now (with a qualifier about 
performance) and try to optimize later.

Original comment by wasserman.louis on 17 Jan 2012 at 3:54

GoogleCodeExporter commented 9 years ago
For reference: Wikipedia links to 
http://citeseer.ist.psu.edu/anderson94waitfree.html, "Wait-free Parallel 
Algorithms for the Union Find Problem."  It sounds, well, pretty much exactly 
like what we wanted?

Original comment by wasserman.louis on 17 Jan 2012 at 4:34

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 30 May 2012 at 7:43