sagemath / sage

Main repository of SageMath
https://www.sagemath.org

equality of graphs is 60 times slower than equality of their list of edges #30641

Open videlec opened 4 years ago

videlec commented 4 years ago

On sage 9.1

sage: G = Graph(loops=False, multiedges=False)
sage: G.add_edges([(i, (i+85)%100) for i in range(100)])
sage: G.add_edges([(i, (i+37)%100) for i in range(100)])
sage: G.add_edges([(i, (i+85)%100) for i in range(100)])
sage: H = G.copy()
sage: %timeit G == H
196 µs ± 708 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
sage: E = list(G.edges())
sage: F = list(H.edges())
sage: %timeit E == F
3.3 µs ± 5.33 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Using immutable graphs is better by a factor of 2 (and hence "only" 30x slower than list comparisons)

sage: E = [(i, (i+85)%100) for i in range(100)] + \
....:     [(i, (i+37)%100) for i in range(100)] + \
....:     [(i, (i+85)%100) for i in range(100)]
sage: G = Graph(E, loops=False, multiedges=False, immutable=True)
sage: H = G.copy()
sage: %timeit G == H
111 µs ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Depends on #30645, #30665, #30776

CC: @kliem @dimpase @dcoudert

Component: graph theory

Keywords: thursdaysbdx

Issue created by migration from https://trac.sagemath.org/ticket/30641

videlec commented 4 years ago

Changed keywords from none to thursdaysbdx

dcoudert commented 4 years ago
comment:2

So far, G.edges() returns a sorted list of edges, so the test E == F is simply a lexicographic comparison of lists, and the timing omits the cost of extracting and sorting the edge lists. The reported timing comparison is therefore not completely fair.

Furthermore, since the switch to Python 3, we can no longer rely on sorted lists of edges (vertices and edge labels may be of different types, leading to an error when trying to sort the list). One objective is to deprecate this behavior.
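
For instance, with mixed vertex types the underlying sort already fails in plain Python 3 (a minimal illustration, not taken from the ticket):

sage: sorted([(0, 1), ('a', 2)])
Traceback (most recent call last):
...
TypeError: '<' not supported between instances of 'str' and 'int'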

I think that the only way to speed up the test G == H is to speed up the test other.has_edge(...).
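
For context, here is a rough Python-level sketch of the kind of work G == H has to do, assuming the check proceeds as described above (this is not the actual Sage implementation, and labels of multi-edges are simplified):

def graph_eq_sketch(self, other):
    # compare settings
    if self.allows_loops() != other.allows_loops():
        return False
    if self.allows_multiple_edges() != other.allows_multiple_edges():
        return False
    # compare vertex sets and edge counts
    if set(self) != set(other) or self.size() != other.size():
        return False
    # the dominant cost: one has_edge lookup per edge
    return all(other.has_edge(u, v, l) for u, v, l in self.edge_iterator())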

videlec commented 4 years ago
comment:3

The reason I created this ticket is that I have code that creates a big list of graphs from which I want to remove duplicates. Currently, the program spends more than 50% of its time in graph equality, which is ridiculous.
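
For concreteness, the deduplication pattern in question looks roughly like this (a sketch; all_graphs is hypothetical), and it performs a quadratic number of G == H tests:

unique = []
for G in all_graphs:
    # keep G only if it differs from every graph kept so far
    if all(G != H for H in unique):
        unique.append(G)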

dcoudert commented 4 years ago
comment:4

I agree. I recently opened #30510 to speed up the method subgraph, which was ridiculously slow. It's better now.

Here, I don't know how to speed up the method without going deep into the backends...

videlec commented 4 years ago
comment:5

I am fine with "going deep into the backends". My graphs have vertices {0, 1, ..., n-1} and there are no multiple edges. The weights are integers (but I don't think this is taken into account in the backend). The comparison of such graphs should be faster than the generic Python list comparison.

kliem commented 4 years ago
comment:6

The edge iterator alone is already pretty slow: it takes 76 µs out of the 175 for me.
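
The iterator cost can be measured on its own with something like the following one-liner (exact numbers will of course vary by machine and version):

sage: %timeit list(G.edge_iterator())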

dcoudert commented 4 years ago
comment:7

For your code, it could be better to work with immutable graphs.

sage: E = [(i, (i+85)%100) for i in range(100)] + [(i, (i+37)%100) for i in range(100)] + [(i, (i+85)%100) for i in range(100)]
sage: G = Graph([range(100), E], format='vertices_and_edges', loops=False, multiedges=False, immutable=True)
sage: H = G.copy()
sage: %timeit G == H
131 µs ± 1.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
sage: G = Graph(loops=False, multiedges=False)
sage: G.add_edges([(i, (i+85)%100) for i in range(100)])
sage: G.add_edges([(i, (i+37)%100) for i in range(100)])
sage: G.add_edges([(i, (i+85)%100) for i in range(100)])
sage: H = G.copy()
sage: %timeit G == H
241 µs ± 8.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

But of course it would be better to improve the backends to get a faster edge iterator, faster edge membership tests, a faster equality test, etc.

videlec commented 4 years ago

Description changed:


kliem commented 4 years ago
comment:9

I wrote a quick and dirty solution that compares two vertices at a time and counts how much their out arcs differ, using out_arc_unsafe. This takes about 34 µs instead of 175.
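
A Python-level analogue of that idea, for illustration only (the actual experiment works in Cython on the backend via out_arc_unsafe, and labels/multi-edges are ignored here):

def equal_by_out_arcs(G, H):
    # same vertex sets, then compare the out-neighborhood of each vertex
    if set(G) != set(H):
        return False
    return all(set(G.neighbor_iterator(v)) == set(H.neighbor_iterator(v))
               for v in G)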

kliem commented 4 years ago
comment:10

I will start with a tiny ticket that does some trivial optimizations for has_edge.

kliem commented 4 years ago

Dependencies: #30645, #30665

kliem commented 4 years ago
comment:12

#30645 and #30665 are a good start, I would say.

Things are set up in a way that we can implement an optimized containment check in CGraph.

videlec commented 4 years ago
comment:13

Nice. Thanks Jonathan for your efforts!

I also wonder whether we could implement a comparison that bypasses the translation between "vertex labels" and "integers in {0, ..., n-1}", namely a comparison of the internal representations (assuming the backends are the same).

dcoudert commented 4 years ago
comment:14

Does it make sense? Currently, we check that the graphs have the same settings and the same sets of vertices and edges. However, the internal representations might differ a lot. Suppose one graph G is the result of many additions/deletions of vertices or edges: typically, at some point it has 100000 vertices, but at the end of the sequence of operations only 2 remain. The internal bitsets/data structures might still be very large. Now, if H is a copy of G, it has a very compact internal representation. The two graphs are equal, but have different internal representations.
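
A hedged illustration of that scenario (the remark about internal capacity is an assumption about the backend, not something verified here):

sage: G = Graph()
sage: G.add_vertices(range(100000))
sage: G.delete_vertices(range(2, 100000))
sage: H = G.copy()
sage: G == H    # equal as graphs
True
sage: # yet G's backend may still reserve storage sized for 100000 vertices,
sage: # while H was allocated for just the 2 remaining ones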

videlec commented 4 years ago
comment:15

That is one use case. Here is another. I am considering all trivalent graphs on n vertices up to isomorphism. I want to construct the flip graph where I put an edge between two trivalent graphs G1 and G2 if they differ by a Whitehead move:

 A          C                      A          C
  \        /                        \        /
   \      /                          \      /
    u -- v          ----------->      u -- v
   /      \                          /      \
  /        \                        /        \
 B          D                      D          B

In order to generate this flip graph, one has to be able to identify the graph obtained after a move. Since canonical labels are computed, we need to compare graphs on {0, ..., n-1}, no surprise there... and these are a lot of comparisons!
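
Schematically, the generation loop looks something like this (candidates and whitehead_moves are hypothetical helpers, not from the ticket), which shows where all those comparisons come from:

flip_edges = []
for G1 in candidates:
    for G2 in whitehead_moves(G1):
        C = G2.canonical_label()
        # identifying C among the candidates costs one G == H test per candidate
        for G3 in candidates:
            if C == G3:
                flip_edges.append((G1, G3))
                break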

dcoudert commented 4 years ago
comment:16

I see, not easy to do it efficiently.

kliem commented 4 years ago
comment:17

To check whether one CGraph is contained in another, one can store a translation array at the beginning. If you can guarantee that the vertices match, you can of course skip this step.

After this initial step, it is pretty much the same as #30665, except that you check has_arc_unsafe on other instead of yielding. Things are a bit more complicated for multiple edges.

If you can guarantee the vertices to match, things should be really fast for dense graphs, but of course only if the graph is somewhat dense (e.g. if n <= 64). There you can just compare two vertices immediately.
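
A toy illustration of the n <= 64 case, with Python integers standing in for the machine words: each adjacency row fits in one word, so comparing two dense graphs on vertices 0, ..., n-1 reduces to comparing n integers.

def adjacency_words(G, n):
    # bit u of word v is set iff uv is an edge; assumes vertices are exactly 0..n-1
    return [sum(1 << u for u in G.neighbor_iterator(v)) for v in range(n)]

def dense_equal_sketch(G, H, n):
    return adjacency_words(G, n) == adjacency_words(H, n)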

kliem commented 4 years ago

Changed dependencies from #30645, #30665 to #30645, #30665, #30776

kliem commented 4 years ago
comment:18

Things are improving, I think.

IMO, what would still improve things here:

So instead of just CGraphBackend, we would have CGraphBackend, CGraphVertices, and CGraphRangeVertices (not good names yet). DenseGraphBackend as we know it would have to inherit from CGraphBackend and CGraphVertices, and DenseGraphBackendRangeVertices would have to inherit from CGraphBackend and CGraphRangeVertices. Likewise for sparse and static sparse.

Does this sound plausible?

Does anyone have good suggestions for a naming scheme?
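
A rough Python-level sketch of the proposed layout (the class names are the placeholders from the comment above, not existing Sage classes in this form):

class CGraphBackend: ...             # common backend machinery
class CGraphVertices: ...            # arbitrary hashable vertex labels
class CGraphRangeVertices: ...       # vertices are exactly 0, ..., n-1

class DenseGraphBackend(CGraphBackend, CGraphVertices): ...
class DenseGraphBackendRangeVertices(CGraphBackend, CGraphRangeVertices): ...
# likewise for the sparse and static sparse backends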

kliem commented 4 years ago
comment:19

By the way, dense graphs should really use bitsets; I don't understand why they don't. Eventually this might be sped up with intrinsics, e.g. an iterator over a bitset should be much faster with an improved leading zero count (_lzcnt_u64).
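
As a rough Python analogue of why bitset iteration can be fast, here is the standard trick for walking the set bits of a word; real code would do this on 64-bit limbs with a count-leading/trailing-zeros instruction such as the one mentioned above:

def set_bits(word):
    # yield the positions of the set bits of `word`, lowest first
    while word:
        low = word & -word              # isolate the lowest set bit
        yield low.bit_length() - 1      # its position
        word ^= low                     # clear it and continue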

dcoudert commented 4 years ago
comment:20

I don't know whether it's better to use bitsets for dense graphs. If needed, we can use src/sage/data_structures/binary_matrix.<pxi|pxd>.

seblabbe commented 4 years ago
comment:21

Replying to @kliem:

So instead of just CGraphBackend, we would have CGraphBackend, CGraphVertices, and CGraphRangeVertices (not good names yet). DenseGraphBackend as we know it would have to inherit from CGraphBackend and CGraphVertices, and DenseGraphBackendRangeVertices would have to inherit from CGraphBackend and CGraphRangeVertices. Likewise for sparse and static sparse.

Does this sound plausible?

Does anyone have good suggestions for a naming scheme?

This is how DisjointSet (the union-find data structure) is implemented in Sage:

def DisjointSet(arg):
cdef class DisjointSet_class(SageObject):
cdef class DisjointSet_of_integers(DisjointSet_class):
cdef class DisjointSet_of_hashables(DisjointSet_class):

That is, DisjointSet_of_integers directly uses the internal representation, while DisjointSet_of_hashables uses maps between the integers in range(n) and the hashable objects. See https://github.com/sagemath/sage-prod/blob/develop/src/sage/sets/disjoint_set.pyx
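
For instance (doctest-style, from memory, so treat the exact output as indicative):

sage: DisjointSet(5)          # dispatches to DisjointSet_of_integers
{{0}, {1}, {2}, {3}, {4}}
sage: DisjointSet('abcde')    # dispatches to DisjointSet_of_hashables
{{'a'}, {'b'}, {'c'}, {'d'}, {'e'}}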

I don't know whether the naming scheme _of_integers / _of_hashables that I chose at the time is good or not.

dcoudert commented 3 years ago
comment:23

Timings with 9.3.beta5 on a MacBook Air. It seems we are doing much better now!

sage: G = Graph(loops=False, multiedges=False) 
sage: G.add_edges([(i, (i+85)%100) for i in range(100)]) 
sage: G.add_edges([(i, (i+37)%100) for i in range(100)]) 
sage: G.add_edges([(i, (i+85)%100) for i in range(100)]) 
sage: H = G.copy() 
sage: %timeit G == H 
43.1 µs ± 821 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
sage:
sage: E = list(G.edges()) 
sage: F = list(H.edges()) 
sage: %timeit E == F 
12.4 µs ± 357 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
sage:
sage: E = [(i, (i+85)%100) for i in range(100)] + \ 
....:     [(i, (i+37)%100) for i in range(100)] + \ 
....:     [(i, (i+85)%100) for i in range(100)] 
sage: G = Graph(E, loops=False, multiedges=False, immutable=True) 
sage: H = G.copy() 
sage: %timeit G == H                                                                                                                                               
31 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

mkoeppe commented 3 years ago
comment:24

Setting new milestone based on a cursory review of ticket status, priority, and last modification date.