Open videlec opened 4 years ago
Changed keywords from none to thursdaysbdx
So far, G.edges() returns a sorted list of edges. So the test E == F is simply a lexicographic comparison of lists, omitting the time to extract and sort the list of edges. The reported timing comparison is therefore not completely fair.
Furthermore, since the switch to Python3, we can no longer rely on sorted list of edges (vertices and edge labels may be of different types, leading to an error when trying to sort the list). One objective is to deprecate this behavior.
I think that the only way to speed up the test G == H is to speed up the test other.has_edge(...).
The reason why I created this ticket is that I have code which creates a big list of graphs from which I want to remove duplicates. Currently, the program spends more than 50% of its time in graph equality tests, which is ridiculous.
I agree. I recently made #30510 to speed up the method subgraph, which was ridiculously slow. It's better now.
Here, I don't know how to speed up the method without going deep into the backends...
I am fine with "going deep into the backends". My graphs have vertices {0, 1, ..., n-1} and there are no multiple edges. The weights are integers (but I don't think this is taken into account in the backend). The comparison of such graphs should be faster than the generic Python list comparison.
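To make the point concrete, here is a rough pure-Python sketch (names invented here, not Sage API) of why comparing adjacency structures directly can beat extracting and sorting edge lists:

```python
# Sketch (hypothetical, not Sage code): equality via adjacency sets skips
# building and sorting an edge list entirely.
def to_adj(edges, n):
    """Build a dict mapping each vertex in range(n) to its neighbor set."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def graphs_equal(adj_g, adj_h):
    # dict-of-sets comparison: no extraction, no sorting
    return adj_g == adj_h

E = [(i, (i + 85) % 100) for i in range(100)] + \
    [(i, (i + 37) % 100) for i in range(100)]
G = to_adj(E, 100)
H = to_adj(E, 100)
```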
Already the edge iterator is pretty slow. It takes 76 µs out of the 175 for me. For your code, it could be better to work with immutable graphs.
sage: E = [(i, (i+85)%100) for i in range(100)] + [(i, (i+37)%100) for i in range(100)] + [(i, (i+85)%100) for i in range(100)]
sage: G = Graph([range(100), E], format='vertices_and_edges', loops=False, multiedges=False, immutable=True)
sage: H = G.copy()
sage: %timeit G == H
131 µs ± 1.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
sage: G = Graph(loops=False, multiedges=False)
sage: G.add_edges([(i, (i+85)%100) for i in range(100)])
sage: G.add_edges([(i, (i+37)%100) for i in range(100)])
sage: G.add_edges([(i, (i+85)%100) for i in range(100)])
sage: H = G.copy()
sage: %timeit G == H
241 µs ± 8.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
But of course it would be better to improve the backends to get faster edge iterator, edge membership tests, equality test, etc.
Description changed:

---
+++
@@ -13,3 +13,14 @@
 sage: %timeit E == F
 3.3 µs ± 5.33 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
+Using immutable graphs is better by a factor 2 (and hence "only" 30x slower than list comparisons)
+
+sage: E = [(i, (i+85)%100) for i in range(100)] + \
+....:     [(i, (i+37)%100) for i in range(100)] + \
+....:     [(i, (i+85)%100) for i in range(100)]
+sage: G = Graph(E, loops=False, multiedges=False, immutable=True)
+sage: H = G.copy()
+sage: %timeit G == H
+111 µs ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
+
I wrote a quick and dirty solution that compares two vertices at a time and counts how much the out arcs differ. Using out_arc_unsafe, this takes about 34 µs instead of 175.
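The idea, sketched in plain Python (the actual patch works on the CGraph internals via out_arc_unsafe; the names below are invented):

```python
# Pure-Python sketch of the idea: walk the vertices once and count, per
# vertex, how many out-arcs the two graphs disagree on; equality means
# zero differences.
def count_arc_differences(out_g, out_h):
    """out_g, out_h: dict vertex -> set of out-neighbors, same vertex set."""
    return sum(len(out_g[v] ^ out_h[v]) for v in out_g)

out_g = {0: {1, 2}, 1: {2}, 2: set()}
out_h = {0: {1},    1: {2}, 2: {0}}
```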
I will start with a tiny ticket that does some trivial optimizations for has_edge.
Dependencies: #30645, #30665
Things are set up in a way that we can implement an optimized containment check in CGraph.
Nice. Thanks Jonathan for your efforts!
I also wonder if we could implement a comparison that would bypass the translation between "vertex labels" and "integers in {0, ..., n-1}". Namely have a comparison of the internal representations (assuming the backend is the same).
Does it make sense? Currently, we check that the graphs have the same settings and the same sets of vertices and edges. However, the internal representations might differ a lot. Suppose one graph G is the result of many additions/deletions of vertices and edges. Typically, at some point it has 100000 vertices, but at the end of the sequence of operations only 2 remain. Internal bitsets/data structures might be very large. Now, if H is a copy of G, it has a very compact internal representation. The 2 graphs are equal, but have different internal representations.
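A toy illustration of this point (this is not Sage's actual backend, just an assumed model with Python ints standing in for bitsets):

```python
# Two semantically equal graphs whose internal storage differs, so a raw
# comparison of the internals would wrongly report them as different.
class ToyDenseGraph:
    def __init__(self, capacity):
        self.capacity = capacity           # allocated rows; may exceed need
        self.rows = [0] * capacity         # adjacency rows as int bitsets
        self.active = set()

    def add_vertex(self, v):
        self.active.add(v)

    def add_edge(self, u, v):
        self.rows[u] |= 1 << v
        self.rows[v] |= 1 << u

    def is_equal(self, other):
        # semantic equality: same active vertices, same adjacency on them
        return (self.active == other.active and
                all(self.rows[v] == other.rows[v] for v in self.active))

G = ToyDenseGraph(100000)   # once held many vertices, now only 2 remain
G.add_vertex(0); G.add_vertex(1); G.add_edge(0, 1)
H = ToyDenseGraph(2)        # compact fresh copy
H.add_vertex(0); H.add_vertex(1); H.add_edge(0, 1)
```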
That is one use case. Here is another. I am considering all trivalent graphs up to isomorphism on n vertices. I want to construct the flip graph where I put an edge between the trivalent graphs G1 and G2 if they differ by a Whitehead move:
A   C                      A   C
 \ /                        \ /
 \ /                        \ /
u -- v    ----------->    u -- v
 / \                        / \
 / \                        / \
B   D                      D   B
In order to generate this flip graph, one has to be able to identify the graph after a move. Since canonical labels are computed, we need to compare graphs on {0, ..., n-1}, with no surprises... and these are a lot of comparisons!
I see, not easy to do it efficiently.
To check whether one CGraph is contained in another, one can store a translation array at the beginning. If you can guarantee that the vertices match, you can of course skip this step. After this initial step, it is pretty much the same as #30665, except that you check has_arc_unsafe of the other graph instead of yielding. Things are a bit more complicated for multiple edges.
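A rough sketch of that translation-array idea, with invented names and plain Python dicts standing in for the CGraph internals:

```python
# Sketch (not the CGraph API): containment test using a translation
# mapping from vertices of G onto vertices of H, then per-arc membership
# checks in H (the has_arc step of the real code).
def is_contained(out_g, out_h, translate):
    """out_*: dict vertex -> set of out-neighbors; translate: G-vertex -> H-vertex."""
    for u, nbrs in out_g.items():
        tu = translate[u]
        for v in nbrs:
            if translate[v] not in out_h[tu]:   # "has_arc" check in H
                return False
    return True

out_g = {0: {1}, 1: set()}
out_h = {'a': {'b'}, 'b': {'a'}}
translate = {0: 'a', 1: 'b'}
```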
If you can guarantee the vertices to match, things should be really fast for dense graphs, but of course only if the graph is small enough (e.g. if n <= 64). Here you can just immediately compare two vertices.
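Concretely, for n <= 64 each neighborhood fits in one machine word, so comparing a vertex across two graphs is a single integer comparison. A sketch with Python ints standing in for 64-bit words (names invented):

```python
# With n <= 64, each adjacency row is one 64-bit word, so comparing the
# neighborhoods of vertex v in two graphs is a single word comparison.
def make_row(neighbors):
    row = 0
    for v in neighbors:
        row |= 1 << v
    return row

rows_g = [make_row({1, 2}), make_row({0}), make_row({0})]
rows_h = [make_row({1, 2}), make_row({0}), make_row({0})]

def same_vertex(v):
    return rows_g[v] == rows_h[v]   # one comparison, no iteration
```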
Changed dependencies from #30645, #30665 to #30645, #30665, #30776
Things are improving, I think.
IMO what would still improve things here:

- a graph backend for vertices of type int in range(n) (we could still have active vertices or not, but all vertices need to be in range(n); if you add a new vertex that is too large, you will get a memory problem, but that is your problem),
- is_subgraph and subgraph_given_vertices for dense graphs over range(n) (we need a good name for this graph backend yet); this should be much faster than the current methods.

So instead of just CGraphBackend we would have CGraphBackend, CGraphVertices and CGraphRangeVertices (not good names yet). DenseGraphBackend as we know it would have to inherit from CGraphBackend and CGraphVertices, and DenseGraphBackendRangeVertices would have to inherit from CGraphBackend and CGraphRangeVertices. Likewise for sparse and static sparse.
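To make the proposed split concrete, here is a hypothetical plain-Python sketch (the real classes would be Cython, and all method names below are invented for illustration):

```python
class CGraphBackend:
    """Core storage: arcs between internal integers 0..n-1."""
    def __init__(self, n):
        self.adj = [set() for _ in range(n)]
    def add_arc_int(self, u, v):
        self.adj[u].add(v)
    def has_arc_int(self, u, v):
        return v in self.adj[u]

class CGraphVertices:
    """Mixin: arbitrary hashable vertex labels, translated to integers."""
    def _int(self, v):
        return self._label_to_int.setdefault(v, len(self._label_to_int))
    def add_edge(self, u, v):
        self.add_arc_int(self._int(u), self._int(v))
    def has_edge(self, u, v):
        return self.has_arc_int(self._int(u), self._int(v))

class CGraphRangeVertices:
    """Mixin: vertices are exactly range(n); translation is the identity."""
    def add_edge(self, u, v):
        self.add_arc_int(u, v)
    def has_edge(self, u, v):
        return self.has_arc_int(u, v)

class DenseGraphBackend(CGraphVertices, CGraphBackend):
    def __init__(self, n):
        super().__init__(n)
        self._label_to_int = {}

class DenseGraphBackendRangeVertices(CGraphRangeVertices, CGraphBackend):
    pass
```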
Does this sound plausible?
Does anyone have good suggestions for a naming scheme?
Btw, dense graphs should really use bitsets. I don't understand why they don't. Eventually this might be sped up with intrinsics. E.g. an iterator over a bitset should be much faster with an improved leading-zero count (_lzcnt_u64).
I don't know if it's better to use bitsets for dense graphs. If so, we can use src/sage/data_structures/binary_matrix.<pxi|pxd>.
Replying to @kliem:

> So instead of just CGraphBackend we would have CGraphBackend, CGraphVertices and CGraphRangeVertices (not good names yet). DenseGraphBackend as we know it would have to inherit from CGraphBackend and CGraphVertices, and DenseGraphBackendRangeVertices would have to inherit from CGraphBackend and CGraphRangeVertices. Likewise for sparse and static sparse.
>
> Does this sound plausible?
>
> Does anyone have good suggestions for a naming scheme?
This is how DisjointSet (a union-find data structure) is implemented in Sage:

def DisjointSet(arg):
cdef class DisjointSet_class(SageObject):
cdef class DisjointSet_of_integers(DisjointSet_class):
cdef class DisjointSet_of_hashables(DisjointSet_class):

in the sense that DisjointSet_of_integers works directly on the internal representation, while DisjointSet_of_hashables uses maps between the integers in range(n) and the hashable objects. See https://github.com/sagemath/sage-prod/blob/develop/src/sage/sets/disjoint_set.pyx
I don't know if the naming scheme _of_integers and _of_hashables that I chose at the time is good or not.
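For illustration, the same pattern in miniature (a toy union-find, not Sage's implementation): the integer version works directly on range(n), and the hashable version wraps it with a translation dict, mirroring the backend split proposed above.

```python
class DisjointSetOfIntegers:
    """Union-find over range(n), operating directly on an array of parents."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, i):
        while self.parent[i] != i:
            # path halving: point i at its grandparent while climbing
            self.parent[i] = i = self.parent[self.parent[i]]
        return i
    def union(self, i, j):
        self.parent[self.find(i)] = self.find(j)

class DisjointSetOfHashables:
    """Wraps the integer version with a label -> index translation."""
    def __init__(self, elements):
        self._index = {e: k for k, e in enumerate(elements)}
        self._ds = DisjointSetOfIntegers(len(self._index))
    def union(self, a, b):
        self._ds.union(self._index[a], self._index[b])
    def same_set(self, a, b):
        return self._ds.find(self._index[a]) == self._ds.find(self._index[b])
```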
Timing with 9.3.beta5 on a MacBook Air. It seems we are much better now!
sage: G = Graph(loops=False, multiedges=False)
sage: G.add_edges([(i, (i+85)%100) for i in range(100)])
sage: G.add_edges([(i, (i+37)%100) for i in range(100)])
sage: G.add_edges([(i, (i+85)%100) for i in range(100)])
sage: H = G.copy()
sage: %timeit G == H
43.1 µs ± 821 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
sage:
sage: E = list(G.edges())
sage: F = list(H.edges())
sage: %timeit E == F
12.4 µs ± 357 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
sage:
sage: E = [(i, (i+85)%100) for i in range(100)] + \
....: [(i, (i+37)%100) for i in range(100)] + \
....: [(i, (i+85)%100) for i in range(100)]
sage: G = Graph(E, loops=False, multiedges=False, immutable=True)
sage: H = G.copy()
sage: %timeit G == H
31 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Setting new milestone based on a cursory review of ticket status, priority, and last modification date.
On sage 9.1
Using immutable graphs is better by a factor 2 (and hence "only" 30x slower than list comparisons)
Depends on #30645
Depends on #30665
Depends on #30776
CC: @kliem @dimpase @dcoudert
Component: graph theory
Keywords: thursdaysbdx
Issue created by migration from https://trac.sagemath.org/ticket/30641