sagemath / sage

Main repository of SageMath
https://www.sagemath.org
Other
1.47k stars 488 forks source link

G.triangles_count speedup #18250

Closed 6bdad4c1-1e26-4f2f-a442-a01a2292c181 closed 9 years ago

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 9 years ago

I thought that we had a decent implementation of G.triangles_count. Turns out that we did not.

Here is what the branch does:

Overall, comparing default implementations:

Before

sage: g = graphs.RandomGNP(500,.5)
sage: %timeit g.triangles_count()
1 loops, best of 3: 9.78 s per loop

After

sage: g = graphs.RandomGNP(500,.5)
sage: %timeit g.triangles_count()
10 loops, best of 3: 162 ms per loop

CC: @sagetrac-azi @videlec @dcoudert

Component: graph theory

Author: Nathann Cohen

Branch/Commit: fd88c98

Reviewer: Vincent Delecroix

Issue created by migration from https://trac.sagemath.org/ticket/18250

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 9 years ago

Commit: e780f71

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 9 years ago

New commits:

e780f71trac #18250: G.triangles_count speedup
6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 9 years ago

Description changed:

--- 
+++ 
@@ -33,11 +33,15 @@
 Before

- +sage: g = graphs.RandomGNP(500,.5) +sage: %timeit g.triangles_count() +1 loops, best of 3: 9.78 s per loop


 After

- +sage: g = graphs.RandomGNP(500,.5) +sage: %timeit g.triangles_count() +10 loops, best of 3: 162 ms per loop

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 9 years ago

Branch: public/18250

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 9 years ago

Changed commit from e780f71 to ed8bb3c

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 9 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

ed8bb3cTrac 18250: documentation
videlec commented 9 years ago
comment:3

Hello,

Small commit for documentation at public/18250.

The algorithm 'iter' returns int while dense_copy returns Integer. This would better be uniform.

You always perform a copy?! Isn't it possible to have a graph whose backend is already a static_sparse_graph or a static_dense_graph?

Vincent

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 9 years ago

Changed commit from ed8bb3c to fa6115d

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 9 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

fa6115dtrac #18250: Integer return type and avoid useless copies
6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 9 years ago
comment:5

Yooooooooooooooo,

Small commit for documentation at public/18250.

Oh, right. Thanks.

The algorithm 'iter' returns int while dense_copy returns Integer. This would better be uniform.

Done.

You always perform a copy?! Isn't it possible to have a graph whose backend is already a static_sparse_graph or a static_dense_graph?

Well, in this case copies are eally cheap, so I personally prefer a simpler code to one that avoids that copy.

Still, I implemented it. There is a 'trick' somewhere:

1) I first build a copy of the graph with G.copy(immutable=True) (which makes no copy if the graph is already immutable), then access its internal data structure

2) If I do g[0] = G.copy(immutable=True)._backend._cg.g[0] I get a segfault because G.copy() is immediately deallocated. So I first do G = G.copy(immutable=True)._backend, and then access its internal data.

Fortunately the copy G is not deleted until the end of the function. If Cython ever notices that (as far as it is concerned) it is never used again in the function, then it will be deallocated and we will get a segfault again.

Nathann

videlec commented 9 years ago
comment:6

Replying to @nathanncohen:

You always perform a copy?! Isn't it possible to have a graph whose backend is already a static_sparse_graph or a static_dense_graph?

Fortunately the copy G is not deleted until the end of the function. If Cython ever notices that (as far as it is concerned) it is never used again in the function, then it will be deallocated and we will get a segfault again.

Cython does not perform deallocation, it is Python job. Though Cython does increment/decrement the reference counter. And precisely, doing G = G.copy(immutable=True) precisely increment the reference counter!

Looks good! I am running the tests...

Vincent

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 9 years ago
comment:7

Cython does not perform deallocation, it is Python job.

Well if you prefer, but this code is translated into C, and gcc may notice that the object is totally useless. They must have done something to keep those Python variables until the end.

Nathann

videlec commented 9 years ago
comment:8

Replying to @nathanncohen:

Cython does not perform deallocation, it is Python job.

Well if you prefer, but this code is translated into C, and gcc may notice that the object is totally useless. They must have done something to keep those Python variables until the end.

Nope. gcc will not do that. An affectation G = H is not one instruction if H is a Python object. Cython translates it into something like

cdef PyObject * G = H;
Py_INCREF(G);

whick prevents deallocation of H until G is garbage collected (typically at the end of the function or until a call of del G).

Vincent

videlec commented 9 years ago

Reviewer: Vincent Delecroix

videlec commented 9 years ago
comment:9

I do not like the default choice being sparse copy. Especially if G is implemented as a static dense graph! Wouldn't it be better to have a default algorithm=None and be more clever about choosing the right algorithm?

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 9 years ago
comment:10

I don't think that anybody in Sage plays with data structure except me, in Sage's subfunctions. I can do it, but I am rather convinced that it is useless. It may be useful in one or two years if I can rewrite and clean all this stuff.

Will do.

Nathann

videlec commented 9 years ago
comment:11

Replying to @nathanncohen:

I don't think that anybody in Sage plays with data structure except me, in Sage's subfunctions. I can do it, but I am rather convinced that it is useless. It may be useful in one or two years if I can rewrite and clean all this stuff.

Ok. Let say something based on the density.

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 9 years ago
comment:12

Ok. Let say something based on the density.

O_o

Nononono. That's for the user to decide. I don't want ugly density>0.3 which may only make sense on a specific architecture, and on the kind of graphs that were used for the tests.

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 9 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

fd88c98trac #18250: algorithm=None by default and better 'wrong values' check
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 9 years ago

Changed commit from fa6115d to fd88c98

videlec commented 9 years ago
comment:14

Hello,

The method matrix does work for non simple graphs, right? (after removing the loops)

Vincent

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 9 years ago
comment:15

The method matrix does work for non simple graphs, right? (after removing the loops)

After removing loops and multiple edges, yes.

Nathann

videlec commented 9 years ago
comment:16

Replying to @nathanncohen:

The method matrix does work for non simple graphs, right? (after removing the loops)

After removing loops and multiple edges, yes.

Please, consider the following right answers

sage: G = Graph([(0,1),(0,1),(1,2),(2,0)], multiedges=True)
sage: (G.adjacency_matrix()**3).trace() / 6
2
sage: sage: G = Graph([(0,1),(0,1),(1,2), (1,2), (2,0)], multiedges=True)
sage: (G.adjacency_matrix()**3).trace()//6
4
6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 9 years ago
comment:17

What do you mean?

videlec commented 9 years ago
comment:18

Replying to @nathanncohen:

What do you mean?

That the method algorithm=matrix is valid for non simple graphs (once you removed the loops).

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 9 years ago
comment:19

Depends what you call a triangle. For me it is a set of three points adjacent to each other.

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 9 years ago
comment:20

If it is a 'closed walk of length 3' then you should keep the loops inside.

videlec commented 9 years ago
comment:21

Replying to @nathanncohen:

If it is a 'closed walk of length 3' then you should keep the loops inside.

It is not. It is C3: three vertices and three edges! The edges are part of the data.

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 9 years ago
comment:22

Ahahah. So if you have three vertices and 4 edges in between, then it is not a triangle? No problem, yet another definition

Nathann

videlec commented 9 years ago
comment:23

Replying to @nathanncohen:

Ahahah. So if you have three vertices and 4 edges in between, then it is not a triangle? No problem, yet another definition

No. It is not a triangle. But it does contain many triangles. It is very much like on the sphere. For each (non degenerate) triple of points you have 4 triangles whose vertices are exactly these points.

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 9 years ago
comment:24

See, I am extremely aware of this kind of problems. There is one thousand different ways of defining stuff, and everybody has his own notion of it, depending on where you come from. I'm pretty sure that guys from word theory/automata would think that a triangle is just "a path of length three that leads you back to where you started" (that corresponds to the 'closed walk' I mentionned above).

I don't want people to believe stuff and end up with something else than what they had in mind. If two different persons can expect different result from the same function, I want either none of them or a way to force them to tell me what exactly they expect.

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 9 years ago
comment:25

Oh, and then you will also have the guys who have a weight on their edges and want to count the number of weighted triangles.

videlec commented 9 years ago
comment:26

Replying to @nathanncohen:

Oh, and then you will also have the guys who have a weight on their edges and want to count the number of weighted triangles.

This is what I am thinking about when I consider a multi graph, it is just a graph with (positive) integer labels. The weight of a triangle is the product of the labels on the three edges. And to count triangles, you just sum the weights. It is what you get from the matrix trace (when you removed the loops).

But I agree that there is one ambiguity: you might want to count induced C3.

Vincent

videlec commented 9 years ago
comment:27

Replying to @nathanncohen:

I'm pretty sure that guys from word theory/automata would think that a triangle is just "a path of length three that leads you back to where you started" (that corresponds to the 'closed walk' I mentionned above).

Some people have wrong ideas. I know.

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 9 years ago
comment:28

Some people have wrong ideas. I know.

And in my infinite Christ-like generosity I still care for them, and they shouldn't get wrong results just because they are sinners.

Nathann

videlec commented 9 years ago
comment:29

More seriously, I would prefer if the triangles_count in static_dense_graph would only accept backend being static_dense_graph (and move the copy step inside the generic_graph). Would make more sense. But that's very minor. I just want to know your feelings about how you would handle specificities of the backends in the future.

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 9 years ago
comment:30

More seriously, I would prefer if the triangles_count in static_dense_graph would only accept backend being static_dense_graph (and move the copy step inside the generic_graph)

Well.. generic_graph.py cannot handle C types so it would have required to move the function to generic_graph_pyx.pyx (what a mess >_<), but I think that I was also influenced by the function is_strongly_regular immediately above.

Also, at first this function was building a copy of the graph every time, because I really did not think that it mattered. There are many functions like that in distance_all_pairs too. It contains every function which "relies heavily on the distance_all_pairs functions", and most of them accept Sage graphs directly (though some of them have pure C couterparts).

I just want to know your feelings about how you would handle specificities of the backends in the future.

Can you tell me what you mean by "specificities of the backends"?

Nathann

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 9 years ago
comment:31

(Thanks for the review btw!)

6bdad4c1-1e26-4f2f-a442-a01a2292c181 commented 9 years ago
comment:32

Can you tell me what you mean by "specificities of the backends"?

If you actually mean "specificities of the backens" (and not about the Multi,looped,edge-labelled) graph), then what I want to do is merely to make it simple to switch from one to the other, for the user or for the coders. I am trying to clean the constructor first (#18185 in needs_review), then I will try to make it a bit faster (creating immutable copies is unnecessarily complicated), to distribute them a bit (into subfunctions).

The main problem I haven't solved is this: what should the default backend for Sage graphs be?

For small graphs I expect that the current one is right. But being able to run "has_edge" in log(n) time, or being able to remove edges has a huge cost. If we just want to add edges and vertices (i.e. to build graphs, from an algorithm or from a file) then it can be MUCH cheaper. For analysis of large networks it is something that you cannot afford to pay, and then this backend must be different.

So if we want to make no choice at all, it really has to become much simpler and natural to pick a backend for a graph.

Also, we may want to have a 'virtual' backend like in Gap for graphs defined from a group action. No graph would actually be stored, yet all algorithms should be available.

Well, that's what I think of these days.

Nathann

vbraun commented 9 years ago

Changed branch from public/18250 to fd88c98