Closed mschauer closed 1 year ago
It appears this issue is not as bad as I first thought.
In principle, the problem occurs if we have a vertex $x$ with a lot of non-compelled incoming edges and a lot of children $y$ (say each with in-degree 1, that is only having $x$ as parent).
Assume $x$ has $k$ non-compelled incoming edges and there are $l$ children of $x$. Then, we have a cost of $k \cdot l$ and only $k + l$ edges. Thus, its not clear the run-time is bounded by the number of edges anymore.
The thing is to get $k$ non-compelled incoming edges, those $k$ parent vertices have to form a clique. So we get another $O(k^2)$ edges and the above worst-case of doesn't seem to occur.
However, if you choose $k = O(\sqrt{l})$, we have $k\cdot k + k + l = O(l)$ edges, but run-time $k\cdot l = O(l \cdot \sqrt{l})$. So, the run-time can be $O(m \cdot \sqrt{m})$ for some instances, but I guess it can't be much worse than that.
function gen_graph(k)
g = SimpleDiGraph(k*k + k + 1)
for i = 1:k, j = (i+1):k
add_edge!(g, i, j)
end
for i = 1:k
add_edge!(g, i, k+1)
end
for i = 1:k*k
add_edge!(g, k+1, i+k+1)
end
return g
end
Called for $k=1000$ it seems noticably slower than an implementation, which only iterates over compelled parents (e.g. alt_cpdag(g) here: https://github.com/mwien/CausalInference.jl/blob/idalgorithms/src/cpdag.jl).
There is another catch though: the real problem is topological_sort_by_dfs
from the Julia Graphs Package. Unfortunately, the run-time of this function can be $O(n^2)$ even for sparse graphs. So I had to first replace this with a quick implementation of my own (https://github.com/mwien/CausalInference.jl/commit/e5ee3b033add48b32207236be857e32260a79a30).
Is this a problem with topological_sort_by_dfs
or our use case, in other words should we open an issue on Graphs.jl
?
We should open an issue. I'm trying to pinpoint the exact problem at the moment.
function gen_graph_dfs(k)
g = SimpleDiGraph(2*k+1)
for i = 1:k
add_edge!(g, i, 2*k+1)
add_edge!(g, 2*k+1, k+i)
end
return g
end
This is an even simpler example. E.g. if you run with k = 10^5 (that will give a 2*10^5+1 node graph) and then call topological_sort_by_dfs
it already takes a few seconds.
I think I got it now. This is the simplest possible example I think is:
function gen_graph_dfs(k)
g = SimpleDiGraph(k+1)
for i = 1:k
add_edge!(g, 1, i+1)
end
return g
end
The problem is that this loop (https://github.com/JuliaGraphs/Graphs.jl/blob/e5405d105e034ccb9396d30f86c26641ce94c504/src/traversals/dfs.jl#L97) goes through the neighbors of u over and over everytime starting at the first.
So it will find n with vcolor[n] == 0 and break, then (with some further steps vcolor[n] will be set to 2). Afterwards, again the search over the neighbors of u starts at the first one until it finds one with vcolor 0 and breaks etc.
Hence, for a vertex with $k$ neighbors its $k^2$ and hence in total it can be $O(m^2)$ like for the graph above...
So, in summary topological_sort_by_dfs
can be $O(m^2)$, whereas the function cpdag
is "only" $O(m\cdot\sqrt{m})$ in the worst-case (assuming an $O(m)$ implementation of topological sort).
Do you want to open the issue or should I do that?
I'm not so experienced, would be glad to get some help or have you open the issue directly. What label should the issue have? "bug" or something else?
Write something "Possible issue with topological_sort_by_dfs" and explain what you saw and then they can figure out if
Looks like the topological_sort_by_dfs can have a run-time O(n*n) even for sparse graphs. This is the simplest possible example is
function gen_graph_dfs(k)
g = SimpleDiGraph(k+1)
for i = 1:k
add_edge!(g, 1, i+1)
end
return g
end
The problem is that this loop (https://github.com/JuliaGraphs/Graphs.jl/blob/e5405d105e034ccb9396d30f86c26641ce94c504/src/traversals/dfs.jl#L97) goes through the neighbors of u over and over everytime starting at the first.
So it will find n
with vcolor[n] == 0
and break, then (with some further steps vcolor[n]
will be set to 2). Afterwards, again the search over the neighbors of u starts at the first one until it finds one with vcolor 0 and breaks etc.
Hence, for a vertex with $k$ neighbors its $k^2$ and hence in total it can be $O(m^2)$ like for the graph above...
Thanks, I submitted here: https://github.com/JuliaGraphs/Graphs.jl/issues/263
Appreciated!
Check if iteration of
instead of only iterating over the compelled incoming edges of x could lead to worse asymptotic run-time than O(n+m). @mwien
https://github.com/mschauer/CausalInference.jl/blob/57c716fd5ce84137efb8dd147c698dd5646e91e7/src/cpdag.jl#L110C1-L110C1